Wolf tutorial with the OBITools3

The OBITools3 version of the wolf tutorial made for the first OBITools.

1. Import the sequencing data in a DMS

Download this archive containing the reads and the ngs file:

wolf_tutorial.tar.gz

And unzip it:

tar -zxvf wolf_tutorial.tar.gz

Import the first set of reads, with :
```
 obi import --quality-solexa wolf_tutorial/wolf_F.fastq wolf/reads1
```
--quality-solexa is the appropriate fastq quality option because it's an old dataset, wolf_tutorial/wolf_F.fastq is the path to the file to import, wolf is the path to the DMS that will be automatically created, and reads1 is the name of the view into which the file will be imported.

Note: If some sequences fail to import, you might need to use the --input-na-string option to define the string for NA values.

Import the second set of reads:

 obi import --quality-solexa wolf_tutorial/wolf_R.fastq wolf/reads2

Import the ngsfilter file describing the primers and tags used for each sample:

 obi import --ngsfilter-input wolf_tutorial/wolf_diet_ngsfilter.txt wolf/ngsfile

Check what is in the DMS:
```
 obi ls wolf
```
The option -l displays more details:
```
 obi ls -l wolf
```
You can also check just one view:
```
 obi ls wolf/reads1
```
Or one column:
```
 obi ls wolf/ngsfile/sample
```
To print the sequences, use the less command:
```
 obi less wolf/reads1
```

2. Recover the full sequences from the partial forward and reverse reads

obi alignpairedend -R wolf/reads2 wolf/reads1 wolf/aligned_reads

Note: For any command, you can also print the result to standard output using the - URI:

obi alignpairedend -R wolf/reads2 wolf/reads1 - > aligned_reads.fastq

3. Remove sequence records with a low overlap alignment score:

We can check the average alignment score (corresponding approximately to the similarity between the sequences in the overlap):

obi stats -a score_norm wolf/aligned_reads

And keep only sequences with a high overlap alignment score:

obi grep -p "sequence['score_norm'] > 0.8" wolf/aligned_reads wolf/good_sequences

4. Assign each sequence record to the corresponding sample/marker combination

obi ngsfilter -t wolf/ngsfile -u wolf/unidentified_sequences wolf/good_sequences wolf/identified_sequences

Note: Unlike the OBITools1, the OBITools3 make it possible to run ngsfilter before aligning the paired-end reads, BUT it is not recommended to do so for usual data, as ngsfilter will not be able to detect and cut out partially sequenced primers.

5. Dereplicate reads into unique sequences

obi uniq -m sample wolf/identified_sequences wolf/dereplicated_sequences

6. Denoise the sequence dataset

First let's clean the useless metadata and keep only the COUNT and merged_sample (count by sample) tags:

 obi annotate -k COUNT -k MERGED_sample wolf/dereplicated_sequences wolf/cleaned_metadata_sequences

Keep only the sequences having a count greater or equal to 10 and a length shorter than 80 bp:

 obi grep -p "len(sequence)>=80 and sequence['COUNT']>=10" wolf/cleaned_metadata_sequences wolf/denoised_sequences

Clean the sequences from PCR/sequencing errors (sequence variants):

 obi clean -s MERGED_sample -r 0.05 -H wolf/denoised_sequences wolf/cleaned_sequences

7. Taxonomic assignment of the sequences

Build a reference database

Building the reference database is costly in time and disk space so you can simply download this already built one:

v05_refs.fasta.gz

With the associated taxdump:

taxdump.tar.gz

And import them (note that you could import them in another DMS):

	obi import v05_refs.fasta.gz wolf/v05_refs
	obi import --taxdump taxdump.tar.gz wolf/taxonomy/my_tax

You can then resume at the next part "Clean the database".

Otherwise, to build the database yourself from the start:

Download the sequences (except human and environmental samples):

 mkdir EMBL
 cd EMBL
 wget -nH --cut-dirs=6 -A 'STD_*.dat.gz' -R 'STD_HUM*.dat.gz','STD_ENV*.dat.gz' -m -np ftp://ftp.ebi.ac.uk/pub/databases/ena/sequence/snapshot_latest/std/ 

 cd ..

Import the sequences in the DMS:
```
 obi import --embl EMBL wolf/embl_refs
```
For EMBL files, you can give the path to a directory with several EMBL files. Same with GenBank files.

Note: The reference database can be built in another DMS.
Download the taxonomy:
```
 wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
```
Check the md5 sum, as files downloaded from the NCBI are often corrupted on some networks. Compare https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz.md5 with:
```
 md5sum taxdump.tar.gz
```
Import the taxonomy in the DMS:
```
 obi import --taxdump taxdump.tar.gz wolf/taxonomy/my_tax
```
Note: You can use obi addtaxids to annotate sequences with their NCBI taxids, guessed from the taxon name. You can also use obi taxonomy to add your own taxa to a taxonomy.

Use ecoPCR to simulate an in silico PCR with the V05 primers:

 obi ecopcr -e 3 -l 50 -L 150 -F TTAGATACCCCACTATGC -R TAGAACAGGCTCCTCTAG --taxonomy wolf/taxonomy/my_tax wolf/embl_refs wolf/v05_refs

Clean the database

Filter sequences so that they have a good taxonomic description at the species, genus, and family levels:

 obi grep --require-rank=species --require-rank=genus --require-rank=family --taxonomy wolf/taxonomy/my_tax wolf/v05_refs wolf/v05_refs_clean

Dereplicate identical sequences (note: not a necessary step, avoid for big databases as long as #79 is not fixed):
```
 obi uniq --taxonomy wolf/taxonomy/my_tax wolf/v05_refs_clean wolf/v05_refs_uniq
```

Ensure that the dereplicated sequences have a taxid at the family level (if you ran the previous step):

 obi grep --require-rank=family --taxonomy wolf/taxonomy/my_tax wolf/v05_refs_uniq wolf/v05_refs_uniq_clean

Build the reference database specifically used by the OBITools3 to make ecotag efficient by pre-computing the similarities between the reference sequences:
```
 obi build_ref_db -t 0.97 --taxonomy wolf/taxonomy/my_tax wolf/v05_refs_uniq_clean wolf/v05_db_97
```

Assign each sequence to a taxon

Once the reference database is built, taxonomic assignment can be done using the ecotag command:

obi ecotag -m 0.97 --taxonomy wolf/taxonomy/my_tax -R wolf/v05_db_97 wolf/cleaned_sequences wolf/assigned_sequences

8. After the taxonomic assignment

Take a look at the results

For example:

obi stats -c SCIENTIFIC_NAME wolf/assigned_sequences

Align the sequences

obi align -t 0.95 wolf/assigned_sequences wolf/aligned_assigned_sequences

Check the history of everything that was done

The default history is in bash:

obi history wolf

The most visual way to look at the pipeline is:

obi history -d wolf > wolf.dot
obi history -d wolf/cleaned_sequences > wolf_one_view.dot

To look at the graph produced, open the dot file:

dot -Tx11 wolf.dot

or convert the dot file to a png image file:

dot -Tpng wolf.dot -o wolf.png
open wolf.png &

You will get something like this:

Export the results

Export in fasta format:

obi export --fasta-output wolf/assigned_sequences -o wolf_results.fasta

or:

obi export --fasta-output wolf/assigned_sequences > wolf_results.fasta

Export in csv-like format:

obi export --tab-output wolf/aligned_assigned_sequences > wolf_results.csv

Contact

celine.mercier@metabarcoding.org