The OBITools3 version of the wolf tutorial made for the first OBITools.
1. Import the sequencing data in a DMS
Download this archive containing the reads and the ngs file:
And unzip it:
tar -zxvf wolf_tutorial.tar.gz
-
Import the first set of reads, with :
obi import --quality-solexa wolf_tutorial/wolf_F.fastq wolf/reads1
--quality-solexa
is the appropriate fastq quality option because it's an old dataset,wolf_tutorial/wolf_F.fastq
is the path to the file to import,wolf
is the path to the DMS that will be automatically created, andreads1
is the name of the view into which the file will be imported.Note: If some sequences fail to import, you might need to use the
--input-na-string
option to define the string for NA values. -
Import the second set of reads:
obi import --quality-solexa wolf_tutorial/wolf_R.fastq wolf/reads2
-
Import the ngsfilter file describing the primers and tags used for each sample:
obi import --ngsfilter-input wolf_tutorial/wolf_diet_ngsfilter.txt wolf/ngsfile
-
Check what is in the DMS:
obi ls wolf
The option -l displays more details:
obi ls -l wolf
You can also check just one view:
obi ls wolf/reads1
Or one column:
obi ls wolf/ngsfile/sample
To print the sequences, use the less command:
obi less wolf/reads1
2. Recover the full sequences from the partial forward and reverse reads
obi alignpairedend -R wolf/reads2 wolf/reads1 wolf/aligned_reads
Note: For any command, you can also print the result to standard output using the -
URI:
obi alignpairedend -R wolf/reads2 wolf/reads1 - > aligned_reads.fastq
3. Remove sequence records with a low overlap alignment score:
We can check the average alignment score (corresponding approximately to the similarity between the sequences in the overlap):
obi stats -a score_norm wolf/aligned_reads
And keep only sequences with a high overlap alignment score:
obi grep -p "sequence['score_norm'] > 0.8" wolf/aligned_reads wolf/good_sequences
4. Assign each sequence record to the corresponding sample/marker combination
obi ngsfilter -t wolf/ngsfile -u wolf/unidentified_sequences wolf/good_sequences wolf/identified_sequences
Note: Unlike the OBITools1, the OBITools3 make it possible to run ngsfilter before aligning the paired-end reads, BUT it is not recommended to do so for usual data, as ngsfilter will not be able to detect and cut out partially sequenced primers.
5. Dereplicate reads into unique sequences
obi uniq -m sample wolf/identified_sequences wolf/dereplicated_sequences
6. Denoise the sequence dataset
-
First let's clean the useless metadata and keep only the
COUNT
andmerged_sample
(count by sample) tags:obi annotate -k COUNT -k MERGED_sample wolf/dereplicated_sequences wolf/cleaned_metadata_sequences
-
Keep only the sequences having a count greater or equal to 10 and a length shorter than 80 bp:
obi grep -p "len(sequence)>=80 and sequence['COUNT']>=10" wolf/cleaned_metadata_sequences wolf/denoised_sequences
-
Clean the sequences from PCR/sequencing errors (sequence variants):
obi clean -s MERGED_sample -r 0.05 -H wolf/denoised_sequences wolf/cleaned_sequences
7. Taxonomic assignment of the sequences
Build a reference database
Building the reference database is costly in time and disk space so you can simply download this already built one:
With the associated taxdump:
And import them (note that you could import them in another DMS):
obi import v05_refs.fasta.gz wolf/v05_refs
obi import --taxdump taxdump.tar.gz wolf/taxonomy/my_tax
You can then resume at the next part "Clean the database".
Otherwise, to build the database yourself from the start:
-
Download the sequences (except human and environmental samples):
mkdir EMBL cd EMBL wget -nH --cut-dirs=6 -A 'STD_*.dat.gz' -R 'STD_HUM*.dat.gz','STD_ENV*.dat.gz' -m -np ftp://ftp.ebi.ac.uk/pub/databases/ena/sequence/snapshot_latest/std/ cd ..
-
Import the sequences in the DMS:
obi import --embl EMBL wolf/embl_refs
For EMBL files, you can give the path to a directory with several EMBL files. Same with GenBank files.
Note: The reference database can be built in another DMS.
-
Download the taxonomy:
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
Check the md5 sum, as files downloaded from the NCBI are often corrupted on some networks. Compare https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz.md5 with:
md5sum taxdump.tar.gz
-
Import the taxonomy in the DMS:
obi import --taxdump taxdump.tar.gz wolf/taxonomy/my_tax
Note: You can use
obi addtaxids
to annotate sequences with their NCBI taxids, guessed from the taxon name. You can also useobi taxonomy
to add your own taxa to a taxonomy. -
Use ecoPCR to simulate an in silico PCR with the V05 primers:
obi ecopcr -e 3 -l 50 -L 150 -F TTAGATACCCCACTATGC -R TAGAACAGGCTCCTCTAG --taxonomy wolf/taxonomy/my_tax wolf/embl_refs wolf/v05_refs
Clean the database
-
Filter sequences so that they have a good taxonomic description at the species, genus, and family levels:
obi grep --require-rank=species --require-rank=genus --require-rank=family --taxonomy wolf/taxonomy/my_tax wolf/v05_refs wolf/v05_refs_clean
-
Dereplicate identical sequences (note: not a necessary step, avoid for big databases as long as #79 is not fixed):
obi uniq --taxonomy wolf/taxonomy/my_tax wolf/v05_refs_clean wolf/v05_refs_uniq
-
Ensure that the dereplicated sequences have a taxid at the family level (if you ran the previous step):
obi grep --require-rank=family --taxonomy wolf/taxonomy/my_tax wolf/v05_refs_uniq wolf/v05_refs_uniq_clean
-
Build the reference database specifically used by the OBITools3 to make ecotag efficient by pre-computing the similarities between the reference sequences:
obi build_ref_db -t 0.97 --taxonomy wolf/taxonomy/my_tax wolf/v05_refs_uniq_clean wolf/v05_db_97
Assign each sequence to a taxon
Once the reference database is built, taxonomic assignment can be done using the ecotag
command:
obi ecotag -m 0.97 --taxonomy wolf/taxonomy/my_tax -R wolf/v05_db_97 wolf/cleaned_sequences wolf/assigned_sequences
8. After the taxonomic assignment
Take a look at the results
For example:
obi stats -c SCIENTIFIC_NAME wolf/assigned_sequences
Align the sequences
obi align -t 0.95 wolf/assigned_sequences wolf/aligned_assigned_sequences
Check the history of everything that was done
The default history is in bash:
obi history wolf
The most visual way to look at the pipeline is:
obi history -d wolf > wolf.dot
obi history -d wolf/cleaned_sequences > wolf_one_view.dot
To look at the graph produced, open the dot file:
dot -Tx11 wolf.dot
or convert the dot file to a png image file:
dot -Tpng wolf.dot -o wolf.png
open wolf.png &
You will get something like this:
Export the results
Export in fasta format:
obi export --fasta-output wolf/assigned_sequences -o wolf_results.fasta
or:
obi export --fasta-output wolf/assigned_sequences > wolf_results.fasta
Export in csv-like format:
obi export --tab-output wolf/aligned_assigned_sequences > wolf_results.csv