The OBITools3 version of the [wolf tutorial](https://pythonhosted.org/OBITools/wolves.html) originally written for OBITools1.

### 1. Import the sequencing data in a DMS

Download this archive containing the reads and the ngsfilter file:

[wolf_tutorial.tar.gz](/uploads/9b86f67ad05815ddee14526640d81137/wolf_tutorial.tar.gz)

And unpack it:

    tar -zxvf wolf_tutorial.tar.gz

1. Import the first set of reads:

        obi import --quality-solexa wolf_tutorial/wolf_F.fastq wolf/reads1

   `--quality-solexa` is the appropriate FASTQ quality option because this old dataset uses Solexa-encoded quality scores. `wolf_tutorial/wolf_F.fastq` is the path of the file to import, `wolf` is the path of the DMS (created automatically if it does not exist), and `reads1` is the name of the view into which the file is imported.

2. Import the second set of reads:

        obi import --quality-solexa wolf_tutorial/wolf_R.fastq wolf/reads2

3. Import the [ngsfilter file](https://pythonhosted.org/OBITools/scripts/ngsfilter.html) describing the primers and tags used for each sample:

        obi import --ngsfilter wolf_tutorial/wolf_diet_ngsfilter.txt wolf/ngsfile

4. Check what is in the DMS:

        obi ls wolf

   The `-l` option displays more details:

        obi ls -l wolf

You can also check just one view:

    obi ls wolf/reads1

Or just one column:

    obi ls wolf/ngsfile/sample

To print the sequences, use the `less` subcommand:

    obi less wolf/reads1

### 2. Recover the full sequences from the partial forward and reverse reads

    obi alignpairedend -R wolf/reads2 wolf/reads1 wolf/aligned_reads

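Conceptually, `alignpairedend` reverse-complements each reverse read, aligns it against the forward read, and reconstructs the full amplicon from the overlap. Below is a toy Python sketch of that idea under simplifying assumptions: it only handles exact, error-free overlaps, whereas the real command performs a proper alignment using the quality scores, and the reads shown are made up.

```python
# Toy sketch of paired-end assembly: reverse-complement the reverse read,
# find the longest exact overlap with the end of the forward read, and
# join the two fragments.

def revcomp(seq):
    comp = {"a": "t", "t": "a", "g": "c", "c": "g"}
    return "".join(comp[base] for base in reversed(seq))

def merge_pair(forward, reverse):
    rev = revcomp(reverse)
    # Try overlaps from longest to shortest.
    for k in range(min(len(forward), len(rev)), 0, -1):
        if forward[-k:] == rev[:k]:
            return forward + rev[k:]
    return None  # no overlap found

# Made-up forward/reverse reads sharing a 9 bp overlap.
full = merge_pair("ttagataccccac", "ccgctagtggggtat")
print(full)  # -> ttagataccccactagcgg
```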
### 3. Remove unaligned sequence records

    obi grep -a mode:alignment wolf/aligned_reads wolf/good_sequences

### 4. Assign each sequence record to the corresponding sample/marker combination

    obi ngsfilter -t wolf/ngsfile -u wolf/unidentified_sequences wolf/good_sequences wolf/identified_sequences

Note: unlike OBITools1, OBITools3 makes it possible to run `ngsfilter` before aligning the paired-end reads, BUT this is not recommended for usual data, as `ngsfilter` would then be unable to detect and cut out partially sequenced primers.

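For reference, the ngsfilter file used here is a plain tab-separated text file with one line per PCR sample: the experiment name, the sample name, the tag (or `forward:reverse` tag pair) labelling that sample, the forward primer, the reverse primer, and optional extra information after an `F @` marker. The line below is a hypothetical example of that layout (fields are shown separated by spaces for readability, but the real file uses tabs): the experiment name, sample name, tag, and extra field are invented; only the primers are the actual V05 primers from this tutorial.

```
wolf_diet   sample_01   aattaac   TTAGATACCCCACTATGC   TAGAACAGGCTCCTCTAG   F @ position=1
```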
### 5. Dereplicate reads into unique sequences

    obi uniq -m sample wolf/identified_sequences wolf/dereplicated_sequences

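To see what dereplication with `-m sample` computes, here is a minimal Python sketch: identical sequences are collapsed into one record, and the per-sample counts end up in a `MERGED_sample`-like mapping. The reads and sample names below are made up for illustration.

```python
from collections import Counter, defaultdict

# Made-up reads tagged with the sample they were assigned to by ngsfilter.
reads = [
    ("ttagataccccactatgc", "s01"),
    ("ttagataccccactatgc", "s01"),
    ("ttagataccccactatgc", "s02"),
    ("tagaacaggctcctctag", "s02"),
]

# Collapse identical sequences, keeping a per-sample count: this mapping
# plays the role of the MERGED_sample tag produced by `obi uniq -m sample`.
merged = defaultdict(Counter)
for seq, sample in reads:
    merged[seq][sample] += 1

print(len(merged))                         # -> 2 unique sequences
print(dict(merged["ttagataccccactatgc"]))  # -> {'s01': 2, 's02': 1}
```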
### 6. Denoise the sequence dataset

1. First, clean up the unneeded metadata, keeping only the `COUNT` and `MERGED_sample` (count per sample) tags:

        obi annotate -k COUNT -k MERGED_sample wolf/dereplicated_sequences wolf/cleaned_metadata_sequences

2. Keep only the sequences with a count greater than or equal to 10 and a length of at least 80 bp:

        obi grep -p "len(sequence)>=80 and sequence['COUNT']>=10" wolf/cleaned_metadata_sequences wolf/denoised_sequences

3. Clean the sequences of PCR and sequencing errors (sequence variants):

        obi clean -s MERGED_sample -r 0.05 -H wolf/denoised_sequences wolf/cleaned_sequences

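The expression given to `obi grep -p` is an ordinary Python predicate evaluated on each sequence record, where `len(sequence)` is the sequence length and `sequence['TAG']` looks up a tag. As a plain-Python illustration of the filter in step 2, using dictionaries as stand-ins for OBITools3 records (the sequences and counts are made up):

```python
# Dictionaries standing in for OBITools3 sequence records.
records = [
    {"sequence": "ttagat" * 15, "COUNT": 42},  # 90 bp and abundant: kept
    {"sequence": "ttagat" * 10, "COUNT": 42},  # only 60 bp: dropped
    {"sequence": "ttagat" * 15, "COUNT": 3},   # too rare: dropped
]

# Same condition as `obi grep -p "len(sequence)>=80 and sequence['COUNT']>=10"`.
kept = [r for r in records
        if len(r["sequence"]) >= 80 and r["COUNT"] >= 10]

print(len(kept))  # -> 1
```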
### 7. Taxonomic assignment of the sequences

#### Build a reference database

Building the reference database is costly in time and disk space, so you can simply download this already-built one:

[v05_refs.fasta.gz](/uploads/fe71e2e103014d70a1bf9307e377ce2b/v05_refs.fasta.gz)

With the associated taxdump:

[taxdump.tar.gz](https://drive.google.com/file/d/1lMV5PWg122ZmhQtx0iq5m5iUKvVBZMUC/view?usp=sharing)

And import them (note that they could be imported into another DMS):

    obi import v05_refs.fasta.gz wolf/v05_refs

    obi import --taxdump taxdump.tar.gz wolf/taxonomy/my_tax

You can then resume at the next part, "**Clean the database**".

Otherwise, to build the database yourself from scratch:

1. Download the sequences (excluding human and environmental samples):

        mkdir EMBL
        cd EMBL
        wget -nH --cut-dirs=5 -A 'rel_std_*.dat.gz' -R 'rel_std_hum_*.dat.gz,rel_std_env_*.dat.gz' -m ftp://ftp.ebi.ac.uk/pub/databases/embl/release/std/
        cd ..

   (The accept and reject patterns are quoted so that the shell does not expand them before `wget` sees them.)

2. Import the sequences into the DMS:

        obi import --embl EMBL wolf/embl_refs

   For EMBL files, you can give the path to a directory containing several EMBL files.

   **Note:** the reference database can be built in another DMS.

3. Download the taxonomy:

        wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz

4. Import the taxonomy into the DMS:

        obi import --taxdump taxdump.tar.gz wolf/taxonomy/my_tax

5. Use ecoPCR to simulate an *in silico* PCR with the V05 primers:

        obi ecopcr -e 3 -l 50 -L 150 -F TTAGATACCCCACTATGC -R TAGAACAGGCTCCTCTAG --taxonomy wolf/taxonomy/my_tax wolf/embl_refs wolf/v05_refs

   Here `-e` sets the number of mismatches allowed per primer, and `-l` and `-L` set the minimum and maximum amplicon length.

#### Clean the database

1. Filter the sequences so that they have a good taxonomic description at the species, genus, and family levels:

        obi grep --require-rank=species --require-rank=genus --require-rank=family --taxonomy wolf/taxonomy/my_tax wolf/v05_refs wolf/v05_refs_clean

2. Dereplicate identical sequences:

        obi uniq --taxonomy wolf/taxonomy/my_tax wolf/v05_refs_clean wolf/v05_refs_uniq

3. Ensure that the dereplicated sequences have a taxid at the family level:

        obi grep --require-rank=family --taxonomy wolf/taxonomy/my_tax wolf/v05_refs_uniq wolf/v05_refs_uniq_clean

4. Build the reference database in the format specifically used by OBITools3 to make `ecotag` efficient:

        obi build_ref_db -t 0.97 --taxonomy wolf/taxonomy/my_tax wolf/v05_refs_uniq_clean wolf/v05_db_97

#### Assign each sequence to a taxon

Once the reference database is built, taxonomic assignment can be done using the `ecotag` command:

    obi ecotag -m 0.97 --taxonomy wolf/taxonomy/my_tax -R wolf/v05_db_97 wolf/cleaned_sequences wolf/assigned_sequences

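A sketch of the idea behind `ecotag`: when several reference sequences match the query equally well above the similarity threshold, the query is assigned to their lowest common ancestor rather than to a single species. A minimal Python illustration, where the lineages are simplified and purely illustrative:

```python
# Assign a query to the lowest common ancestor of its best-matching
# references. Each lineage is listed from root to species.
def lowest_common_ancestor(lineages):
    lca = None
    for ranks in zip(*lineages):
        if len(set(ranks)) > 1:
            break  # lineages diverge here
        lca = ranks[0]
    return lca

# Two equally good matches in different deer species (made-up example).
matches = [
    ["Eukaryota", "Chordata", "Mammalia", "Cetartiodactyla", "Cervus elaphus"],
    ["Eukaryota", "Chordata", "Mammalia", "Cetartiodactyla", "Capreolus capreolus"],
]
print(lowest_common_ancestor(matches))  # -> Cetartiodactyla
```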
### 8. After the taxonomic assignment

#### Take a look at the results

For example, compute basic statistics grouped by assigned scientific name:

    obi stats -c SCIENTIFIC_NAME wolf/assigned_sequences

#### Align the sequences

    obi align -t 0.95 wolf/assigned_sequences wolf/aligned_assigned_sequences

#### Check the history of everything that was done

By default, the history is printed as bash commands:

    obi history wolf

The most visual way to look at the pipeline is to export it in dot format:

    obi history -d wolf > wolf.dot

    obi history -d wolf/cleaned_sequences > wolf_one_view.dot

To look at the graph produced, open the dot file:

    dot -Tx11 wolf.dot

or convert the dot file to a PNG image file:

    dot -Tpng wolf.dot -o wolf.png
    open wolf.png &

(`open` works on macOS; on Linux, use e.g. `xdg-open wolf.png` instead.)

You will get something like this:

![wolf](/uploads/12ce2dd47456baa59fcfa3d09df2643c/wolf.png)

#### Export the results

Export in FASTA format:

    obi export --fasta-output wolf/assigned_sequences -o wolf_results.fasta

or equivalently, writing to standard output:

    obi export --fasta-output wolf/assigned_sequences > wolf_results.fasta

Export in tab-delimited (CSV-like) format:

    obi export --tab-output wolf/aligned_assigned_sequences > wolf_results.csv

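The tab-delimited export is easy to post-process, for example with Python's `csv` module. A small sketch follows; note that the column names `ID`, `COUNT`, and `SCIENTIFIC_NAME` are assumptions about the exported header, and the two rows are made up.

```python
import csv
import io

# Two made-up rows standing in for wolf_results.csv (which, despite the
# extension, is tab-separated); header names are assumed, not guaranteed.
sample = (
    "ID\tCOUNT\tSCIENTIFIC_NAME\n"
    "seq1\t1337\tCanis lupus\n"
    "seq2\t42\tCapra hircus\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

# Sort assigned taxa by abundance, most abundant first.
rows.sort(key=lambda r: int(r["COUNT"]), reverse=True)
print(rows[0]["SCIENTIFIC_NAME"])  # -> Canis lupus
```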
### Contact

<celine.mercier@metabarcoding.org>