Commit 7ed767e6 by Eric Coissac

--no commit message

parent 56cbe063
......@@ -43,8 +43,8 @@ ecoPCR_ is a software developed in LECA_. It simulates a PCR experiment by
selecting in a sequence database, sequences matching simultaneously two
primers sequences in a way allowing a PCR amplification of a DNA region.
The ecoPrimer files
-------------------
The ecoPrimers files
--------------------
The OBITools files
......
.. automodule:: ecoDBTaxStat
.. automodule:: ecodbtaxstat
:py:mod:`ecoDBTaxStat` specific options
------------------------------------
:py:mod:`ecodbtaxstat` specific option
--------------------------------------
.. cmdoption:: --rank=<TAXONOMIC_RANK>
The taxonomic rank at which frequencies have to be computed.
Possible values are :
Possible values are:
- class
- family
- forma
......@@ -14,11 +14,10 @@
- infraclass
- infraorder
- kingdom
- no rank
- order
- parvorder
- phylum
- species [default]
- species (default)
- species group
- species subgroup
- subclass
......
......@@ -5,36 +5,45 @@
.. cmdoption:: -f <FORMAT>, --format=<FORMAT>
Format of the sequence file. Possibilities : ``raw``, ``UNITE`` or ``SILVA``
(default: ``raw``).
Format of the sequence file. Possible formats are:
- ``raw``: for regular or :doc:`OBITools extended fasta <../fasta>` files (default value).
- ``UNITE``: for fasta files downloaded from the `UNITE web site <http://unite.ut.ee/>`_.
- ``SILVA``: for fasta files downloaded from the `SILVA web site <http://www.arb-silva.de/>`_.
.. cmdoption:: -k <KEYNAME>, --key-name=<KEYNAME>
Key of the attribute containing the taxon name in sequence files in ``raw`` format.
Default: the taxon name is :doc:`the id of the sequence record<../fasta>`. The taxon name
MUST have ``_`` between the words of the name when it's the *id*, and CAN be in this format
when it's in an attribute.
Key of the attribute containing the taxon name in sequence files in
:doc:`OBITools extended fasta <../fasta>` format.
.. cmdoption:: -a <ANCESTOR>, --restricting_ancestor=<ANCESTOR>
Can be a word or a number (taxid). Enables to restrict the search of taxids under a
specified ancestor. If it is a word, it is the attribute key associated with the
ancestor's taxid in each sequence record (it can be different for each sequence record).
If it is a number, it is the taxid of the ancestor (in which case it is the same for all
the sequence records).
Restricts the search for taxids to descendants of a specified ancestor.
``<ANCESTOR>`` can be a taxid (integer) or a key (string).
- If it is a taxid, this taxid is used to restrict the search for all the sequence
records.
- If it is a key, :py:mod:`obiaddtaxids` looks for the ancestor taxid in the
corresponding attribute. This allows having a different ancestor restriction
for each sequence record.
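The two interpretations of ``<ANCESTOR>`` described above can be sketched as follows. This is only an illustration, not the actual :py:mod:`obiaddtaxids` code: the function name ``resolve_ancestor`` and the use of a plain dictionary for the sequence record attributes are assumptions made for the example.

```python
def resolve_ancestor(ancestor, attributes):
    # Hypothetical helper: `ancestor` is the raw option value (a string),
    # `attributes` is a dict standing in for one sequence record's attributes.
    try:
        # An integer-like value is taken as a taxid, shared by all records.
        return int(ancestor)
    except ValueError:
        # Otherwise the value is taken as an attribute key, looked up per record.
        value = attributes.get(ancestor)
        return int(value) if value is not None else None


# The same taxid applies to every record when a number is given.
same_for_all = resolve_ancestor("9605", {})
# A key lets each record carry its own ancestor restriction.
per_record = resolve_ancestor("family_taxid", {"family_taxid": "9604"})
```

Returning ``None`` when the key is absent is a choice made for this sketch; the source does not say how the tool handles a missing attribute.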
.. cmdoption:: -g <FILENAME>, --genus_found=<FILENAME>
File used to store sequences with a match found for the genus (not with UNITE databases).
File used to store sequences with a match found for the genus.
.. CAUTION:: this option is not valid with the UNITE format.
.. cmdoption:: -s <FILENAME>, --dirty=<FILENAME>
File used to store sequences with a match found for one of the words from the taxon name
searched (not with UNITE databases).
.. cmdoption:: -u <FILENAME>, --unidentified=<FILENAME>
File used to store sequences with no match found.
File used to store sequences with no taxonomic match found.
.. include:: ../optionsSet/taxonomyDB.txt
......@@ -45,12 +54,20 @@
.. code-block:: bash
> obiaddtaxids -T species_name -g genus_identified.fasta -u unidentified.fasta -d my_ecopcr_database_prefix my_sequences.fasta > identified.fasta
Tries to match the value associated with the ``species_name`` key of each sequence record from ``my_sequences.fasta``
with a taxon name from the ecopcr database ``my_ecopcr_database_prefix``. If there is an exact match, the sequence
record is printed in ``identified.fasta``. If not and the ``species_name`` value is composed of two words, tries to
match the first word with a taxon name from the ecopcr database. If there is a match, the sequence record is printed in
``genus_identified.fasta``. If the sequence record was printed in neither ``identified.fasta`` nor in ``genus_identified``,
it is printed in ``unidentified.fasta``.
> obiaddtaxids -T species_name -g genus_identified.fasta \
-u unidentified.fasta -d my_ecopcr_database \
my_sequences.fasta > identified.fasta
Tries to match the value associated with the ``species_name`` key of each sequence record
from the ``my_sequences.fasta`` file with a taxon name from the ecopcr database ``my_ecopcr_database``.
- If there is an exact match, the sequence record is stored in the ``identified.fasta`` file.
- If not and the ``species_name`` value is composed of two words, :py:mod:`obiaddtaxids`
considers the first word as a genus name and tries to find it in the taxonomic database.
- If a genus is found, the sequence record is stored in the ``genus_identified.fasta``
file.
- Otherwise the sequence record is stored in the ``unidentified.fasta`` file.
\ No newline at end of file
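The decision steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's implementation: ``classify_record`` is a hypothetical name, and ``toy_db`` is a small dictionary standing in for the taxonomic database.

```python
def classify_record(taxon_name, name_to_taxid):
    # Hypothetical helper: returns (destination, matched taxid or None).
    # `name_to_taxid` is a dict standing in for the taxonomic database.
    taxid = name_to_taxid.get(taxon_name)
    if taxid is not None:
        # Exact match on the full taxon name.
        return "identified", taxid
    words = taxon_name.split()
    if len(words) == 2:
        # Two-word name: treat the first word as a genus name.
        genus_taxid = name_to_taxid.get(words[0])
        if genus_taxid is not None:
            return "genus_identified", genus_taxid
    # Neither the full name nor the genus was found.
    return "unidentified", None


# Toy database used only for this example.
toy_db = {"Homo sapiens": 9606, "Homo": 9605, "Pan": 9596}

for name in ("Homo sapiens", "Pan unknownus", "Foo bar baz"):
    print(name, "->", classify_record(name, toy_db))
```

The three destinations correspond to the ``identified.fasta``, ``genus_identified.fasta`` and ``unidentified.fasta`` files of the command above. Whether the real tool also annotates genus-level matches with a taxid is not stated here, so the second returned value is only illustrative.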
......@@ -4,5 +4,6 @@ Statistics over sequence file
.. toctree::
:maxdepth: 2
scripts/ecodbtaxstat
scripts/obicount
scripts/obistat
#!/usr/local/bin/python
'''
:py:mod:`ecoDBTaxStat` : Gives taxonomic rank frequency of a given ecoPCR database
==================================================================================
:py:mod:`ecodbtaxstat`: Gives taxonomic rank frequency of a given ``ecopcr`` database
=====================================================================================
.. codeauthor:: Eric Coissac <eric.coissac@metabarcoding.org>
The :py:mod:`ecoDBTaxStat` command requires an ecoPCR database and a taxonomic rank
The :py:mod:`ecodbtaxstat` command requires an ``ecopcr`` database and a taxonomic rank
(specified by the ``--rank`` option, default *species*). The command outputs first
the total number of sequence records in the database having taxonomic information at this rank,
and then the number of sequence records for each value of this rank.
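The output described above (a total, then one count per value of the rank) can be sketched with a counter. This is an assumption-laden illustration: ``rank_frequencies`` is a hypothetical name, and each record is represented as a plain dictionary mapping a rank to a taxon name rather than as a real ``ecopcr`` database entry.

```python
from collections import Counter


def rank_frequencies(records, rank="species"):
    # Hypothetical helper: `records` is an iterable of dicts mapping
    # a taxonomic rank to the taxon name the record has at that rank.
    # Records without information at `rank` are ignored.
    counts = Counter(r[rank] for r in records if rank in r)
    total = sum(counts.values())
    return total, counts


# Toy records used only for this example.
toy_records = [
    {"species": "Homo sapiens", "genus": "Homo"},
    {"species": "Homo sapiens", "genus": "Homo"},
    {"species": "Pan paniscus", "genus": "Pan"},
    {"genus": "Pan"},  # no species-level information
]

total, counts = rank_frequencies(toy_records, rank="species")
```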
......@@ -22,7 +22,7 @@ from obitools.ecopcr.sequence import EcoPCRDBSequenceIterator
def addRankOptions(optionManager):
group = optionManager.add_option_group('ecoDBTaxStat specific option')
group = optionManager.add_option_group('ecodbtaxstat specific option')
group.add_option('--rank',
action="store", dest="rank",
metavar="<taxonomic rank>",
......@@ -31,11 +31,11 @@ def addRankOptions(optionManager):
help="The taxonomic rank at which frequencies have to be computed. "
"Possible values are: "
"class, family, forma, genus, infraclass, infraorder, kingdom, "
"no rank, order, parvorder, phylum, species, species group, "
"order, parvorder, phylum, species, species group, "
"species subgroup, subclass, subfamily, subgenus, subkingdom, "
"suborder, subphylum, subspecies, subtribe, superclass, "
"superfamily, superkingdom, superorder, superphylum, tribe or varietas. "
"[Default: species]")
"(Default: species)")
def cmptax(taxonomy):
......
#!/usr/local/bin/python
'''
:py:mod:`obiaddtaxids`: Adding taxids to sequence records using an ecopcr database
==================================================================================
:py:mod:`obiaddtaxids`: Adds taxids to sequence records using an ecopcr database
================================================================================
.. codeauthor:: Celine Mercier <celine.mercier@metabarcoding.org>
The :py:mod:`obiaddtaxids` command takes a sequence file in either fasta, SILVA or
UNITE format and an ecopcr database as inputs.
The :py:mod:`obiaddtaxids` command annotates sequence records with a taxid based on
a taxon scientific name stored in the sequence record header.
If the sequence file is in fasta format, the user should specify where to find the
taxon name associated with the sequence using the ``-T`` option.
Taxonomic information linking a taxid to a taxon scientific name is stored in a
database formatted as an ecopcr database (see :doc:`obitaxonomy <obitaxonomy>`) or
an NCBI taxdump (see NCBI ftp site).
For each sequence record, :py:mod:`obiaddtaxids` will try to match its taxon name
with one from the ecopcr database, and will print it with the associated taxid if
a match is found.
The way to extract the taxon scientific name from the sequence record header can be
refined by two options:
:py:mod:`obiaddtaxids` can associate a sequence record with a taxon from the ecopcr
database in three different ways :
- If the taxon name matches exactly one in the ecopcr database, the sequence record
is printed with a new attribute having the key ``taxid``, and the taxid associated
with the matching taxon as its value.
- By default, the sequence identifier is used. Underscore characters (_) are substituted
by spaces before looking for the taxon scientific name in the taxonomic
database.
- If the input file is in :doc:`OBITools extended fasta format <../fasta>`, the ``-k`` option
specifies the attribute containing the taxon scientific name.
- If the input file is a fasta file imported from the UNITE or from the SILVA web sites,
the ``-f`` option allows specifying this source and parsing correctly the associated
taxonomic information.
For each sequence record, :py:mod:`obiaddtaxids` tries to match the extracted taxon scientific name
with those stored in the taxonomic database.
- If a match is found, the sequence record is annotated with the corresponding ``taxid``.
Otherwise:
- If the ``-g`` option is set, the taxon name is composed of two words, and the first
one matches a taxon name from the ecopcr database, the sequence record is printed
in the file specified by the ``-g`` option.
- If the ``-g`` option is set and the taxon name is composed of two words and only the
first one is found in the taxonomic database, :py:mod:`obiaddtaxids` considers that
it found the genus associated with this sequence record and it stores this sequence
record in the file specified by the ``-g`` option.
- If the ``-s`` option is set and the exact taxon name, nor its first word if the
``-g`` option was set, matched with a taxon from the ecopcr database, each word
from the taxon name are searched. The sequences identified this way are written
in the file set by the ``-s`` option.
If the ``-u`` option is set and a sequence was printed neither in the output, the
``-g`` file nor the ``-s`` file, it is printed in the file set by the ``-u`` option.
- If the ``-u`` option is set and no taxonomic information is retrieved from the
scientific taxon name, the sequence record is stored in the file specified by the
``-u`` option.
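As an illustration of how the taxon scientific name is obtained from a raw or extended fasta record, the default rule (sequence identifier with underscores replaced by spaces) and the ``-k`` rule (name read from an attribute) can be sketched as below. The function name ``extract_taxon_name`` and the dictionary of attributes are assumptions made for this example, not part of the OBITools API.

```python
def extract_taxon_name(record_id, attributes, key=None):
    # Hypothetical helper: `attributes` is a dict standing in for the
    # attributes of an OBITools extended fasta record.
    if key is not None:
        # -k given: the name is read from the named attribute
        # (None is returned if the attribute is missing).
        return attributes.get(key)
    # Default: use the sequence identifier, with underscores
    # substituted by spaces before the database lookup.
    return record_id.replace("_", " ")


from_id = extract_taxon_name("Homo_sapiens", {})
from_key = extract_taxon_name(
    "seq1", {"species_name": "Pan paniscus"}, key="species_name"
)
```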
'''
......