Commit 57f5ad12 by Eric Coissac

--no commit message

parent 1a0a8844
#!/usr/local/bin/python
'''
:py:mod:`obiclean`: Tags a set of sequences for PCR/sequencing errors identification (sequence variants)
========================================================================================================
:py:mod:`obiclean`: Tags a set of sequences for PCR/sequencing errors identification
====================================================================================
.. codeauthor:: Eric Coissac <eric.coissac@metabarcoding.org>
:py:mod:`obiclean` is a command that classify sequence records either as ``head``, ``internal`` or ``singleton``.
:py:mod:`obiclean` is a command that classifies sequence records either as ``head``, ``internal`` or ``singleton``.
For that purpose, two informations are used:
- counting
For that purpose, two pieces of information are used:
- sequence record counts
- sequence similarities
A sequence record is considered a variant of another sequence record iff:
- its counting is lower than the counting of the other sequence record
(this can be adjusted with the ``-r`` option)
- their sequence are *related* (they can align with some errors,
the number of errors can be specified by the ``-d`` option)
*S1* a sequence record is considered as a variant of *S2* another sequence record if and only if:
- ``count`` of *S1* divided by ``count`` of *S2* is lesser than the ratio *R*.
*R* default value is set to 1, and can be adjusted between 0 and 1 with the ``-r`` option.
- both sequences are *related* to one another (they can align with some differences,
the maximum number of differences can be specified by the ``-d`` option).
The following properties hold for a sequence record *S* tagged as (exclusive):
Considering *S* a sequence record, the following properties hold for *S* tagged as:
- ``head``:
+ there exists *at least one* sequence record in the dataset that is a variant of *S*
+ there exists *no* sequence record in the dataset such that *S* is a variant of this
sequence record
+ there exists **at least one** sequence record in the dataset that is a variant of *S*
+ there exists **no** sequence record in the dataset such that *S* is a variant of this
sequence record
- ``internal``:
+ there exists *at least one* sequence record in the dataset such that *S* is a variant
of this sequence record
- ``singleton`` :
+ there exists *no* sequence record in the dataset that is a variant of *S*
+ there exists *no* sequence record in the dataset such that *S* is a variant of this
sequence record
+ there exists **at least one** sequence record in the dataset such that *S* is a variant
of this sequence record
- ``singleton``:
+ there exists **no** sequence record in the dataset that is a variant of *S*
+ there exists **no** sequence record in the dataset such that *S* is a variant of this
sequence record
By default, tagging is done once for the whole dataset, but it can also be done sample by sample
by specifying the ``-s`` option. In such a case, the counting is extracted from the samples
by specifying the ``-s`` option. In such a case, the counts are extracted from the sample
information.
Finally, each sequence record is annotated with three new attributes ``head``, ``internal`` and
``singleton``. The values are the number of samples in which the sequence record has been classified
in this manner.
``singleton``. The attribute values are the numbers of samples in which the sequence record has
been classified in this manner.
'''
from obitools.format.options import addInOutputOption, sequenceWriterGenerator
......
......@@ -6,11 +6,16 @@
.. codeauthor:: Eric Coissac <eric.coissac@metabarcoding.org>
:py:mod:`obistats` computes basic statistics for attribute values of sequence records.
The sequence records can be categorized or not using the ``-c`` option, and several ``-c`` options can be combined.
The sequence records can be categorized or not using one or several ``-c`` options.
By default, only the number of sequence records and the total count are computed for each category.
Additional statitics can be computed for attribute values in each category, like
minimum value (``-m`` option), maximum value (``-M`` option), mean value
(``-a`` option), variance (``-v`` option) or standard deviation (``-s`` option).
Additional statistics can be computed for attribute values in each category, like:
- minimum value (``-m`` option)
- maximum value (``-M`` option)
- mean value (``-a`` option)
- variance (``-v`` option)
- standard deviation (``-s`` option)
The result is a contingency table with the different categories in rows, and the
computed statistics in columns.
......@@ -105,7 +110,8 @@ def sd(values,options):
if __name__ == "__main__":
optionParser = getOptionManager([addStatOptions,addInputFormatOption,addTaxonomyDBOptions])
optionParser = getOptionManager([addStatOptions,addInputFormatOption,addTaxonomyDBOptions],
progdoc=__doc__)
(options, entries) = optionParser()
......
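The classification rules stated in the revised `obiclean` docstring above can be sketched as follows. This is a minimal illustration written for Python 3, not the obitools implementation: the `Record` tuple, the `related`, `is_variant` and `classify` functions, and the use of a plain Hamming distance in place of the real alignment step are all assumptions made for the example. The `ratio` and `max_diff` parameters stand in for the values of the `-r` and `-d` options.

```python
from collections import namedtuple

# Hypothetical record type: an identifier, a sequence and its count.
Record = namedtuple("Record", ["id", "seq", "count"])


def related(a, b, max_diff):
    """Assumed stand-in for the alignment test: equal-length sequences
    that differ at no more than max_diff positions (Hamming distance)."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= max_diff


def is_variant(s1, s2, ratio=1.0, max_diff=1):
    """True if s1 is a variant of s2: the count ratio is below `ratio`
    and the two sequences are related."""
    return (s1.count / s2.count < ratio) and related(s1.seq, s2.seq, max_diff)


def classify(records, ratio=1.0, max_diff=1):
    """Tag each record as 'head', 'internal' or 'singleton'."""
    tags = {}
    for s in records:
        others = [o for o in records if o is not s]
        has_variant = any(is_variant(o, s, ratio, max_diff) for o in others)
        is_a_variant = any(is_variant(s, o, ratio, max_diff) for o in others)
        if is_a_variant:
            tags[s.id] = "internal"    # S is a variant of some other record
        elif has_variant:
            tags[s.id] = "head"        # S has variants but is nobody's variant
        else:
            tags[s.id] = "singleton"   # S has no relation in either direction
    return tags


records = [
    Record("a", "ACGT", 100),
    Record("b", "ACGA", 10),
    Record("c", "ACAA", 2),
    Record("d", "TTTT", 5),
]
print(classify(records))
```

With the default ratio of 1 and one allowed difference, `b` is a variant of `a` and `c` is a variant of `b`, so `a` is tagged `head`, `b` and `c` are tagged `internal`, and `d`, which is related to nothing, is tagged `singleton`.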