Commit 03faa946 by Eric Coissac

--no commit message

parent 6c497381
......@@ -6,12 +6,31 @@
.. cmdoption:: -a, --all
Prints only the total count of sequence records (if a sequence has no `count` attribute, its default count is 1) (default: False).
Prints only the sum of ``count`` attributes.
If a sequence has no `count` attribute, its default count is 1.
*Example:*
.. code-block:: bash
> obicount -a seq.fasta
For all sequence records contained in the ``seq.fasta`` file, prints only
the sum of ``count`` attributes.
.. cmdoption:: -s, --sequence
Prints only the number of sequence records.
*Example:*
.. code-block:: bash
> obicount -s seq.fasta
Prints only the number of sequence records contained in the ``seq.fasta`` file.
.. include:: ../optionsSet/inputformat.txt
.. include:: ../optionsSet/defaultoptions.txt
......
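The two numbers described in this hunk (the number of sequence records and the sum of the ``count`` attributes, with a default of 1 when the attribute is missing) can be illustrated with a minimal sketch. It is not the obicount implementation: records are modelled here as plain attribute dictionaries, and the name ``obicount_like`` is hypothetical.

```python
def obicount_like(records):
    """Return (number of records, sum of their ``count`` attributes).

    Each record is modelled as a dict of attributes (an assumption made
    for this sketch). A record without a ``count`` attribute counts as 1.
    """
    n_records = 0
    total_count = 0
    for attributes in records:
        n_records += 1
        total_count += int(attributes.get("count", 1))
    return n_records, total_count


# Three records: the second one has no ``count`` attribute.
records = [{"count": 3}, {}, {"count": 2, "sample": "A"}]
n_records, total_count = obicount_like(records)
print(n_records, total_count)
```

In this sketch, ``-s`` would correspond to reporting only the first value and ``-a`` only the second.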
......@@ -7,41 +7,43 @@
.. cmdoption:: -m <KEY>, --merge=<KEY>
Attribute to merge
Attribute to merge.
*Example:*
*Example:*
.. code-block:: bash
.. code-block:: bash
> obiuniq -m sample seq1.fasta > seq2.fasta
> obiuniq -m sample seq1.fasta > seq2.fasta
Dereplicates sequences and keep track of samples of origin
Dereplicates sequences and keeps the value distribution of the ``sample`` attribute
in the new attribute ``merged_sample``.
.. cmdoption:: -i , --merge-ids',
.. cmdoption:: -i , --merge-ids
Add a ``merged`` attribute containing the list of sequence record ids merged
within this group
Adds a ``merged`` attribute containing the list of sequence record ids merged
within this group.
.. cmdoption:: -c <KEY>, --category-attribute=<KEY>
Add one attribute to the list of attributes used to group sequences before
dereplication (option can be used several times)
Adds one attribute to the list of attributes used to define sequence groups
(this option can be used several times).
*Example:*
*Example:*
.. code-block:: bash
.. code-block:: bash
> obiuniq -c sample seq1.fasta > seq2.fasta
> obiuniq -c sample seq1.fasta > seq2.fasta
Dereplicates sequences within each sample
Dereplicates sequences within each sample.
.. cmdoption:: -p','--prefix',
.. cmdoption:: -p, --prefix
Dereplication is done based on prefix matching:
Dereplication is done based on prefix matching:
1. The shortest sequence of each group is a prefix of any sequence of its group
2. The shortest sequence of a group is the prefix of only the sequence belonging
to its group
1. The shortest sequence of each group is a prefix of any sequence of its group
2. The shortest sequence of a group is the prefix of only the sequences belonging
to its group
.. include:: ../optionsSet/taxonomyDB.txt
......
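The two prefix-matching rules of the ``-p`` option can be sketched as follows. This is only an illustration under simplifying assumptions, not the algorithm used by obiuniq: sequences are plain strings, and the function name ``group_by_prefix`` is hypothetical. Processing sequences from shortest to longest ensures that the seed of each group is its shortest sequence.

```python
def group_by_prefix(sequences):
    """Group sequences so that the shortest member of each group
    is a prefix of every other member of that group."""
    groups = {}  # seed (shortest sequence) -> list of member sequences
    # Shortest sequences first, so that every seed is seen before
    # the longer sequences it is a prefix of.
    for seq in sorted(sequences, key=len):
        for seed in groups:
            if seq.startswith(seed):
                groups[seed].append(seq)
                break
        else:
            # No existing seed is a prefix of this sequence:
            # it becomes the seed of a new group.
            groups[seq] = [seq]
    return groups


groups = group_by_prefix(["acgtt", "acg", "ttg", "acgta", "ttgca", "acg"])
for seed, members in groups.items():
    print(seed, members)
```

Because a longer seed candidate that starts with an existing seed joins that seed's group instead of opening a new one, no seed can be a prefix of a sequence belonging to another group, which matches the second rule.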
......@@ -8,56 +8,65 @@
Filename containing a list of oligonucleotides. `oligotag` selects within this list
the oligonucleotides that match the specified options.
Cannot be used with the -s option.
.. CAUTION:: Cannot be used with the ``-s`` option.
.. cmdoption:: -s ###, --oligo-size=###
Size of oligonucleotides to be generated.
Cannot be used with the -L option.
.. CAUTION:: Cannot be used with the ``-L`` option.
.. WARNING:: A size equal or greater than 8 leads to very long computing time and large memory.
.. WARNING:: A size equal or greater than eight often leads to a very long
computing time and requires a large amount of memory.
.. cmdoption:: -f ###, --family-size=###
Size of oligonucleotide family to be generated.
Minimal size of the oligonucleotide family to be generated.
.. cmdoption:: -d ###, --distance=###
Minimal distance between two oligonucleotides.
Minimal Hamming distance (number of differences)
between two oligonucleotides.
.. cmdoption:: -g ###, --gc-max=###
Maximum count of G or C nucleotides acceptable in the oligonucleotides.
Maximum number of G or C in the oligonucleotides.
.. cmdoption:: -a <IUPAC pattern>, --accepted=<IUPAC pattern>
.. cmdoption:: -a <IUPAC_PATTERN>, --accepted=<IUPAC_PATTERN>
Pattern of accepted oligonucleotide using the `IUPAC <../iupac>` code.
Selected oligonucleotides are constrained by the given pattern
(only :doc:`IUPAC <../iupac>` symbols are allowed).
.. CAUTION:: pattern length must have the same length as oligonucleotides.
.. cmdoption:: -r <IUPAC pattern>, --rejected=<IUPAC pattern>
.. cmdoption:: -r <IUPAC_PATTERN>, --rejected=<IUPAC_PATTERN>
Pattern of rejected oligonucleotide using the `IUPAC <../iupac>` code.
Selected oligonucleotides do not match the given pattern
(only :doc:`IUPAC <../iupac>` symbols are allowed).
.. CAUTION:: pattern length must have the same length as oligonucleotides.
.. cmdoption:: -p ###, --homopolymer=###
Reject oligonucleotides with homopolymer longer than the specified number.
Selected oligonucleotides do not contain any homopolymer
longer than the specified length.
.. cmdoption:: -P ###, --homopolymer-min=###
Accept only oligonucleotides with homopolymer longer or equal to the specified number.
Selected oligonucleotides contain at least one homopolymer longer
or equal to the specified length.
.. cmdoption:: -T <seconde>, --timeout=<seconde>
Timeout to identify a set of oligonucleotides of good size, as defined by the -f option.
Timeout to identify a set of oligonucleotides of required size,
as defined by the ``-f`` option.
.. include:: ../optionsSet/defaultoptions.txt
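The constraints documented in this hunk (Hamming distance for ``-d``, number of G or C for ``-g``, IUPAC patterns for ``-a`` and ``-r``, homopolymer length for ``-p`` and ``-P``) can be expressed as small predicates. The sketch below only illustrates the definitions; it is not the oligotag code, all function names are hypothetical, and oligonucleotides are assumed to be lowercase strings.

```python
from itertools import groupby

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "ag", "y": "ct", "s": "cg", "w": "at", "k": "gt", "m": "ac",
    "b": "cgt", "d": "agt", "h": "act", "v": "acg", "n": "acgt",
}


def hamming(a, b):
    """Number of positions at which two same-length oligos differ."""
    if len(a) != len(b):
        raise ValueError("oligonucleotides must have the same length")
    return sum(x != y for x, y in zip(a, b))


def gc_count(oligo):
    """Number of G or C nucleotides in the oligo."""
    return sum(base in "gc" for base in oligo)


def longest_homopolymer(oligo):
    """Length of the longest run of identical nucleotides."""
    return max((len(list(run)) for _, run in groupby(oligo)), default=0)


def matches_pattern(oligo, pattern):
    """True if every base is allowed by the IUPAC symbol at its position."""
    return len(oligo) == len(pattern) and all(
        base in IUPAC[symbol] for base, symbol in zip(oligo, pattern)
    )


print(hamming("acgta", "acgtt"))
print(gc_count("acgcg"))
print(longest_homopolymer("aacgtt"))
print(matches_pattern("cacacacac", "yryryryry"))
```

With these definitions, ``-p 1`` corresponds to requiring ``longest_homopolymer(oligo) <= 1``, and ``-r cnnnnnn`` to rejecting any 7-mer for which ``matches_pattern(oligo, "cnnnnnn")`` is true.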
......@@ -74,8 +83,8 @@
Searches for a family of at least 24 oligonucleotides of a length of 5 nucleotides,
with at least 3 differences among them, with a maximum of 3 C/G, and without
homopolymers longer than 2. The corresponding list of oligonucleotides is saved in
the mytags.txt file.
homopolymers longer than 2. The resulting list of oligonucleotides is saved in
the ``mytags.txt`` file.
*Example 2:*
......@@ -84,10 +93,10 @@
> oligotag -d 5 -L my_oligos.txt -f 10 -p 1
Searches for a family of at least 10 oligonucleotides in the my_oligos.txt file, with
at least 5 differences among them, and without homopolymers. The my_oligos.txt file must
contain only a set of oligonucleotides of the same length, with one oligonucleotide per line.
The corresponding list of oligonucleotides is printed on the terminal window.
Searches for a subset of at least 10 oligonucleotides listed in the ``my_oligos.txt`` file, with
at least 5 differences among them, and without homopolymers. The ``my_oligos.txt`` file must
contain a set of oligonucleotides of the same length, with only one oligonucleotide per line.
The resulting list of oligonucleotides is printed on the terminal window.
......@@ -98,9 +107,8 @@
> oligotag -s 7 -f 96 -d 3 -p 1 -r cnnnnnn > mytags.txt
Searches for a family of at least 96 oligonucleotides of a length of 7 nucleotides,
with at least 3 differences among them, without homopolymers, and without a 'c' in
the first position. The corresponding list of 105 oligonucleotides is saved in
the mytags.txt file.
with at least 3 differences among them, without homopolymers, and without a ``C`` in
the first position. The resulting list is saved in the ``mytags.txt`` file.
*Example 4:*
......@@ -110,9 +118,10 @@
> oligotag -s 9 -f 24 -d 3 -a yryryryry > mytags.txt
Searches for a family of at least 24 oligonucleotides of a length of 9 nucleotides,
with at least 3 differences among them, with an alternation of pyrimidines and purines.
The corresponding list of 25 oligonucleotides is saved in the mytags.txt file. With the
constraints imposed by the -a option, it is possible to have longer oligonucleotides.
with at least 3 differences among them, and an alternation of pyrimidines and purines.
The resulting list is saved in the ``mytags.txt`` file. Because of the
constraints imposed by the ``-a`` option, it is possible to compute longer oligonucleotides
in a reasonable time.
Reference
......
......@@ -5,3 +5,4 @@ Statistics over sequence file
:maxdepth: 2
scripts/obicount
scripts/obistat
......@@ -6,4 +6,5 @@ Utilities
:maxdepth: 2
scripts/oligotag
scripts/obisort
\ No newline at end of file
......@@ -5,27 +5,16 @@
.. codeauthor:: Eric Coissac <eric.coissac@metabarcoding.org>
:py:mod:`obicount` counts the number of sequence records and/or their total count (with the `count` attribute) in a sequence file.
:py:mod:`obicount` counts the number of sequence records and/or the sum of the ``count`` attributes.
Examples:
*Example:*
.. code-block:: bash
> obicount seq.fasta
Prints both the number of sequence records and their total count in the ``seq.fasta`` file
.. code-block:: bash
> obicount -a seq.fasta
Prints only the total count of sequence records in the ``seq.fasta`` file.
.. code-block:: bash
> obicount -s seq.fasta
Prints only the number of sequence records in the ``seq.fasta`` file.
Prints the number of sequence records contained in the ``seq.fasta``
file and the sum of their ``count`` attributes.
'''
from obitools.options import getOptionManager
......
#!/usr/local/bin/python
'''
:py:mod:`obigrep` : Filters sequence file
:py:mod:`obigrep`: Filters sequence file
=========================================
.. codeauthor:: Eric Coissac <eric.coissac@metabarcoding.org>
......
#!/usr/local/bin/python
'''
:py:mod:`obihead` : Extracts the first sequence records
:py:mod:`obihead`: Extracts the first sequence records
=======================================================
.. codeauthor:: Eric Coissac <eric.coissac@metabarcoding.org>
......
#!/usr/local/bin/python
'''
:py:mod:`obisplit` : Splits a sequence file in a set of subfiles
:py:mod:`obisplit`: Splits a sequence file in a set of subfiles
================================================================
.. codeauthor:: Eric Coissac <eric.coissac@metabarcoding.org>
......
#!/usr/local/bin/python
'''
:py:mod:`obitail` : Extracts the last sequence records
:py:mod:`obitail`: Extracts the last sequence records
======================================================
.. codeauthor:: Eric Coissac <eric.coissac@metabarcoding.org>
......
#!/usr/local/bin/python
'''
:py:mod:`obiuniq` : Groups and dereplicates sequences
=====================================================
:py:mod:`obiuniq`: Groups and dereplicates sequences
====================================================
.. codeauthor:: Eric Coissac <eric.coissac@metabarcoding.org>
The :py:mod:`obiuniq` command is in some way analog to the standard Unix ``uniq -c`` command.
Instead of working text line by text line as the standard Unix tool, the processing is done on sequence records.
Instead of working text line by text line as the standard Unix tool, the processing is done on
sequence records.
A sequence record is a :doc:`complex object <../fasta>` composed of an identifier, a set of attributes (``key=value``),
a definition, and the sequence itself.
A sequence record is a :doc:`complex object <../fasta>` composed of an identifier, a set of
attributes (``key=value``), a definition, and the sequence itself.
The :py:mod:`obiuniq` command groups together sequence records if their sequence are strictly identical (when using
option ``-c``, sequence records are first grouped based on the value of this attribute and then sequence dereplication
is done within these groups; when using option ``-p``, a sequence record belong to a group when the shortest
sequence of the group is a prefix of the sequence of the record). Then, for each group, a sequence record is printed.
The :py:mod:`obiuniq` command groups together sequence records. Then, for each group, a sequence
record is printed.
As the identifier, the set of attributes (``key=value``) and the definition of the sequence records that are grouped
together may be different, two options (``-m`` and ``-i``) allows to refine how these parts of the records are build.
A group is defined by the sequence and optionally by the values of a set of attributes
specified with the ``-c`` option.
By default, only attributes with identical values within a group of sequence records are kept. Nevertheless, all
attributes can be merged (``-m`` option). In such case, a new attribute with name prefixed by ``merged_`` is created,
counting, for each value, the number of times it occurs within the group.
As the identifier, the set of attributes (``key=value``) and the definition of the sequence
records that are grouped together may be different, two options (``-m`` and ``-i``)
allow refining how these parts of the records are reported.
The command also set the ``count`` attribute to the total number of sequence records for each group.
- By default, only attributes with identical values
within a group of sequence records are kept.
In case where a taxonomy is loaded (``-d`` or ``-t`` options), the ``merged_taxid`` attribute is created and records
the number of times taxids have been found in the group (it may be empty if no sequence record have ``taxid`` attribute
in the group). In addition, a set of taxonomy-related attributes are generated for each group having at
least one sequence record with a ``taxid`` attribute. The ``taxid`` attribute is set to the /Last Common Ancestor/ of
the taxids of the group. All other taxonomy-related attributes created (``species``, ``genus``, ``family``,
``species_name``, ``genus_name``, ``family_name``, ``rank``, ``scientific_name``) give information on this taxid.
- A ``count`` attribute is set to the total number of sequence records for each group.
- For each attribute specified by the ``-m`` option, a new attribute whose key is prefixed
by ``merged_`` is created. These new attributes contain the number of times each value
occurs within the group of sequence records.
:py:mod:`obiuniq` and taxonomic information
-------------------------------------------
When a taxonomy is loaded (``-d`` or ``-t`` options), the ``merged_taxid``
attribute is created and records the number of times each taxid has been found in the
group (it may be empty if no sequence record has a ``taxid`` attribute in the group).
In addition, a set of taxonomy-related attributes are generated for each group having at
least one sequence record with a ``taxid`` attribute. The ``taxid`` attribute of the sequence
group is set to the *Last Common Ancestor* of the taxids of the group. All other taxonomy-related
attributes created (``species``, ``genus``, ``family``, ``species_name``, ``genus_name``,
``family_name``, ``rank``, ``scientific_name``) give information on the *Last Common Ancestor*.
'''
......
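The grouping rules described in the obiuniq docstring can be sketched as below: groups are defined by the sequence plus the attributes given with ``-c``, attributes with identical values across a group are kept, ``count`` is set to the number of records in the group, and a ``merged_`` attribute is built for each ``-m`` key. This is a simplified illustration that follows the docstring wording, not the obiuniq implementation. Records are modelled as ``(sequence, attributes)`` tuples, and ``dereplicate``, ``merge_keys`` and ``category_keys`` are hypothetical names.

```python
from collections import Counter


def dereplicate(records, merge_keys=(), category_keys=()):
    """Group (sequence, attributes) records and build one record per group."""
    groups = {}
    for sequence, attributes in records:
        # A group is defined by the sequence and, optionally, by the
        # values of the category attributes (``-c`` in the docstring).
        key = (sequence,) + tuple(attributes.get(k) for k in category_keys)
        groups.setdefault(key, []).append(attributes)

    results = []
    for key, members in groups.items():
        first = members[0]
        # Keep only attributes whose value is identical in every member.
        merged = {
            k: v for k, v in first.items()
            if all(k in m and m[k] == v for m in members)
        }
        # ``count`` is the number of sequence records in the group.
        merged["count"] = len(members)
        # ``merged_<key>`` holds the number of occurrences of each value.
        for k in merge_keys:
            merged["merged_" + k] = dict(
                Counter(m[k] for m in members if k in m)
            )
        results.append((key[0], merged))
    return results


records = [
    ("acgt", {"sample": "A", "primer": "p1"}),
    ("acgt", {"sample": "B", "primer": "p1"}),
    ("acgt", {"sample": "A", "primer": "p1"}),
    ("ttga", {"sample": "B", "primer": "p1"}),
]
for sequence, attributes in dereplicate(records, merge_keys=["sample"]):
    print(sequence, attributes)
```

Passing ``category_keys=["sample"]`` instead mimics the ``-c sample`` example: identical sequences coming from different samples then stay in separate groups.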
......@@ -8,7 +8,7 @@
:py:mod:`oligotag` designs a set of oligonucleotides that can be used for tagging a set
of samples during PCR reactions, by adding the oligonucleotides on the 5' end of the primers.
Many options allow to design a set of oligonucleotides according to specified properties.
Many options allow designing a set of oligonucleotides according to specified properties.
'''
......