Commit a5fc5964 by Eric Coissac

continue documentation

parent 8a6849e8
The ORGanelle ASeMbler algorithmns
==================================
In a eukaryote cell, the genome is mainly stored in the nucleus but organelles,
*i.e.* mitochondrion and chloroplast for plants contain also genetic material.
|Orgasm| is an assembler dedicated to assemble these organelle genome sequences
from a low coverage shotgun sequencing.
|Orgasm| relies on the property that the copy number of organelle genomes is
higher per cell than the count of nuclear genome copies. The typical size for an
animal mitochondrial genome is about 16,000 base pairs (bp), and 150,000 bp
or 150 kb for a plant chloroplastic genome.
.. figure:: genome_size.*
:align: center
:figwidth: 80 %
:width: 600
.. figure:: algorithms.*
:align: center
:figwidth: 80 %
:width: 500
.. toctree::
:maxdepth: 2
algorithms/seedselection
algorithms/graphextension
algorithms/graphcleanning
algorithms/gapfilling
algorithms/graphunfolding
Graph extension
===============
.. figure:: ../extension.*
:align: center
:figwidth: 80 %
:width: 500
The assembling stack
Seed selection
==============
During the assembling process, to focus on a specific sequence, the mitochondrial
genome, the chloroplast genome or the nuclear rDNA, |orgasm| initiates the
assembling from a subset of reads supposed to belongs the targeted sequence.
The seed selection process aims to select this read subset. Reads are selected
because they present sequence similarity with genetic elements known to be
present on the targeted sequence.
Two kind of genetic elements can be used for selecting these reads:
- Protein encoding genes
- DNA sequences known to be present on the target sequence.
These genetics elements will be further named *seeds*. When prtein encoding
genes are used as seeds, the protein encoded sequence have to be used rather
than the nucleic gene sequence. This allow to be more sensitive during the read
selection and to use less phylogenetically closely related species sequence as
probe. As example the :ref:`protChloroArabidopsis <buildgraph.seeds>` seeds set
provided with the :ref:`buildgraph <oa_buildgraph>` command is constituted of 47
chloroplastic protein sequences from *Arabidopsis thaliana*. It allows to
initiate the assembling of most of the plant chloroplaste genomes.
|Orgasm| uses an algorithm similar, but simpler than the one used by `BLAST`_
and based on an `Aho Corasick automata`_.
Seed sequences are splitted in short words (kmer). The size of the words **kup**
is set by default to four for protein sequences and to twelve for DNA sequences.
This default size can be set up using the :ref:`--kup <buildgraph.kup>` option
of the :ref:`buildgraph <oa_buildgraph>` command.
When protein sequences are used as seeds, the kmers are back-translated to DNA
according to all the `NCBI genetic codes`_. Consequently a single proteic kmer
will be converted in a set of DNA words allowing to take into account the
genetic code degeneracy.
.. _`fig.backtranslate`:
.. figure:: ahocorasick.*
:align: center
:figwidth: 50 %
:width: 500
Protein sequences are splitted in short overlaping words.
Each small word is back-translated to DNA, reverse-complemented and
inserted into the Aho-Corasick automata. You can see for each protein
word a tree representing the Aho Corasick automata strucuture corresponding
to it. Each branch of these trees correspond to a DNA work issued from the
back-translation process and the reverse complement of these DNA words. This
give you an idea of the number of DNA words generated for each peptide.
The automata is filled
.. _`BLAST`: http://blast.ncbi.nlm.nih.gov/Blast.cgi
.. _`Aho Corasick automata`: https://en.wikipedia.org/wiki/Aho–Corasick_algorithm
.. _`NCBI genetic codes`: http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
The organelle assembler commands
The ORGanelle ASseMbler commands
================================
.. toctree::
......@@ -8,3 +8,4 @@ The organelle assembler commands
assembling
finishing
unfolding
utilities
......@@ -6,3 +6,197 @@ The :program:`buildgraph` command
The :ref:`organelle assembler <oa>`'s :program:`buildgraph`
realizes the assembling of the reads by building the De Bruinj Graph which
is the central data structure used by the :ref:`organelle assembler <oa>`.
command prototype
-----------------
.. program:: oa buildgraph
.. code-block:: none
usage: oa buildgraph [-h] [--seeds seeds]
[--adapt5 adapt5] [--adapt3 adapt3]
[--coverage BUILDGRAPH:COVERAGE]
[--lowcomplexity]
[--minread BUILDGRAPH:MINREAD]
[--minoverlap BUILDGRAPH:MINOVERLAP]
[--minratio BUILDGRAPH:MINRATIO]
[--mincov BUILDGRAPH:MINCOV]
[--assmax BUILDGRAPH:ASSMAX]
[--smallbranches BUILDGRAPH:SMALLBRANCHES]
[--back ORGASM:BACK] [--snp]
index [output]
positional arguments
--------------------
.. option:: index
index root filename (produced by the oa index command)
.. option:: output
output prefix
optional arguments
------------------
General option
++++++++++++++
.. option:: -h, --help
show the help message and exit
Graph initialisation options
++++++++++++++++++++++++++++
.. _buildgraph.seeds:
.. option:: --seeds seeds
Seed sequences; either a fasta file containing seeds
sequences (nucleic or proteic) or the name of an internal
set of seeds among:
- ``nucrRNAAHypogastrura``
- ``nucrRNAArabidopsis``
- ``protChloroArabidopsis``
- ``protMitoCapra``
- ``protMitoMachaon``
.. code-block:: bash
$ oa buildgraph --seeds protChloroArabidopsis seqindex
.. _buildgraph.kup:
.. option:: --kup ORGASM:KUP
The word size used to identify the seed reads
[default: protein=4, DNA=12].
Graph extension options
+++++++++++++++++++++++
.. figure:: ../extension.*
:align: center
:figwidth: 80 %
:width: 500
The assembling stack
.. option:: --minread BUILDGRAPH:MINREAD
the minimum count of read to consider [default:
<estimated>]
.. code-block:: bash
$ oa buildgraph --seeds protChloroArabidopsis --minread 5 seqindex
Consider an extension if at least five reads are present in the extension
stack.
.. option:: --coverage BUILDGRAPH:COVERAGE
the expected sequencing coverage [default:
<estimated>]
.. option:: --minoverlap BUILDGRAPH:MINOVERLAP
minimum length of the overlap between the sequence and
reads to participate in the extension. [default:
<estimated>]
.. option:: --minratio BUILDGRAPH:MINRATIO
minimum ratio between occurrences of an extension and
the occurrences of the most frequent extension to keep
it. [default: <estimated>]
.. option:: --mincov BUILDGRAPH:MINCOV
minimum occurrences of an extension to keep it.
[default: 1]
Graph filtering options
+++++++++++++++++++++++
.. option:: --lowcomplexity
Use also low complexity probes
.. option:: --adapt5 adapt5
adapter sequences used to filter reads beginning by
such sequences; either a fasta file containing adapter
sequences or internal set of adapter sequences among
['adapt5ILLUMINA'] [default: adapt5ILLUMINA]
.. option:: --adapt3 adapt3
adapter sequences used to filter reads ending by such
sequences; either a fasta file containing adapter
sequences or internal set of adapter sequences among
['adapt3ILLUMINA'] [default: adapt3ILLUMINA]
Graph limit option
++++++++++++++++++
.. option:: --assmax BUILDGRAPH:ASSMAX
maximum base pair assembled
Graph cleaning options
++++++++++++++++++++++
.. option:: --smallbranches BUILDGRAPH:SMALLBRANCHES
After a cycle a extension, if you observe the assembling graph
you can observe a main path and many small aborted branches surrounding
this main path. They correspond to path initiated by a sequencing
error or a nuclear copy of a chloroplast region not enough covered by
the skimming sequencing to be successfuly extended.
One of the cleaning step consist in deleting these small branches.
This option indicates up to which lenght branches have to be deleted.
By default this legth is automaticaly estimated from the graph.
.. code-block:: bash
$ oa buildgraph --seeds protChloroArabidopsis \
--smallbranches 15 seqindex
During the cleaning steps, all the branches with a legth
shorter or equal to 15 base pairs will be deleted
.. option:: --snp
When the data set correspond to a pool of individuals, it is possible
that natural polymorphisms artificially complexy the assembling graph.
For helping the assembling process of such data set, this option will
clear the graph for such SNP by keeping only the most abundant allele
prsent in the dataset. The generated sequence can be considered as a
king of consensus. Read can be remapped in a second time on this
consensus using classical sofware like `BWA <bwa>`_ to get the lost SNP
information.
By default this option is deactivated
.. code-block:: bash
$ oa buildgraph --seeds protChloroArabidopsis \
--snp seqindex
Run the assembling, ignoring the SNPs.
Gap filling option
++++++++++++++++++
.. option:: --back ORGASM:BACK
The number of bases taken at the end of contigs to
jump with pared-ends [default: <estimated>]
.. _`bwa`: http://sourceforge.net/projects/bio-bwa/
.. _oa_compact:
The :program:`compact` command
==============================
.. _oa_cutlow:
The :program:`cutlow` command
=============================
.. _oa_fasta:
The :program:`fasta` command
============================
.. _oa_fillgaps:
The :program:`fillgaps` command
===============================
.. _oa_graphstats:
The :program:`graphstats` command
=================================
......@@ -3,17 +3,16 @@
The :program:`index` command
============================
The :ref:`organelle assembler <oa>`'s :program:`index` command indexes
|Orgasm|'s :program:`index` command indexes
lexicographicaly a set of sequence reads to make them usable by the
assembler.
The constraints of the :ref:`organelle assembler <oa>`
------------------------------------------------------
The constraints of the |orgasm|
-------------------------------
The :ref:`organelle assembler <oa>` was developed to deal with Illumina
paired-end reads. Consequently, the algortihm of the
:ref:`organelle assembler <oa>` requires that the indexed reads respect
several constraints.
|Orgasm| was developed to deal with Illumina
paired-end reads. Consequently, the algortihm of |orgasm|
requires that the indexed reads respect several constraints.
- The reads must be paired.
- They must have all the same length.
......@@ -40,33 +39,57 @@ These options allow for indicating:
They also permit to define the read length following three strategies.
Setting up the length of the indexed reads
------------------------------------------
The default strategy
++++++++++++++++++++
When nothing is specified the :ref:`organelle assembler <oa>`
:program:`index <oa_index>` commandconsiders that all the reads from the
dataset have the same length. Considering this, the actual read length of
the dataset is estimated from the first read of the forward file.
If the read length is even, it is decreased by one.
When nothing is specified |orgasm| :program:`index <oa_index>` command
considers that all the reads from the dataset have the same length.
Considering this, the actual read length of the dataset is estimated
from the first read of the forward file. If the read length is even,
it is decreased by one.
During the indexing procedure, reads shorter than the limit are discarded,
read longuer than the limit are trimmed on their 3' end to fit the good
length. After the trimming reads containing :ref:`IUPAC <iupac_code>` ambiguity
code are also discarded.
During the indexing procedure, read pairs containing a read shorter than
the limit are discarded, read longer than the limit are trimmed on their
3' end to fit the good length. After read pairs containing a
:ref:`IUPAC <iupac_code>` ambiguity code are also discarded.
The user defined read length
++++++++++++++++++++++++++++
Using the option :ref:`--length <opt_idx_length>`
Using the option :ref:`--length <index.length>`, users can specify the read
length to index. If the specified read length is even, it is decreased by one.
As for the default strategy, read pairs containing a read shorter than the
specified limit are discarded, read longuer than the limit are trimmed on
their 3' end to fit the good length. After read pairs containing a
:ref:`IUPAC <iupac_code>` ambiguity code are also discarded.
The estimated read length
+++++++++++++++++++++++++
Running the :program:`index` with the option
:ref:`--estimate-length FRACTION <index.estimate-length>` asks to estimate
the maximum length usable to use at least ``FRACTION`` part of the dataset.
``FRACTION`` is a float number ranging between *0.0* and *1.0*.
In this mode the dataset is read a first time and the longest sub-sequence of
each read containing no :ref:`IUPAC <iupac_code>` ambiguity code is extracted.
The length distribution of these sub-sequences is computed. According to this
distribution the maximal length allowing to use at least ``FRACTION`` part
of the dataset is estimated.
Running the :program:`index`
Only the sub-sequences without :ref:`IUPAC <iupac_code>` ambiguity code are
indexed. Read pairs containing a read shorter than the estimated length are
discarded
One of the most common way for running the :program:`index` command looks like
to the following command:
The most common way to run the index command
--------------------------------------------
The basic unix command for running the :program:`index` command looks like
to this:
.. code-block:: bash
......@@ -74,45 +97,67 @@ to the following command:
seqindex \
forward.fastq.gz reverse.fastq.gz
the :program:`index` command creates four files :
- <index>.ogx : contains information concerning the index
- <index>.ofx : contains the sequences themselves and the forward index
- <index>.orx : contains reverse index
- <index>.opx : contains read pairing data
- ``<index>.ogx`` : contains information concerning the index
- ``<index>.ofx`` : contains the sequences themselves and the forward index
- ``<index>.orx`` : contains reverse index
- ``<index>.opx`` : contains read pairing data
The :ref:`organelle assembler <oa>` will need all these file to process assembling.
`<index>` represents the name of index that will be used later by the assembler.
|Orgasm| will need all these file to process assembling.
``<index>``` represents the name of index that will be used later by the assembler.
A fifth file named ``<index>.log`` contains the traces generated by the indexation
process.
command prototype
-----------------
.. code-block:: none
$ oa index [--single | --mate-pairs] \
[--check-ids] [--check-pairing] \
[--max-read ###] \
[--length ### | --estimate-length #.##] \
[--fasta | --forward-fasta | --reverse-fasta] \
[--no-pipe] \
<index> <forward_fastq_file> [reverse_fastq_file]
usage: $ oa index [-h] [--single | --mate-pairs]
[--check-ids] [--check-pairing]
[--max-read ###]
[--length ### | --estimate-length #.##]
[--fasta | --forward-fasta | --reverse-fasta]
[--no-pipe]
<index> <forward_fastq_file> [reverse_fastq_file]
positional arguments
--------------------
.. option:: index
Name of the produced index
options
-------
.. option:: forward
Filename of the forward reads
.. option:: reverse
Filename of the reverse reads
optional arguments
------------------
.. program:: oa index
General option
++++++++++++++
.. option:: -h, --help
show the help message and exit
Sequencing strategy
+++++++++++++++++++
.. cmdoption:: --single
.. option:: --single
Single read mode.
.. cmdoption:: --mate-pairs
.. option:: --mate-pairs
Indicates that the two read files were obtained using a mate pair
sequencing strategy.
......@@ -120,13 +165,13 @@ Sequencing strategy
Sequence file checking
++++++++++++++++++++++
.. cmdoption:: --check-ids
.. option:: --check-ids
Checks that forward and reverse ids are identical.
The two sequence ids `seqid/1` and `seqid/2` are considered as
identical.
.. cmdoption:: --check-pairing
.. option:: --check-pairing
Ensure that forward and reverse files are correctly paired.
The pairing is checked based on the sequence identifier.
......@@ -136,7 +181,7 @@ Sequence file checking
Limit for the indexation
++++++++++++++++++++++++
.. cmdoption:: --max-read ###
.. option:: --max-read ###
`###` indicates the number of millions of reads to index. If not
specified all the reads are indexed within the limit imposed by
......@@ -148,9 +193,9 @@ Limit for the indexation
Build the index with a maximum of four millons of reads.
.. _opt_idx_length:
.. _index.length:
.. cmdoption:: --length ###
.. option:: --length ###
`###` represents the read length to consider. Only reads
with a length greater or equal to `###` will be indexed. Reads longer
......@@ -167,7 +212,9 @@ Limit for the indexation
from the length of the first read of the forward file or through the
:option:`--estimate-length #.##` option.
.. cmdoption:: --estimate-length #.##
.. _index.estimate-length:
.. option:: --estimate-length #.##
`#.##` ranging between 0.0 and 1.0, indicates which fraction
of the overall dataset we want to use. When this option is used
......@@ -183,7 +230,7 @@ Limit for the indexation
Sequence file format
++++++++++++++++++++
.. cmdoption:: --fasta
.. option:: --fasta
Indicates than the two sequence files to index are :ref:`fasta <fasta>` files.
......@@ -191,7 +238,7 @@ Sequence file format
$ oa index --fasta seqindex forward.fasta reverse.fasta
.. cmdoption:: --forward-fasta
.. option:: --forward-fasta
Indicates than the forward file is a fasta file
......@@ -199,7 +246,7 @@ Sequence file format
$ oa index --forward-fasta seqindex forward.fasta reverse.fastq
.. cmdoption:: --reverse-fasta
.. option:: --reverse-fasta
Indicates than the reverse file is a fasta file
......@@ -224,7 +271,7 @@ compressed with `bzip2`_.
System option
+++++++++++++
.. cmdoption:: --no-pipe
.. option:: --no-pipe
By default the :ref:`organelle assembler <oa>` uses named pipes to transfer
data among programs. Using this option you can enforce to use
......
.. _oa_list:
The :program:`list` command
===========================
.. _oa_path:
The :program:`path` command
=============================
.. _oa_unfold:
The :program:`unfold` command
=============================
.. _oa_unfoldrdna:
The :program:`unfoldrdna` command
=================================
......@@ -46,7 +46,8 @@ extensions = ['sphinx.ext.autodoc', 'sphinx.ext.todo',
# 'matplotlib.sphinxext.ipython_directive',
'sphinx.ext.doctest',
'ipython_console_highlighting',
'numpydoc'
'numpydoc',
'numfig'
]
# Add any paths that contain templates here, relative to this directory.
......@@ -65,6 +66,11 @@ master_doc = 'index'
project = u'Organelle Assembler'