Commit b05438a3 authored by Eric Coissac

Merge branch 'master' of git@git.metabarcoding.org:org-asm/org-asm.git

parents 4e9cae0f 3504c054
0.1.12 0.1.13
\ No newline at end of file
The assembly graph
------------------
|Orgasm| relies on a graph representation of the ongoing assembly. The graph used is a
`De Bruijn graph <https://en.wikipedia.org/wiki/De_Bruijn_graph>`_, restricted
to nodes representing substrings of a given length of the genome to reconstruct.
.. figure:: ./graph.*
:align: center
:figwidth: 80 %
:width: 500
An example of the underlying graph (adapted from `Compeau et al. (2011) <http://www.nature.com/nbt/journal/v29/n11/full/nbt.2023.html>`_)
In our particular case, the words have the length of the reads produced by the shotgun
sequencing, so the graph is also a string graph for our set of reads (see
`Myers (2005) <http://bioinformatics.oxfordjournals.org/content/21/suppl_2/ii79.abstract>`_
for details).
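The idea can be illustrated with a minimal sketch. It is not the |Orgasm| implementation: the function name ``build_read_graph`` is hypothetical, and the sketch assumes error-free reads of identical length L, a single strand, and edges only between reads that overlap by exactly L-1 bases.

```python
from collections import defaultdict


def build_read_graph(reads):
    """Toy graph whose nodes are whole reads (hypothetical helper).

    An edge a -> b exists when the last L-1 bases of a are the
    first L-1 bases of b, L being the common read length.
    """
    reads = sorted(set(reads))  # one node per distinct read
    by_prefix = defaultdict(list)
    for read in reads:
        by_prefix[read[:-1]].append(read)
    # Successors of a read are the reads starting with its (L-1)-suffix.
    return {read: list(by_prefix.get(read[1:], [])) for read in reads}


reads = ["ATGC", "TGCA", "GCAT", "TGCC"]
graph = build_read_graph(reads)
for node, successors in graph.items():
    print(node, "->", successors)
```

In this toy data set the read ``ATGC`` has two successors, which is the kind of fork an assembler has to resolve.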
...@@ -3,6 +3,12 @@ ...@@ -3,6 +3,12 @@
The :program:`seeds` command The :program:`seeds` command
============================ ============================
.. note::
Most users do not need this command, because this task is performed automatically
by the :ref:`oa buildgraph <oa_buildgraph>` command.
The :ref:`organelle assembler <oa>`'s :program:`seeds` computes the set The :ref:`organelle assembler <oa>`'s :program:`seeds` computes the set
of seed reads. The main purpose of this command is to write a new version of seed reads. The main purpose of this command is to write a new version
of the file containing the set of seed reads, because its format changed. of the file containing the set of seed reads, because its format changed.
...@@ -16,11 +22,6 @@ of the file containing the set of seed reads, because its format changed. ...@@ -16,11 +22,6 @@ of the file containing the set of seed reads, because its format changed.
executes only the red task executes only the red task
.. note::
Most users do not need this command, because this task is performed automatically
by the :ref:`oa buildgraph <oa_buildgraph>` command.
command prototype command prototype
----------------- -----------------
......
Raw sequencing results (after adapter trimming) are usually provided in the fastq format, the raw result of the assembly in fasta format and the annotated result (with CDS, tRNA, ...) in the EMBL format.
.. toctree::
:maxdepth: 2
fasta
fastq
embl
...@@ -3,26 +3,9 @@ ...@@ -3,26 +3,9 @@
The ORGanelle ASseMbler principles The ORGanelle ASseMbler principles
================================== ==================================
Sequencing strategies and file formats .. include:: ./strategy.txt
--------------------------------------
Low-coverage shotgun sequencing of genomic DNA .. include:: ./assembly-graph.txt
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The data produced by low-coverage shotgun sequencing of genomic DNA (gDNA), also known as genome skimming, are the primary data used by ``ORG.asm``. If we hypothesize that the organelle genomes represent several percent of the total gDNA, then even with a modest depth of sequencing of the nuclear genome (around 1x coverage), one can hope to get more than 100x coverage for the organelle genomes and repeated regions (such as rDNA clusters). This allows the reconstruction of organelle genomes and repeated regions for up to 48 samples loaded in the same HiSeq 2500 lane.
Raw sequencing results (after adapter trimming) are usually provided in the fastq format, the raw result of the assembly in fasta format and the annotated result (with CDS, tRNA, ...) in the EMBL format.
The file formats
^^^^^^^^^^^^^^^^
.. toctree::
:maxdepth: 2
fasta
fastq
embl
The ORGanelle ASseMbler commands The ORGanelle ASseMbler commands
...@@ -53,7 +36,6 @@ an assembling process. ...@@ -53,7 +36,6 @@ an assembling process.
- The orange boxed commands correspond to utility commands not required for the - The orange boxed commands correspond to utility commands not required for the
assembling but sometimes useful to get or restore some information. assembling but sometimes useful to get or restore some information.
The set of sub-commands can be split into several categories corresponding to The set of sub-commands can be split into several categories corresponding to
the main steps of the assembling procedure. the main steps of the assembling procedure.
...@@ -65,3 +47,11 @@ the main steps of the assembling procedure. ...@@ -65,3 +47,11 @@ the main steps of the assembling procedure.
finishing finishing
unfolding unfolding
utilities utilities
The file formats
================
.. include:: ./formats.txt
Sequencing strategy: Low-coverage shotgun sequencing of genomic DNA
-------------------------------------------------------------------
The data produced by low-coverage shotgun sequencing of genomic DNA (gDNA), also known as genome skimming, are the primary data used by ``ORG.asm``. If we hypothesize that the organelle genomes represent several percent of the total gDNA (organellar genomes can be present in more than 1000 copies in a single cell), then even with a modest depth of sequencing of the nuclear genome (around 1x coverage), one can hope to get more than 100x coverage for the organelle genomes and repeated regions (such as rDNA clusters). This allows the reconstruction of organelle genomes and repeated regions for up to 48 samples loaded in the same HiSeq 2500 lane.
For example, consider that you sequence 3 × 10^6 paired-end reads, i.e. 6 × 10^6 reads of 100 bp:
========================= ============ ============
Organelle                 Chloroplast  Mitochondria
========================= ============ ============
Fraction of reads         5%           0.5%
Organelle reads           300,000      30,000
Sequenced bases (bp)      30 × 10^6    3 × 10^6
Genome size               150 kb       16 kb
Sequencing depth          200X         187X
========================= ============ ============
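The arithmetic behind the table can be reproduced with a short sketch. The function name ``organelle_depth`` and its parameters are hypothetical; the numeric inputs are the ones given in the table.

```python
def organelle_depth(n_reads, read_length, organelle_fraction, genome_size):
    """Expected sequencing depth of an organelle genome.

    depth = (reads from the organelle * read length) / genome size
    """
    organelle_reads = n_reads * organelle_fraction
    sequenced_bases = organelle_reads * read_length
    return sequenced_bases / genome_size


# 3 million read pairs give 6 million reads of 100 bp.
n_reads = 2 * 3_000_000

chloroplast = organelle_depth(n_reads, 100, 0.05, 150_000)
mitochondria = organelle_depth(n_reads, 100, 0.005, 16_000)

print(f"chloroplast:  {chloroplast:.1f}x")
print(f"mitochondria: {mitochondria:.1f}x")
```

Both values stay well above the 100x coverage mentioned in the text, although only a small fraction of the reads comes from each organelle.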
This can also be observed in the k-mer frequency spectrum of a low-coverage shotgun sequencing of a plant genome.
The spectrum has a particular shape, with a large number of low-frequency k-mers and a bimodal distribution at intermediate and high frequencies.
This shape is explained by the mix of the low-coverage sequencing of the nuclear genome and the high-coverage sequencing of the chloroplast genome. The nuclear genome is responsible for the large number of unique or nearly unique k-mers, while the high coverage of the chloroplast genome translates into the bimodal distribution of moderately and highly frequent k-mers. The two modes are due to the large duplicated region (the Inverted Repeat) typical of chloroplast genomes.
.. figure:: ./Kmer-histogram.*
:align: center
:figwidth: 80 %
:width: 500
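A k-mer frequency spectrum such as the one plotted above can be computed along the following lines. This is a minimal sketch, not the code used to produce the figure: the function name ``kmer_spectrum`` is hypothetical, the reads are toy data, and k-mers are not merged with their reverse complements.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Map each occurrence count to the number of distinct k-mers having it."""
    kmer_counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    # Histogram of the counts: x axis = frequency, y axis = number of k-mers.
    return Counter(kmer_counts.values())


reads = ["ATGCATGC", "TGCATGCA"]  # toy reads, for illustration only
spectrum = kmer_spectrum(reads, k=4)
for frequency in sorted(spectrum):
    print(frequency, spectrum[frequency])
```

On real genome skimming data, the k-mers seen once or twice mostly come from the nuclear genome, and the high-frequency modes come from the organelle genomes.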
major = 0 major = 0
minor = '1' minor = '1'
serial= '12' serial= '13'
version = "%d.%s.%s" % (major,minor,serial) version = "%d.%s.%s" % (major,minor,serial)
--extra-index-url https://pypi.python.org/simple/ --extra-index-url https://pypi.python.org/simple/
pip>=8.0 pip>=8.0
Cython>=0.23 Cython==0.23
Sphinx>=1.3 Sphinx>=1.3