Commit d09447a7 by Eric Coissac

Start the doc

parent a65e1d5e
This source diff could not be displayed because it is too large. You can view the blob instead.
Assembling the reads
====================
.. toctree::
:maxdepth: 2
commands/buildgraph
The organelle assembler commands
================================
.. toctree::
:maxdepth: 2
preparing
assembling
finishing
unfolding
.. _oa_buildgraph:
The :program:`buildgraph` command
=================================
The :ref:`organelle assembler <oa>`'s :program:`buildgraph`
realizes the assembling of the reads by building the De Bruinj Graph which
is the central data structure used by the :ref:`organelle assembler <oa>`.
.. _oa_index:
The :program:`index` command
============================
The :ref:`organelle assembler <oa>`'s :program:`index` command indexes
lexicographicaly a set of sequence reads to make them usable by the
assembler.
The constraints of the :ref:`organelle assembler <oa>`
------------------------------------------------------
The :ref:`organelle assembler <oa>` was developed to deal with Illumina
paired-end reads. Consequently, the algortihm of the
:ref:`organelle assembler <oa>` requires that the indexed reads respect
several constraints.
- The reads must be paired.
- They must have all the same length.
Moreover:
- The read length must be odd.
- Sequences cannot contain :ref:`IUPAC <iupac_code>` ambiguity code.
- They must be formated in :ref:`fastq <fastq>`.
To be able to run on more diverse types of sequence datasets, the
:program:`index` command provides a set of options allowing to overcome these
limits and to format various kind of sequence.
These options allow for indicating:
- The sequencing strategy (paired end by default)
- Single direction sequencing
- Mate-pairs library
- The file format (fastq by default)
- Sequence file following the :ref:`fasta <fasta>` format
- File compressed by `gzip`_ (file name ending by `.gz`)
- File compressed by `bzip2`_ (file name ending by `.bz2`)
They also permit to define the read length following three strategies.
The default strategy
++++++++++++++++++++
When nothing is specified the :ref:`organelle assembler <oa>`
:program:`index <oa_index>` commandconsiders that all the reads from the
dataset have the same length. Considering this, the actual read length of
the dataset is estimated from the first read of the forward file.
If the read length is even, it is decreased by one.
During the indexing procedure, reads shorter than the limit are discarded,
read longuer than the limit are trimmed on their 3' end to fit the good
length. After the trimming reads containing :ref:`IUPAC <iupac_code>` ambiguity
code are also discarded.
The user defined read length
++++++++++++++++++++++++++++
Using the option :ref:`--length <opt_idx_length>`
The estimated read length
+++++++++++++++++++++++++
Running the :program:`index`
One of the most common way for running the :program:`index` command looks like
to the following command:
.. code-block:: bash
$ oa index --estimate-length 0.9 \
seqindex \
forward.fastq.gz reverse.fastq.gz
the :program:`index` command creates four files :
- <index>.ogx : contains information concerning the index
- <index>.ofx : contains the sequences themselves and the forward index
- <index>.orx : contains reverse index
- <index>.opx : contains read pairing data
The :ref:`organelle assembler <oa>` will need all these file to process assembling.
`<index>` represents the name of index that will be used later by the assembler.
command prototype
-----------------
.. code-block:: none
$ oa index [--single | --mate-pairs] \
[--check-ids] [--check-pairing] \
[--max-read ###] \
[--length ### | --estimate-length #.##] \
[--fasta | --forward-fasta | --reverse-fasta] \
[--no-pipe] \
<index> <forward_fastq_file> [reverse_fastq_file]
options
-------
.. program:: oa index
Sequencing strategy
+++++++++++++++++++
.. cmdoption:: --single
Single read mode.
.. cmdoption:: --mate-pairs
Indicates that the two read files were obtained using a mate pair
sequencing strategy.
Sequence file checking
++++++++++++++++++++++
.. cmdoption:: --check-ids
Checks that forward and reverse ids are identical.
The two sequence ids `seqid/1` and `seqid/2` are considered as
identical.
.. cmdoption:: --check-pairing
Ensure that forward and reverse files are correctly paired.
The pairing is checked based on the sequence identifier.
The two sequence with the ids `seqid/1` and `seqid/2` are
considered as paired.
Limit for the indexation
++++++++++++++++++++++++
.. cmdoption:: --max-read ###
`###` indicates the number of millions of reads to index. If not
specified all the reads are indexed within the limit imposed by
the program and printed at the beginning of the program trace.
.. code-block:: bash
$ oa index --max-read 4 seqindex forward.fastq reverse.fastq
Build the index with a maximum of four millons of reads.
.. _opt_idx_length:
.. cmdoption:: --length ###
`###` represents the read length to consider. Only reads
with a length greater or equal to `###` will be indexed. Reads longer
than the specified length are truncated at the specified length.
.. code-block:: bash
$ oa index --length 90 seqindex forward.fastq reverse.fastq
Indexes the `forward.fastq` and `reverse.fastq` files using only
reads longer than 90 bp.
If the :option:`--length ###` option is not used the length is estimated
from the length of the first read of the forward file or through the
:option:`--estimate-length #.##` option.
.. cmdoption:: --estimate-length #.##
`#.##` ranging between 0.0 and 1.0, indicates which fraction
of the overall dataset we want to use. When this option is used
the sequence length to index is estimated to respect this constraint.
.. code-block:: bash
$ oa index --estimate-length 0.9 seqindex forward.fastq reverse.fastq
Indexes the `forward.fastq` and `reverse.fastq` files using a length
such as at least 90% of the reads will be indexed.
Sequence file format
++++++++++++++++++++
.. cmdoption:: --fasta
Indicates than the two sequence files to index are :ref:`fasta <fasta>` files.
.. code-block:: bash
$ oa index --fasta seqindex forward.fasta reverse.fasta
.. cmdoption:: --forward-fasta
Indicates than the forward file is a fasta file
.. code-block:: bash
$ oa index --forward-fasta seqindex forward.fasta reverse.fastq
.. cmdoption:: --reverse-fasta
Indicates than the reverse file is a fasta file
.. code-block:: bash
$ oa index --reverse-fasta seqindex forward.fastq reverse.fasta
If the file names end by `.gz` or `.bz2` they are considered as compressed
respectively by the `gzip`_ or the `bzip2`_ program and are uncompressed on the
fly. The :ref:`fasta <fasta>` related options can be combined without restriction with this
feature.
.. code-block:: bash
$ oa index --reverse-fasta seqindex \
forward.fastq.gz reverse.fasta.bz2
The forward file follows the :ref:`fastq <fastq>` format and is compressed
with `gzip`_. The reverse file follow the :ref:`fasta <fasta>` format and is
compressed with `bzip2`_.
System option
+++++++++++++
.. cmdoption:: --no-pipe
By default the :ref:`organelle assembler <oa>` uses named pipes to transfer
data among programs. Using this option you can enforce to use
tempory files instead.
.. _`gzip`: http://www.gzip.org
.. _`bzip2`: http://www.bzip.org
.. _fasta:
The **fasta** format
====================
The *fasta* format is certainly the most widely used sequence file format.
This is certainly due to its great simplicity. It was originally created
for the Lipman and Pearson `FASTA program`_.
In *fasta* format a sequence is represented by a title line beginning with
a **>** character and the sequences by itself following the
:ref:`iupac <iupac_code>` code. The sequence is usually split other
severals lines of the same length (expect for the last one) ::
>my_sequence this is my pretty sequence
ACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT
GTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT
AACGACGTTGCAGTACGTTGCAGT
This is no special format for the title line excepting that this line
should be unique. Usually the first word following the **>** character
is considered as the sequence identifier. The end of the title line
corresponding to a description of the sequence.
Several sequences can be concatenated in a same file. The description of
the next sequence is just pasted at the end of the record of the previous
one ::
>sequence_A this is my first pretty sequence
ACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT
GTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT
AACGACGTTGCAGTACGTTGCAGT
>sequence_B this is my second pretty sequence
ACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT
GTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT
AACGACGTTGCAGTACGTTGCAGT
>sequence_C this is my third pretty sequence
ACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT
GTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT
AACGACGTTGCAGTACGTTGCAGT
.. _`FASTA program`: http://www.ncbi.nlm.nih.gov/pubmed/3162770?dopt=Citation
.. _fastq:
The **fastq** sequence format
=============================
.. note::
This article uses material from the Wikipedia article
`FASTQ format <http://en.wikipedia.org/wiki/FASTQ_format>`
which is released under the
`Creative Commons Attribution-Share-Alike License 3.0 <http://creativecommons.org/licenses/by-sa/3.0/>`
**fastq format** is a text-based format for storing both a biological sequence
(usually nucleotide sequence) and its corresponding quality scores.
Both the sequence letter and quality score are encoded with a single
ASCII character for brevity. It was originally developed at the
`Wellcome Trust Sanger Institute` to bundle a :ref:`fasta` sequence and its
quality data, but has recently become the *de facto* standard for storing
the output of high throughput
sequencing instruments such as the Illumina Genome
Analyzer Illumina. [1]_
Format
------
A fastq file normally uses four lines per sequence.
- Line 1 begins with a '@' character and is followed by a sequence
identifier and an *optional* description (like a
:ref:`fasta` title line).
- Line 2 is the raw sequence letters.
- Line 3 begins with a '+' character and is *optionally* followed by
the same sequence identifier (and any description) again.
- Line 4 encodes the quality values for the sequence in Line 2, and
must contain the same number of symbols as letters in the sequence.
A fastq file containing a single sequence might look like this:
::
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
The character '!' represents the lowest quality while '~' is the
highest. Here are the quality value characters in left-to-right
increasing order of quality (`ASCII`):
::
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
The original Sanger FASTQ files also allowed the sequence and quality
strings to be wrapped (split over multiple lines), but this is generally
discouraged as it can make parsing complicated due to the unfortunate
choice of "@" and "+" as markers (these characters can also occur in the
quality string).
Variations
----------
Quality
~~~~~~~
A quality value *Q* is an integer mapping of *p* (i.e., the probability
that the corresponding base call is incorrect). Two different equations
have been in use. The first is the standard Sanger variant to assess
reliability of a base call, otherwise known as Phred quality
score:
:math:`Q_\text{sanger} = -10 \, \log_{10} p`
The Solexa pipeline (i.e., the software delivered with the Illumina
Genome Analyzer) earlier used a different mapping, encoding the
odds *p*/(1-*p*) instead of the probability *p*:
:math:`Q_\text{solexa-prior to v.1.3} = -10 \, \log_{10} \frac{p}{1-p}`
Although both mappings are asymptotically identical at higher quality
values, they differ at lower quality levels (i.e., approximately *p* >
0.05, or equivalently, *Q* < 13).
|Relationship between *Q* and *p* using the Sanger (red) and Solexa
(black) equations (described above). The vertical dotted line indicates
*p* = 0.05, or equivalently, *Q* � 13.|
Encoding
~~~~~~~~
- Sanger format can encode a Phred quality
score from 0 to 93 using ASCII 33 to 126
(although in raw read data the Phred quality score rarely exceeds 60,
higher scores are possible in assemblies or read maps).
- Solexa/Illumina 1.0 format can encode a Solexa/Illumina quality score
from -5 to 62 using ASCII 59 to 126 (although in raw read
data Solexa scores from -5 to 40 only are expected)
- Starting with Illumina 1.3 and before Illumina 1.8, the format
encoded a Phred quality score from 0 to 62
using ASCII 64 to 126 (although in raw read data Phred
scores from 0 to 40 only are expected).
- Starting in Illumina 1.5 and before Illumina 1.8, the Phred scores 0
to 2 have a slightly different meaning. The values 0 and 1 are no
longer used and the value 2, encoded by ASCII 66 "B".
Sequencing Control Software, Version 2.6, Catalog # SY-960-2601, Part #
15009921 Rev. A, November
2009]\ http://watson.nci.nih.gov/solexa/Using_SCSv2.6_15009921_A.pdf\
(page 30) states the following: *If a read ends with a segment of mostly
low quality (Q15 or below), then all of the quality values in the
segment are replaced with a value of 2 (encoded as the letter B in
Illumina's text-based encoding of quality scores)... This Q2 indicator
does not predict a specific error rate, but rather indicates that a
specific final portion of the read should not be used in further
analyses.* Also, the quality score encoded as "B" letter may occur
internally within reads at least as late as pipeline version 1.6, as
shown in the following example:
::
@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCTTGAGATTTGTTGGGGGAGACATTTTTGTGATTGCCTTGAT
+HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba_^__[YBBBBBBBBBBRTT\]][]dddd`ddd^dddadd^BBBBBBBBBBBBBBBBBBBBBBBB
An alternative interpretation of this ASCII encoding has been
proposed. Also, in Illumina runs using PhiX controls, the character
'B' was observed to represent an "unknown quality score". The error rate
of 'B' reads was roughly 3 phred scores lower the mean observed score of
a given run.
- Starting in Illumina 1.8, the quality scores have basically returned
to the use of the Sanger format (Phred+33).
File extension
--------------
There is no standard file extension for a FASTQ
file, but .fq and .fastq, are commonly used.
See also
--------
- :ref:`fasta`
References
----------
.. [1]
Cock et al (2009) The Sanger FASTQ file format for sequences with
quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids
Research,
.. [2]
Illumina Quality Scores, Tobias Mann, Bioinformatics, San Diego,
Illumina `1 <http://seqanswers.com/forums/showthread.php?t=4721>`__
.. |Relationship between *Q* and *p* using the Sanger (red) and Solexa (black) equations (described above). The vertical dotted line indicates *p* = 0.05, or equivalently, *Q* � 13.| image:: Probability metrics.png
See http://en.wikipedia.org/wiki/FASTQ_format
......@@ -10,12 +10,11 @@ Contents:
.. toctree::
:maxdepth: 2
Sequencing strategies and file formats <strategies>
The ORGanelle ASeMbler ``oa`` command line tool <oa>
Assembling a mitochondrion genome <mitochondrion>
Algorithms <algorithms>
User API <userapi>
The organelle assembler indexer <orgasmi>
Indices and tables
......@@ -24,4 +23,3 @@ Indices and tables
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
.. _orgasmi:
The organelle assembler indexer
===============================
the :program:`orgasmi` program indexes lexicographicaly a set of paired Illumina
reads stored in fastq format. All the indexed reads must have the same length.
Two options of the indexer allow to index only a subset of the reads and/or
only the beginning of the reads over a specified length.
usage
-----
.. code-block:: bash
$ orgasmi -o <index> <forward_fastq_file> <reverse_fastq_file>
options
-------
.. program:: obitools
.. cmdoption:: -h
Shows the help message and exits.
.. cmdoption:: -o <index> : The name of the index ouput files
orgasmi creates four files :
- <index>.ogx : contains information concerning the index
- <index>.ofx : contains the sequences themselves and the forward index
- <index>.orx : contains reverse index
- <index>.opx : contains read pairing data
The assembler will need all these file to process assembling
.. cmdoption:: -M ###
If specified, ### represents the count in million of reads to index
.. cmdoption:: -l ###
If specified ### represents the read length to consider. Only reads
with a length greater or equal to ### will be indexed. Reads longer
than the specified length are truncated at the specified length.
If the :option:`-l <orgasmi -l>` is not used the length is estimated from
the length of the first read of the forward file.
.. _iupac_code:
The IUPAC code
==============
The International Union of Pure and Applied Chemistry (IUPAC_) defined
the standard code for representing protein or DNA sequences.
Nucleic IUPAC Code
------------------
======== =================================
**Code** **Nucleotide**
======== =================================
A Adenine
C Cytosine
G Guanine
T Thymine
U Uracil
R Purine (A or G)
Y Pyrimidine (C, T, or U)
M C or A
K T, U, or G
W T, U, or A
S C or G
B C, T, U, or G (not A)
D A, T, U, or G (not C)
H A, T, U, or C (not G)
V A, C, or G (not T, not U)
N Any base (A, C, G, T, or U)
======== =================================
Peptidic one and three letters IUPAC code
-----------------------------------------
============ ============= =======================================
**1-letter** **3-letters** **Amino acid**
============ ============= =======================================
A Ala Alanine
R Arg Arginine
N Asn Asparagine
D Asp Aspartic acid
C Cys Cysteine
Q Gln Glutamine
E Glu Glutamic acid
G Gly Glycine
H His Histidine
I Ile Isoleucine
L Leu Leucine
K Lys Lysine
M Met Methionine
F Phe Phenylalanine
P Pro Proline
S Ser Serine
T Thr Threonine
W Trp Tryptophan
Y Tyr Tyrosine
V Val Valine
B Asx Aspartic acid or Asparagine
Z Glx Glutamine or Glutamic acid
X Xaa Any amino acid
============ ============= =======================================
.. _IUPAC: http://www.iupac.org/
.. _oa:
The organelle assembler command line
====================================
.. toctree::
:maxdepth: 3
commands
Preparing the data before the assembling
========================================
.. toctree::
:maxdepth: 2
commands/index
Sequencing strategies and file formats
======================================
The file formats
----------------
.. toctree::
:maxdepth: 2
fasta
fastq
iupac
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment