Commit 8571799f by Frédéric Boyer

Reorganised paragraph

parent 7536e811
@@ -13,42 +13,9 @@ The ENA flat-file format
The entries in the database are structured so as to be usable by human readers as well as by computer programs. The explanations, descriptions, classifications and other comments are in ordinary English, and the symbols and formatting employed for the base sequences themselves have been chosen for readability. Wherever possible, symbols familiar to molecular biologists have been used. At the same time, the structure is systematic enough to allow computer programs easily to read, identify, and manipulate the various types of data included. Each entry in the database is composed of lines. Different types of lines, each with its own format, are used to record the various types of data which make up the entry. In general, fixed format items have been kept to a minimum, and a more syntax-oriented structure adopted for the lines. The two exceptions to this are the sequence data lines and the feature table lines, for which a fixed format was felt to offer significant advantages to the user. Users who write programs to process the database entries should not make any assumptions about the column placement of items on lines other than these two: all other line types are free-format.
Note that each line begins with a two-character line code, which indicates the type of information contained in the line. The currently used line types, along with their respective line codes, are listed below:
Note that each line begins with a two-character line code, which indicates the type of information contained in the line.
==================================== =================================
ID - identification (begins each entry; 1 per entry)
AC - accession number (>=1 per entry)
PR - project identifier (0 or 1 per entry)
DT - date (2 per entry)
DE - description (>=1 per entry)
KW - keyword (>=1 per entry)
OS - organism species (>=1 per entry)
OC - organism classification (>=1 per entry)
OG - organelle (0 or 1 per entry)
RN - reference number (>=1 per entry)
RC - reference comment (>=0 per entry)
RP - reference positions (>=1 per entry)
RX - reference cross-reference (>=0 per entry)
RG - reference group (>=0 per entry)
RA - reference author(s) (>=0 per entry)
RT - reference title (>=1 per entry)
RL - reference location (>=1 per entry)
DR - database cross-reference (>=0 per entry)
CC - comments or notes (>=0 per entry)
AH - assembly header (0 or 1 per entry)
AS - assembly information (0 or >=1 per entry)
FH - feature table header (2 per entry)
FT - feature table data (>=2 per entry)
XX - spacer line (many per entry)
SQ - sequence header (1 per entry)
CO - contig/construct line (0 or >=1 per entry)
bb - (blanks) sequence data (>=1 per entry)
// - termination line (ends each entry; 1 per entry)
==================================== =================================
Note that some entries will not contain all of the line types, and some line types occur many times in a single entry. As indicated, each entry begins with an identification line (ID) and ends with a terminator line (//). The various line types appear in entries in the order in which they are listed above (except for XX lines which may appear anywhere between the ID and SQ lines). A detailed description of each line type is given in the following sections.
::
Example of an entry::
ID X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
XX
@@ -151,3 +118,41 @@ Note that some entries will not contain all of the line types, and some line typ
agaagctatg atcataacta taggttgatc cttcatgtat cagtttgatg ttgagaatac 1800
tttgaattaa aagtcttttt ttattttttt aaaaaaaaaa aaaaaaaaaa aaaaaaaaa 1859
//
The currently used line types, along with their respective line codes, are listed below:
==================================== =================================
ID - identification (begins each entry; 1 per entry)
AC - accession number (>=1 per entry)
PR - project identifier (0 or 1 per entry)
DT - date (2 per entry)
DE - description (>=1 per entry)
KW - keyword (>=1 per entry)
OS - organism species (>=1 per entry)
OC - organism classification (>=1 per entry)
OG - organelle (0 or 1 per entry)
RN - reference number (>=1 per entry)
RC - reference comment (>=0 per entry)
RP - reference positions (>=1 per entry)
RX - reference cross-reference (>=0 per entry)
RG - reference group (>=0 per entry)
RA - reference author(s) (>=0 per entry)
RT - reference title (>=1 per entry)
RL - reference location (>=1 per entry)
DR - database cross-reference (>=0 per entry)
CC - comments or notes (>=0 per entry)
AH - assembly header (0 or 1 per entry)
AS - assembly information (0 or >=1 per entry)
FH - feature table header (2 per entry)
FT - feature table data (>=2 per entry)
XX - spacer line (many per entry)
SQ - sequence header (1 per entry)
CO - contig/construct line (0 or >=1 per entry)
bb - (blanks) sequence data (>=1 per entry)
// - termination line (ends each entry; 1 per entry)
==================================== =================================
Note that some entries will not contain all of the line types, and some line types occur many times in a single entry. As indicated, each entry begins with an identification line (ID) and ends with a terminator line (//). The various line types appear in entries in the order in which they are listed above (except for XX lines which may appear anywhere between the ID and SQ lines).
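The line-code structure described above can be read with very little code. The following is a minimal sketch, not a reference parser: the function names ``split_line`` and ``iter_entries`` are hypothetical, and it assumes only what the text states, namely that each line starts with a two-character line code, that sequence data lines start with blanks (labelled ``bb`` in the table), and that ``//`` terminates an entry. It makes no assumption about column placement beyond the first two characters.

```python
from typing import Iterable, Iterator, List, Tuple

Line = Tuple[str, str]  # (line code, remaining content)


def split_line(raw: str) -> Line:
    """Split one flat-file line into its two-character code and its content."""
    code = raw[:2]
    if code.strip() == "":
        # Sequence data lines begin with blanks; reuse the table's "bb" label.
        code = "bb"
    # Content is whatever follows the code, with surrounding whitespace removed.
    return code, raw[2:].strip()


def iter_entries(lines: Iterable[str]) -> Iterator[List[Line]]:
    """Group lines into entries, each ending with the '//' terminator line."""
    entry: List[Line] = []
    for raw in lines:
        raw = raw.rstrip("\n")
        if not raw.strip():
            continue  # skip completely empty lines (assumption: not meaningful)
        code, content = split_line(raw)
        entry.append((code, content))
        if code == "//":
            yield entry
            entry = []
    # Lines after the last '//' form an incomplete entry and are not yielded.
```

A caller could then count line codes per entry (for example with ``collections.Counter``) to check the cardinalities listed in the table, such as exactly one ``ID`` and one ``//`` line per entry.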