obitaxonomy.py 17 KB
Newer Older
1 2
#!/usr/local/bin/python
'''
Aurélie Bonin committed
3
:py:mod:`obitaxonomy`: manages taxonomic databases
Celine Mercier committed
4
==================================================
5 6 7

.. codeauthor:: Eric Coissac <eric.coissac@metabarcoding.org> and Celine Mercier <celine.mercier@metabarcoding.org>

Aurélie Bonin committed
8
The :py:mod:`obitaxonomy` command can generate an ecoPCR database from a NCBI taxdump 
Celine Mercier committed
9
(see NCBI ftp site) and allows managing the taxonomic data contained in both types of 
10
database.
11

12
Several types of editing are possible:
13

14
**Adding a taxon to the database**
15

16
    The new taxon is described by three values: 
Aurélie Bonin committed
17
    its scientific name, its taxonomic rank, and the *taxid* of its first ancestor.
18 19
    Done by using the ``-a`` option. 
    
Aurélie Bonin committed
20
**Deleting a taxon from the database**
21

Aurélie Bonin committed
22
    Erases a local taxon. Done by using the ``-D`` option and specifying a *taxid*. 
23 24
    
**Adding a species to the database**
25

26 27 28 29
    The genus of the species must already exist in the database. The species will be 
    added under its genus. Done by using the ``-s`` option and specifying a species 
    scientific name. 
    
Aurélie Bonin committed
30
**Adding a preferred scientific name for a taxon in the database**
31

32
    Adds a preferred name for a taxon in the taxonomy, by specifying the new favorite 
Aurélie Bonin committed
33
    name and the *taxid* of the taxon whose preferred name should be changed. 
34 35
    Done by using the ``-f`` option.
    
Frédéric Boyer committed
36
**Adding all the taxa from a sequence file in the ``OBITools`` extended :doc:`fasta <../fasta>` format to the database**
37

Frédéric Boyer committed
38
    All the taxon from a file in the ``OBITools`` extended :doc:`fasta <../fasta>` format, and eventually their ancestors, are added to the 
39 40 41 42 43 44 45 46 47
    taxonomy database.
    
    The header of each sequence record must contain the attribute defined by the 
    ``-k`` option (default key: ``species_name``), whose value is the scientific name 
    of the taxon to be added.
    
    A taxonomic path for each sequence record can be specified with the ``-p`` option, 
    as the attribute key that contains the taxonomic path of the taxon to be added. 
    
Aurélie Bonin committed
48 49
    A restricting ancestor can be specified with the ``-A`` option, either as a *taxid* 
    (integer) or a key (string). If it is a *taxid*, this *taxid* is the default *taxid* 
50
    under which the new taxon is added if none of his ancestors are specified or can 
Aurélie Bonin committed
51
    be found. If it is a key, :py:mod:`obitaxonomy` looks for the ancestor *taxid* in 
52 53 54 55 56 57 58 59 60
    the corresponding attribute, and the new taxon is systematically added under this 
    ancestor. By default, the restricting ancestor is the root of the taxonomic tree for
    all the new taxa.
    
    If neither a path nor an ancestor is specified in the header of the sequence record,
    :py:mod:`obitaxonomy` tries to read the taxon name as a species name and to find the 
    genus in the taxonomic database. If the genus is found, the new taxon is added under it. 
    If not, it is added under the restricting ancestor. 
    
Aurélie Bonin committed
61 62
    It is highly recommended checking what was exactly done by reading the output, 
    since :py:mod:`obitaxonomy` uses *ad hoc* parsing and decision rules.
63 64 65 66 67
    
    Done by using the ``-F`` option. 

**Notes:**

Aurélie Bonin committed
68
- When a taxon is added, a new *taxid* is assigned to it. The minimum for the new *taxids* 
69 70 71
  can be specified by the ``-m`` option and is equal to 10000000 by default.

- For each modification, a line is printed with details on what was done.
72

73 74
'''

75

76 77
from obitools.options.taxonomyfilter import addTaxonomyDBOptions,loadTaxonomyDatabase
from obitools.options import getOptionManager
78
from obitools.ecopcr.taxonomy import ecoTaxonomyWriter
79
from obitools.fasta import fastaIterator
80 81 82
import sys


83
def addTaxonFromFile(name, rank, parent, options) :
84
    
85
    taxid = options.taxonomy.addLocalTaxon(name, rank, parent, options.taxashift)
86 87
    taxon = options.taxonomy.findTaxonByTaxid(taxid)
    parent= options.taxonomy._taxonomy[taxon[2]]
88
    
89 90
#    if options.write == '' :
    print>>sys.stderr, "added : %-40s\t%-15s\t%-8d\t->\t%s [%d] (%s)" % (taxon[3],options.taxonomy._ranks[taxon[1]],
91 92
                                                                            taxon[0],
                                                                            parent[3],parent[0],options.taxonomy._ranks[parent[1]])
93 94 95 96
#    else :
#        print>>options.write, "added : %-40s\t%-15s\t%-8d\t->\t%s [%d] (%s)" % (taxon[3],options.taxonomy._ranks[taxon[1]],
#                                                                           taxon[0],
#                                                                           parent[3],parent[0],options.taxonomy._ranks[parent[1]])
97 98 99 100 101 102 103 104 105 106
    return taxid


def numberInStr(s) :
    containsNumber = False
    for c in s :
        if c.isdigit() :
            containsNumber = True
    return containsNumber

107

108 109 110 111 112
def editTaxonomyOptions(optionManager):
    optionManager.add_option('-a','--add-taxon',
                             action="append", dest="newtaxon",
                             metavar="<taxon_name>:rank:parent",
                             default=[],
113
                             help="Adds a new taxon to the taxonomy. The new taxon "
114
                                  "is described by three values separated by colons: "
115 116 117
                                  "the scientific name, the rank of the new taxon, "
                                  "the taxid of the parent taxon")
    
Eric Coissac committed
118 119
    optionManager.add_option('-D','--delete-local-taxon',
                             action="append", dest="deltaxon",
Celine Mercier committed
120
                             metavar="<TAXID>",
Eric Coissac committed
121 122 123
                             default=[],
                             help="Erase a local taxon")

124 125
    optionManager.add_option('-s','--add-species',
                             action="append", dest="newspecies",
Celine Mercier committed
126
                             metavar="<SPECIES_NAME>",
127
                             default=[],
Celine Mercier committed
128
                             help="Adds a new species to the taxonomy. The new species "
129 130
                                  "is described by its scientific name")
    
131 132 133 134 135
    optionManager.add_option('-F','--add-file',
                             action="store", dest="species_file",
                             metavar="<file name>",
                             default=None,
                             help="Add all the species from a fasta file to the taxonomy. The header of"
136
                                  " the sequences must contain the field defined by the -k option")
137
    
138 139 140
    optionManager.add_option('-k','--key_name',
                             action="store", dest="key_name",
                             metavar="<key name>",
141
                             default='species_name',
142 143
                             help="Name of the attribute key used to find the species names in the headers "
                                  "when the -F option is used. "
144 145
                                  "Default = 'species_name'")
    
146 147 148 149 150
    optionManager.add_option('-f','--add-favorite-name',
                             action="append", dest="newname",
                             metavar="<taxon_name>:taxid",
                             default=[],
                             help="Add a new favorite name to the taxonomy. The new name "
Celine Mercier committed
151
                                  "is described by two values separated by a colon. "
152 153 154 155
                                  "the new favorite name and the taxid of the taxon")
                             
    optionManager.add_option('-m','--min-taxid',
                             action="store", dest="taxashift",
156
                             type="int",
157 158 159
                             metavar="####",
                             default=10000000,
                             help="minimal taxid for the newly added taxid")
160 161 162 163
    
    optionManager.add_option('-A','--restricting_ancestor',
                             action="store", dest="res_anc",
                             type="str",
Celine Mercier committed
164
                             metavar="<ANCESTOR>",
165 166 167 168 169 170 171
                             default='',
                             help="works with the -F option. Can be a word or a taxid (number). Enables to restrict the "
                                  "adding of taxids under a specified ancestor. If it's a word, it's the field containing "
                                  "the ancestor's taxid in each sequence's header (can be different for each sequence). If "
                                  "it's a number, it's the taxid of the ancestor (in which case it's the same for all the sequences)."
                                  " All the sequences in the file for which the genus can't be found will be added under this ancestor.")
    
172 173 174 175 176 177
#    optionManager.add_option('-w','--write_in_file',
#                             action="store", dest="write",
#                             metavar="<write_in_file>",
#                             type = "str", default='',
#                             help="works with the -F option. Writes all the taxa added in the specified file instead of in the console screen."
#                             " Useful for big and/or problematic files.")
178 179 180 181 182 183 184
    
    optionManager.add_option('-p','--path',
                             action="store", dest="path",
                             type="str",
                             metavar="<path>",
                             default='',
                             help="works with the -F option. Field name for the taxonomy path of the taxa if they are in the headers of the sequences. "
185 186
                             "Must be of the form 'Fungi,Agaricomycetes,Thelephorales,Thelephoraceae' with the highest ancestors"
                             " first and ',' as separators between ancestors")
187
    
188 189 190 191 192 193
#    optionManager.add_option('-P','--force_ancestor',
#                             action="store_true", dest="force_ancestor",
#                             metavar="<force_ancestor>",
#                             default=False,
#                             help="works with the -A option when the ancestor is in the header. Forces the adding of the species under the ancestor specified."
#                             " /!\ the ancestor must exist. Use taxonomy paths (-p option) if you want the ancestor(s) to be created too.")
194
                             
195 196
if __name__ == '__main__':
    
197
    optionParser = getOptionManager([addTaxonomyDBOptions,editTaxonomyOptions])
198 199 200 201 202

    (options, entries) = optionParser()
    
    loadTaxonomyDatabase(options)
    
203 204
    localdata=False
    
205 206
#     if options.write != '' :
#         options.write = open(options.write, 'w')
207
    
208
    for t in options.newtaxon:
Eric Coissac committed
209
        tx = t.split(':')
210 211 212 213 214 215 216
        taxid = options.taxonomy.addLocalTaxon(tx[0].strip(),tx[1],tx[2],options.taxashift)
        taxon = options.taxonomy.findTaxonByTaxid(taxid)
        parent= options.taxonomy._taxonomy[taxon[2]]
        print "added : %-40s\t%-15s\t%-8d\t->\t%s [%d] (%s)" % (taxon[3],options.taxonomy._ranks[taxon[1]],
                                                     taxon[0],
                                                     parent[3],parent[0],options.taxonomy._ranks[parent[1]])
        localdata=True
Eric Coissac committed
217 218 219 220 221 222 223
    
#    for t in options.deltaxon:
#        tx = int(t)
#        taxon = options.taxonomy.removeLocalTaxon(tx)
#        print "removed : %-40s\t%-15s\t%-8d" % (taxon[3],options.taxonomy._ranks[taxon[1]],
#                                                     taxon[0])
#        localdata=True
224 225 226 227
    
    
    if options.species_file != None :
        
228 229
        useless_words = ['fungal','fungi','endophyte','unknown','mycorrhizal','uncultured','Uncultured','ectomycorrhiza', \
                         'ectomycorrhizal','mycorrhizal','vouchered','unidentified','bacterium','Bacterium']
230
        
231
        if options.res_anc == '' :
232
            restricting_ancestor = 1
233
            resAncInHeader = False
234 235
        elif options.res_anc.isdigit() :
            restricting_ancestor = int(options.res_anc)
236 237 238 239
            resAncInHeader = False
        else :
            resAncInHeader = True
            
240 241
        for seq in fastaIterator(options.species_file) :
            
242 243 244 245 246
            if resAncInHeader :
                if options.res_anc in seq :
                    restricting_ancestor = int(seq[options.res_anc])
                else :
                    restricting_ancestor = 1
247
            
248
            t = seq[options.key_name]
249 250 251 252 253 254 255 256 257
      
            key_error = False
            taxid = None
            # check if the taxon isn't already in the taxonomy with the right ancestor
            try :
                possible_taxids = options.taxonomy.findTaxonByName(t)
                for p in possible_taxids :
                    if options.taxonomy.isAncestor(restricting_ancestor, p[0]) :
                        taxid = p[0]
258
                        
259 260 261 262 263 264 265 266 267 268 269
            except KeyError :
                key_error = True
                
            if key_error or taxid is None :
                
                if (resAncInHeader and options.res_anc in seq) :
                    taxid = addTaxonFromFile(t,'species',restricting_ancestor,options)
           
                elif options.path != '' :   
                    previous = options.taxonomy.findTaxonByTaxid(restricting_ancestor)
                    if seq[options.path] != '' :
270
                        ancestors = [a for a in seq[options.path].split(',')]
271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290
                        if ancestors[-1] != t :
                            ancestors.append(t)
                    else :     # useful when data is from UNITE databases but could disappear
                        if len(t.split(' ')) >= 2 and not numberInStr(t) :
                            genus, trash = t.split(" ",1)
                            ancestors = [genus, t]
                        else :
                            ancestors = [t]
                    for a in ancestors :
                        try:
                            possible_previous = options.taxonomy.findTaxonByName(a)
                            keyError = True
                            for p in possible_previous :
                                if options.taxonomy.isAncestor(restricting_ancestor, p[0]) :
                                    previous = p
                                    keyError = False
                            if keyError :
                                raise KeyError()
                            
                        except KeyError :
291 292 293 294 295 296 297 298
                            if (len(ancestors) > 1 and a == ancestors[-2] and len(ancestors[-1].split(' ')) >= 2 and ((not numberInStr(a)) or 'sp' in a.split(' '))) :      #a voirrrrr, trop restrictif ?
                                rank = 'genus'
                            elif a == ancestors[-1] :
                                rank = 'species'
                            else :
                                rank = 'no rank'
                            taxid = addTaxonFromFile(a,rank,previous[0],options)
                            previous = (taxid, options.taxonomy.findRankByName(rank))
299
                        
300 301
                else :
            
302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325
                    if (len(t.split(' ')) >= 2 and (not numberInStr(t)  or 'sp' in t.split(' ') or t[0].isupper()) \
                        and t.split(' ')[0] not in useless_words) :
                        
                        genus,species = t.split(" ",1)
                        
                        try :
                            possible_genuses = options.taxonomy.findTaxonByName(genus)
                            genus_taxid = None
                            for g in possible_genuses :
                                if options.taxonomy.isAncestor(restricting_ancestor, g[0]) :
                                    genus_taxid = g[0]
                        except KeyError :
                            genus_taxid = addTaxonFromFile(genus,'genus',restricting_ancestor,options)
                        
                        if genus_taxid is None :    # Genuses matching the name were found but they weren't under the restricting ancestor
                            parent = restricting_ancestor
                        else :
                            parent = genus_taxid
                        taxid = addTaxonFromFile(t, 'species', parent, options)
                        
                    else :
                        taxid = addTaxonFromFile(t, 'species', restricting_ancestor, options)
                
                localdata=True
326 327 328 329
                
#            seq['taxid'] = taxid
#            print formatFasta(seq)
            
330

331 332 333
        
    for t in options.newspecies:
        genus,species = t.split(" ",1)
334
        parent = options.taxonomy.findTaxonByName(genus)[0]
335 336 337 338 339 340 341 342
        taxid = options.taxonomy.addLocalTaxon(t,'species',parent[0],options.taxashift)
        taxon = options.taxonomy.findTaxonByTaxid(taxid)
        parent= options.taxonomy._taxonomy[taxon[2]]
        print "added : %-40s\t%-15s\t%-8d\t->\t%s [%d] (%s)" % (taxon[3],options.taxonomy._ranks[taxon[1]],
                                                     taxon[0],
                                                     parent[3],parent[0],options.taxonomy._ranks[parent[1]])
        localdata=True

343 344 345
    for t in options.newname:
        tx = t.split(':')
        taxid = options.taxonomy.addPreferedName(int(tx[1]), tx[0].strip())
346 347 348
        print "name : %8d\t->\t%s" % (taxid,options.taxonomy.getPreferedName(taxid))
             
    ecoTaxonomyWriter(options.ecodb,options.taxonomy,onlyLocal=True)
349

350