ngsfilter.py 16.3 KB
Newer Older
Eric Coissac's avatar
Eric Coissac committed
1 2
#!/usr/local/bin/python
'''
Aurélie Bonin's avatar
Aurélie Bonin committed
3
:py:mod:`ngsfilter` : Assigns sequence records to the corresponding experiment/sample based on DNA tags and primers
4
===================================================================================================================
Eric Coissac's avatar
Eric Coissac committed
5

Frédéric Boyer's avatar
Frédéric Boyer committed
6 7
.. codeauthor:: Eric Coissac <eric.coissac@metabarcoding.org>

8 9 10
To distinguish between sequences from different PCR products pooled in the same sequencing library, pairs of small DNA 
sequences (call tags, see the :py:mod:`oligoTag` command and its associated paper for more informations on the design 
of such tags) can be concatenated to the PCR primers. 
Frédéric Boyer's avatar
Frédéric Boyer committed
11

12 13 14
:py:mod:`ngsfilter` takes as input sequence record files and a file describing the DNA tags and primers sequences used 
for each PCR sample. :py:mod:`ngsfilter` allows to demultiplex sequence records file by identifying these DNA tags and 
the primers.
Frédéric Boyer's avatar
Frédéric Boyer committed
15

16 17 18 19
:py:mod:`ngsfilter` requires a sample description file containing the description of the primers and tags associated 
to each sample (specified by option ``-t``). The sample description file is a text file where each line describes one 
sample. Columns are separated by space or tab characters. Lines beginning with the '#' character will be considered 
as commentary lines and will simply be ignored by :py:mod:`ngsfilter`. 
Frédéric Boyer's avatar
Frédéric Boyer committed
20

21
Here is an example of a sample description file::
Frédéric Boyer's avatar
Frédéric Boyer committed
22
    
23 24 25 26 27 28 29
    #exp   sample     tags                   forward_primer       reverse_primer              extra_information
    gh     01_11a     cacgcagtc:cacgcatcg    GGGCAATCCTGAGCCAA    CCATTGAGTCTCTGCACCTATC    F @ community=Festuca; bucket=1; extraction=1;
    gh     01_12a     cacgcatcg:cacgcagtc    GGGCAATCCTGAGCCAA    CCATTGAGTCTCTGCACCTATC    F @ community=Festuca; bucket=1; extraction=2;
    gh     01_21a     cacgcgcat:cacgctact    GGGCAATCCTGAGCCAA    CCATTGAGTCTCTGCACCTATC    F @ community=Festuca; bucket=2; extraction=1;
    gh     01_22a     cacgctact:cacgcgcat    GGGCAATCCTGAGCCAA    CCATTGAGTCTCTGCACCTATC    F @ community=Festuca; bucket=2; extraction=2;
    gh     02_11a     cacgctgag:cacgtacga    GGGCAATCCTGAGCCAA    CCATTGAGTCTCTGCACCTATC    F @ community=Festuca; bucket=1; extraction=1;
    gh     02_12a     cacgtacga:cacgctgag    GGGCAATCCTGAGCCAA    CCATTGAGTCTCTGCACCTATC    F @ community=Festuca; bucket=1; extraction=2;
Frédéric Boyer's avatar
Frédéric Boyer committed
30 31


32 33 34 35
The results consist of sequence records, printed on the standard output, with their sequence trimmed of the primers and 
tags and annotated with the corresponding experiment and sample (and possibly some extra informations). Sequences for 
which the tags and primers have not been well identified, and which are thus unassigned to any sample, are stored in a 
file if option ``-u`` is specified and tagged as erroneous sequences (``error`` attribute) by :py:mod:`ngsfilter`. 
Eric Coissac's avatar
Eric Coissac committed
36
'''
37

38
from obitools import NucSequence, DNAComplementSequence
Eric Coissac's avatar
Eric Coissac committed
39 40
from string import lower

41
import sys
Eric Coissac's avatar
Eric Coissac committed
42
  
Eric Coissac's avatar
Eric Coissac committed
43 44 45 46
import math

from obitools.options import getOptionManager
from obitools.utils import ColumnFile
47
from obitools.align import FreeEndGapFullMatch
Eric Coissac's avatar
Eric Coissac committed
48
from obitools.format.options import addInOutputOption, sequenceWriterGenerator
Eric Coissac's avatar
Eric Coissac committed
49 50 51 52 53



def addNGSOptions(optionManager):
    
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
    group = optionManager.add_option_group('ngsfilter specific options')
    group.add_option('-t','--tag-list',
                     action="store", dest="taglist",
                     metavar="<FILENAME>",
                     type="string",
                     default=None,
                     help="File containing the samples definition (with tags, primers, sample names,...)")
    
    group.add_option('-u','--unidentified',
                     action="store", dest="unidentified",
                     metavar="<FILENAME>",
                     type="string",
                     default=None,
                     help="Filename used to store the sequences unassigned to any sample")
    
    group.add_option('-e','--error',
                     action="store", dest="error",
                     metavar="###",
                     type="int",
                     default=2,
                     help="Number of errors allowed for matching primers [default = 2]")
75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98
    

class Primer:
    
    collection={}
    
    def __init__(self,sequence,taglength,direct=True,error=2,verbose=False):
        '''
            
        @param sequence:
        @type sequence:
        @param direct:
        @type direct:
        '''
        
        assert sequence not in Primer.collection        \
            or Primer.collection[sequence]==taglength,  \
            "Primer %s must always be used with tags of the same length" % sequence
            
        Primer.collection[sequence]=taglength
        
        self.raw=sequence
        self.sequence = NucSequence('primer',sequence)
        self.lseq = len(self.sequence)
99
        self.align=FreeEndGapFullMatch()
100 101 102 103
        self.align.match=4
        self.align.mismatch=-2
        self.align.opengap=-2
        self.align.extgap=-2
104 105
        self.error=error
        self.minscore = (self.lseq-error) * self.align.match + error * self.align.mismatch
106 107
        if verbose:
            print >>sys.stderr,sequence,":",self.lseq,"*",self.align.match,"+",error,"*",self.align.mismatch,"=",self.minscore
108 109 110 111 112 113 114 115 116 117
        self.taglength=taglength
        
        self.align.seqB=self.sequence
        
        self.direct = direct
        self.verbose=verbose
        
    def complement(self):
        p = Primer(self.raw,
                  self.taglength,
118 119
                  not self.direct,verbose=self.verbose,
                  error=self.error)
120 121 122 123 124 125 126 127 128 129 130 131 132
        p.sequence=p.sequence.complement()
        p.align.seqB=p.sequence
        return p
    
    def __hash__(self):
        return hash(str(self.raw))
    
    def __eq__(self,primer):
        return self.raw==primer.raw 
    
    def __call__(self,sequence):
        if len(sequence) <= self.lseq:
            return None
133 134
        if self.verbose:
            print >>sys.stderr,len(sequence) , self.lseq,len(sequence) < self.lseq
135 136 137 138 139 140 141 142 143 144
        self.align.seqA=sequence
        ali=self.align()
        if self.verbose:
            print >>sys.stderr,ali
            print >>sys.stderr,"Score : %d  Minscore : %d \n" %(ali.score,self.minscore)
            
        if ali.score >= self.minscore:
            score = ali.score
            start = ali[1].gaps[0][1]
            end = len(ali[1])-ali[1].gaps[-1][1]
145 146 147 148 149 150
            if self.taglength is not None:
                if isinstance(self.sequence, DNAComplementSequence):
                    if (len(sequence)-end) >= self.taglength:
                        tag=str(sequence[end:end+self.taglength].complement())
                    else:
                        tag=None
151
                else:
152 153 154 155
                    if start >= self.taglength:                
                        tag=str(sequence[start - self.taglength:start])
                    else:
                        tag=None
156
            else:
157
                tag=None
158 159 160 161 162 163 164 165 166 167 168
                
            return score,start,end,tag

        return None 
    
    def __str__(self):
        return "%s: %s" % ({True:'D',False:'R'}[self.direct],self.raw)
    
    __repr__=__str__
        
    
169 170 171
def tagpair(x):
    x=tuple(lower(y.strip()) for y in x.split(':'))
    if len(x)==1:
172
        x = (x[0],x[0])
173
    return x
Eric Coissac's avatar
Eric Coissac committed
174 175 176 177 178 179

def readTagfile(filename):
    """
    data file describing tags and primers for each sample
    is a space separated tabular file following this format
    
180
    experiment sample forward_tag reverse_tag forward_primer reverse_primer partial
Eric Coissac's avatar
Eric Coissac committed
181 182
    
    
183
    tags can be specified as - if no tag are used
Eric Coissac's avatar
Eric Coissac committed
184
    """
185

Eric Coissac's avatar
Eric Coissac committed
186
    tab=ColumnFile(filename,strip=True,
187
                            types=(str,str,tagpair,lower,lower,bool),
188
                            head=('experiment','sample',
189
                                  'tags',
190 191
                                  'forward_primer','reverse_primer',
                                  'partial'),
Eric Coissac's avatar
Eric Coissac committed
192 193 194
                            skip="#",
                            extra="@")
    
195 196
    primers = {}

Eric Coissac's avatar
Eric Coissac committed
197
    for p in tab:
198
        forward=Primer(p['forward_primer'],
199
                       len(p['tags'][0]) if p['tags'][0]!='-' else None,
200
                       True,
201
                       error=options.error,verbose=options.debug)
Eric Coissac's avatar
Eric Coissac committed
202
        
203 204
        fp = primers.get(forward,{})
        primers[forward]=fp
Eric Coissac's avatar
Eric Coissac committed
205
        
206
        reverse=Primer(p['reverse_primer'],
207
                       len(p['tags'][1]) if p['tags'][1]!='-' else None,
208
                       False,
209
                       error=options.error,verbose=options.debug)
210 211 212 213 214 215 216 217 218 219 220 221
        
        rp = primers.get(reverse,{})
        primers[reverse]=rp
        
        cf=forward.complement()
        cr=reverse.complement()
        
        dpp=fp.get(cr,{})
        fp[cr]=dpp
        
        rpp=rp.get(cf,{})
        rp[cf]=rpp
Eric Coissac's avatar
Eric Coissac committed
222
            
223 224
        tags = (p['tags'][0] if p['tags'][0]!='-' else None,
                p['tags'][1] if p['tags'][1]!='-' else None)
Eric Coissac's avatar
Eric Coissac committed
225
        
226 227 228 229 230 231 232 233 234 235 236
        assert tags not in dpp, \
               "tag pair %s is already used with primer pairs : (%s,%s)" % (str(tags),forward,reverse)
               
        extras = p.get('__extra__',{})
        data   ={'experiment':p['experiment'],
                   'sample':    p['sample']
                }
        data.update(extras)
        
        dpp[tags]=data
        rpp[tags]=data
Eric Coissac's avatar
Eric Coissac committed
237 238
        
                
239 240 241 242 243
        if p['partial']:
            dpartial = fp.get(None,{})
            fp[None]=dpartial
            rpartial = rp.get(None,{})
            rp[None]=rpartial
Eric Coissac's avatar
Eric Coissac committed
244
            
245 246
            dt = [x for x in dpartial if x[0]==tags[0]]
            rt = [x for x in rpartial if x[1]==tags[1]]
Eric Coissac's avatar
Eric Coissac committed
247
            
248 249 250 251 252
            assert not(dt) and not(rt), \
                "partial fragment are not usable with primer pair : (%s,%s)" % (forward,reverse)
                
            dpartial[tags]=data
            rpartial[tags]=data
Eric Coissac's avatar
Eric Coissac committed
253

254 255
    return primers
            
Eric Coissac's avatar
Eric Coissac committed
256 257

def annotate(sequence,options):
258 259 260 261 262 263 264 265 266 267 268 269
        
    def sortMatch(m1,m2):
        if m1[1] is None and m2[1] is None:
            return 0
        
        if m1[1] is None:
            return 1
        
        if m2[1] is None:
            return -1
        
        return cmp(m1[1][1],m2[1][2])
Eric Coissac's avatar
Eric Coissac committed
270 271 272 273 274 275 276 277 278 279 280 281
    
    if hasattr(sequence, "quality"):
        q = -reduce(lambda x,y:x+y,(math.log10(z) for z in sequence.quality),0)/len(sequence.quality)*10
        sequence['avg_quality']=q
        q = -reduce(lambda x,y:x+y,(math.log10(z) for z in sequence.quality[0:10]),0)
        sequence['head_quality']=q
        if len(sequence.quality[10:-10]) :
            q = -reduce(lambda x,y:x+y,(math.log10(z) for z in sequence.quality[10:-10]),0)/len(sequence.quality[10:-10])*10
            sequence['mid_quality']=q
        q = -reduce(lambda x,y:x+y,(math.log10(z) for z in sequence.quality[-10:]),0)
        sequence['tail_quality']=q
        
282
    primers = options.primers
283 284
    if options.debug:
        print >>sys.stderr,"directmatch"
285 286 287 288 289 290
    directmatch = [(p,p(sequence)) for p in primers]
    
    
    directmatch.sort(cmp=sortMatch)
    directmatch=directmatch[0] if directmatch[0][1] is not None else None
    
291 292
    if options.debug:
        print  >>sys.stderr,">>>>",directmatch
293 294 295 296 297 298
    if directmatch is None:
        sequence['error']='No primer match'
        return False,sequence
    
    match=str(sequence[directmatch[1][1]:directmatch[1][2]])
    
299
    sequence['seq_length_ori']=len(sequence)
300 301 302 303 304 305 306 307
    
    sequence = sequence[directmatch[1][2]:]
    
    if directmatch[0].direct:
        sequence['direction']='forward'
        sequence['forward_score']=directmatch[1][0]
        sequence['forward_primer']=directmatch[0].raw
        sequence['forward_match']=match
Eric Coissac's avatar
Eric Coissac committed
308
        
309 310 311 312 313
    else:
        sequence['direction']='reverse'
        sequence['reverse_score']=directmatch[1][0]
        sequence['reverse_primer']=directmatch[0].raw
        sequence['reverse_match']=match
314
        
315 316 317
    del sequence['cut']
    
    primers = options.primers[directmatch[0]]
318 319
    if options.debug:
        print  >>sys.stderr,"reverse match"
320 321 322 323
    reversematch = [(p,p(sequence)) for p in primers if p is not None]
    reversematch.sort(cmp=sortMatch)
    reversematch = reversematch[0] if reversematch[0][1] is not None else None
    
324 325
    if options.debug:
        print  >>sys.stderr,"<<<<",reversematch
326 327 328 329 330 331 332 333 334 335 336
    if reversematch is None and None not in primers:
        if directmatch[0].direct:
            message = 'No reverse primer match'
        else:
            message = 'No direct primer match'
                  
        sequence['error']=message
        return False,sequence
    
    if reversematch is None:
        sequence['status']='partial'
Eric Coissac's avatar
Eric Coissac committed
337
        
338 339
        if directmatch[0].direct:
            tags=(directmatch[1][3],None)
Eric Coissac's avatar
Eric Coissac committed
340
        else:
341
            tags=(None,directmatch[1][3])
Eric Coissac's avatar
Eric Coissac committed
342
            
343
        samples = primers[None]
344
            
345 346
    else:
        sequence['status']='full'
Eric Coissac's avatar
Eric Coissac committed
347
        
348 349 350 351 352 353 354 355
        match=str(sequence[reversematch[1][1]:reversematch[1][2]].complement())
        sequence = sequence[0:reversematch[1][1]]
        
        if directmatch[0].direct:
            tags=(directmatch[1][3],reversematch[1][3])
            sequence['reverse_score']=reversematch[1][0]
            sequence['reverse_primer']=reversematch[0].raw
            sequence['reverse_match']=match
356
            sequence['forward_tag']=tags[0]
357
            sequence['reverse_tag']=tags[1]
Eric Coissac's avatar
Eric Coissac committed
358 359
            
        else:
360 361 362 363 364 365
            tags=(reversematch[1][3],directmatch[1][3])
            sequence['forward_score']=reversematch[1][0]
            sequence['forward_primer']=reversematch[0].raw
            sequence['forward_match']=match
        
        del sequence['cut']
366
        sequence['forward_tag']=tags[0]
367
        sequence['reverse_tag']=tags[1]
Eric Coissac's avatar
Eric Coissac committed
368

369
        samples = primers[reversematch[0]]
Eric Coissac's avatar
Eric Coissac committed
370
        
371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
    
    if not directmatch[0].direct:
        sequence=sequence.complement()
        del sequence['complemented']

    sample=None

    if tags[0] is not None:                                     # Direct  tag known
        if tags[1] is not None:                                 # Reverse tag known
            sample = samples.get(tags,None)             
        else:                                                   # Reverse tag known
            s=[samples[x] for x in samples if x[0]==tags[0]]
            if len(s)==1:
                sample=s[0]
            elif len(s)>1:
                sequence['error']='multiple samples match tags'
                return False,sequence
            else:
                sample=None
    else:                                                       # Direct tag unknown
        if tags[1] is not None:                                 # Reverse tag known
            s=[samples[x] for x in samples if x[1]==tags[1]]
            if len(s)==1:
                sample=s[0]
            elif len(s)>1:
                sequence['error']='multiple samples match tags'
                return False,sequence
            else:                                               # Reverse tag known
                sample=None
                
Eric Coissac's avatar
Eric Coissac committed
401
            
402 403 404 405 406
    if sample is None:
        sequence['error']="Cannot assign sequence to a sample"
        return False,sequence
    
    sequence._info.update(sample)
407
    sequence['seq_length']=len(sequence)
408 409
    
    return True,sequence
Eric Coissac's avatar
Eric Coissac committed
410 411 412 413 414


if __name__ == '__main__':
    
    
415
    optionParser = getOptionManager([addNGSOptions,addInOutputOption], progdoc=__doc__)
Eric Coissac's avatar
Eric Coissac committed
416 417 418 419
                                    
    
    (options, entries) = optionParser()

420 421
#    assert options.direct is not None or options.taglist is not None, \
#         "you must specify option -d ou -t"
Eric Coissac's avatar
Eric Coissac committed
422
         
423 424 425 426 427 428 429 430 431 432 433 434 435 436
    assert options.taglist is not None,"you must specify option  -t"

#    if options.taglist is not None:
    primers=readTagfile(options.taglist)
#TODO: Patch when no taglists
#    else:
#        options.direct=options.direct.lower()
#        options.reverse=options.reverse.lower()
#        primers={options.direct:(options.taglength,{})}
#        if options.reverse is not None:
#            reverse = options.reverse
#        else:
#            reverse = '-'
#        primers[options.direct][1][reverse]={'-':('-','-',True,None)}
Eric Coissac's avatar
Eric Coissac committed
437 438 439 440 441 442
        
    options.primers=primers
        
    if options.unidentified is not None:
        unidentified = open(options.unidentified,"w")
    
Eric Coissac's avatar
Eric Coissac committed
443
    writer = sequenceWriterGenerator(options)
444

Eric Coissac's avatar
Eric Coissac committed
445
    if options.unidentified is not None:
446
        unidentified = sequenceWriterGenerator(options,open(options.unidentified,"w"))
Eric Coissac's avatar
Eric Coissac committed
447 448 449
    else :
        unidentified = None
        
Eric Coissac's avatar
Eric Coissac committed
450 451 452
    for seq in entries:
        good,seq = annotate(seq,options)
        if good:
Eric Coissac's avatar
Eric Coissac committed
453 454 455
            writer(seq)
        elif unidentified is not None:
            unidentified(seq)
Eric Coissac's avatar
Eric Coissac committed
456 457
            
    
458