oa buildgraph - clarify whether seeds must be proteins
The help text suggests the seeds must be proteins, but the tool appears to have built-in nucleotide seed sets, and the next argument suggests the seed reads can be DNA:
$ oa buildgraph -h
...
  --seeds seeds     protein seeds; either a fasta file containing seeds
                    proteic sequences or internal set of seeds among
                    ['nucrRNAAHypogastrura', 'nucrRNAArabidopsis',
                    'protChloroArabidopsis', 'protMitoCapra',
                    'protMitoMachaon']
  --kup ORGASM:KUP  The word size used to identify the seed reads
                    [default: protein=4, DNA=12]
Reading python/orgasm/indexer/_orgasm.pyx, the code appears to auto-detect protein vs DNA seeds:
    cpdef dict lookForSeeds(self, dict sequences, int kup=-1, int mincov=1,object logger=None):
        cdef AhoCorasick patterns
        cdef dict matches
        cdef str k
        cdef bint nuc

        nuc = all([isDNA(sequences[k]) for k in sequences])

        if nuc:
            if logger is not None:
                logger.info('Matching against nucleic probes')
            patterns = NucAhoCorasick()
            kup = 12 if kup < 0 else kup
        else:
            if logger is not None:
                logger.info('Matching against protein probes')
            patterns = ProtAhoCorasick()
            kup = 4 if kup < 0 else kup

        for k in sequences:
            patterns.addSequence(sequences[k],k,kup)

        patterns.finalize()

        #minmatch = 50 if nuc else 15
        minmatch = int(self.getReadSize() // (2 if nuc else 6))
        matches = patterns.scanIndex(self,minmatch,-1,mincov)

        return matches
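To make my reading of that function concrete, here is a minimal pure-Python sketch of the selection logic only. It is not the project's code: `is_dna` is a hypothetical stand-in for `isDNA` (whose real alphabet check I have not looked at), `seed_mode` is a name I made up, and the Aho-Corasick matching itself is left out.

```python
# Assumption: the real isDNA may accept a different alphabet (e.g. IUPAC codes).
DNA_ALPHABET = set("ACGTN")


def is_dna(seq):
    """Hypothetical stand-in for the isDNA helper used in _orgasm.pyx."""
    return set(seq.upper()) <= DNA_ALPHABET


def seed_mode(sequences, read_size, kup=-1):
    """Mirror the branch in lookForSeeds: return (mode, kup, minmatch)."""
    # Nucleic mode is chosen only if every seed sequence looks like DNA.
    nuc = all(is_dna(seq) for seq in sequences.values())
    if nuc:
        kup = 12 if kup < 0 else kup
    else:
        kup = 4 if kup < 0 else kup
    # Same integer division as in the quoted code.
    minmatch = read_size // (2 if nuc else 6)
    return ("nucleic" if nuc else "protein", kup, minmatch)


print(seed_mode({"rrn": "ACGTACGTACGT"}, read_size=100))
print(seed_mode({"rrn": "ACGTACGTACGT", "cox1": "MFINRWLFST"}, read_size=100))
```

If this reading is right, a single protein sequence in the seed file switches the whole set to protein mode, which may also be worth mentioning in the help text.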
Assuming the seeds can be DNA, please clarify the --seeds help text, and explain, for example, whether an entire reference mitochondrial genome could be used or whether the seeds should be fragments only (for example, genes from a known mitochondrial genome of a related species).