
Closed
Open
Opened Aug 24, 2015 by Peter Cock (@p-j-a-cock)

oa buildgraph - clarify if seeds must be proteins

The help text suggests the seeds must be proteins, but the tool appears to have built-in nucleotide seed sets (the 'nucrRNA...' entries), and the help for the next argument suggests the seeds can be DNA:

$ oa buildgraph -h
...
  --seeds seeds         protein seeds; either a fasta file containing seeds
                        proteic sequences or internal set of seeds among
                        ['nucrRNAAHypogastrura', 'nucrRNAArabidopsis',
                        'protChloroArabidopsis', 'protMitoCapra',
                        'protMitoMachaon']
  --kup ORGASM:KUP      The word size used to identify the seed reads
                        [default: protein=4, DNA=12]

Reading python/orgasm/indexer/_orgasm.pyx, the code appears to auto-detect protein vs DNA seeds:

    cpdef dict lookForSeeds(self, dict sequences, int kup=-1, int mincov=1,object logger=None):

        cdef AhoCorasick patterns
        cdef dict matches
        cdef str k   
        cdef bint  nuc   
        
        nuc = all([isDNA(sequences[k]) for k in sequences])
            
        if nuc:
            if logger is not None:
                logger.info('Matching against nucleic probes')
            patterns = NucAhoCorasick()
            kup = 12 if kup < 0 else kup
        else:
            if logger is not None:
                logger.info('Matching against protein probes')
            patterns = ProtAhoCorasick()
            kup = 4 if kup < 0 else kup

        for k in sequences:
            patterns.addSequence(sequences[k],k,kup)

        
        patterns.finalize()
                
        #minmatch = 50 if nuc else 15
        minmatch = int(self.getReadSize() // (2 if nuc else 6))

        matches = patterns.scanIndex(self,minmatch,-1,mincov)

        return matches
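
To make the consequences of that auto-detection explicit, here is a minimal pure-Python sketch of how the quoted function seems to derive its defaults. It is not the ORG.Asm implementation: `is_dna` is a hypothetical stand-in for `isDNA` (whose definition is not shown above), `seed_defaults` is an illustrative name, and the seeds are assumed to be plain strings.

```python
# Sketch only: mirrors the default-selection logic of the quoted lookForSeeds.
# is_dna is a hypothetical stand-in for isDNA; the real test may differ.

def is_dna(sequence):
    """Assume a sequence is DNA when it contains only A, C, G and T."""
    return set(sequence.upper()) <= set("ACGT")


def seed_defaults(sequences, read_size, kup=-1):
    """Return (nuc, kup, minmatch) as the quoted code appears to compute them."""
    # The seed set counts as nucleic only if every sequence looks like DNA.
    nuc = all(is_dna(seq) for seq in sequences.values())
    # A negative kup means "use the default for the detected seed type".
    if kup < 0:
        kup = 12 if nuc else 4
    # Minimum match length: half the read size for DNA, a sixth for protein.
    minmatch = read_size // (2 if nuc else 6)
    return nuc, kup, minmatch


print(seed_defaults({"s1": "ACGTACGTACGT"}, read_size=100))
print(seed_defaults({"s1": "ACGT", "p1": "MKVLAAGIVG"}, read_size=100))
```

Under these assumptions, a seed file mixing DNA and protein sequences would be handled entirely as protein, because of the `all(...)` test.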

Assuming the seeds can be DNA, please clarify the --seeds help, and explain, for example, whether an entire reference mitochondrial genome could be used, or whether the seeds should be fragments only (for example, genes from a known mitochondrial genome of a related species).
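
As one possible rewording, the sketch below shows how the two options could be described if DNA seeds are indeed supported. It assumes the command line is built with `argparse` (the help layout suggests so, but I have not checked); `build_parser` and `INTERNAL_SEEDS` are illustrative names, and the help strings are only a suggestion.

```python
import argparse

# Internal seed set names copied from the current help output.
INTERNAL_SEEDS = [
    "nucrRNAAHypogastrura", "nucrRNAArabidopsis",
    "protChloroArabidopsis", "protMitoCapra", "protMitoMachaon",
]


def build_parser():
    """Build a stand-alone parser with a suggested --seeds description."""
    parser = argparse.ArgumentParser(prog="oa buildgraph")
    parser.add_argument(
        "--seeds", metavar="seeds",
        help="protein or DNA seeds; either a fasta file of seed sequences "
             "or one of the internal seed sets: " + ", ".join(INTERNAL_SEEDS),
    )
    parser.add_argument(
        "--kup", type=int, default=-1, metavar="ORGASM:KUP",
        help="word size used to identify the seed reads "
             "[default: protein=4, DNA=12]",
    )
    return parser


if __name__ == "__main__":
    build_parser().print_help()
```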

Reference: org-asm/org-asm#6