Release 4.1.0

November 16th, 2023. Release 4.1.0

New features

  • In the OBITools language, a new gc function computes the GC fraction of a sequence.
  • First version of the obisummary command. It produces summary statistics for the sequence file provided as input. The statistics include the number of reads, the number of variants, the total length of the DNA sequences (equivalent to obicount), and summaries of the tags used in the sequence annotations and their frequencies of usage.
  • First version of the obimatrix command. It produces OTU tables in CSV format from sequence files.
  • The obicsv command now has a --auto option that automatically extracts the attributes present in a file by inspecting the beginning of the sequence file. Only attributes that do not correspond to maps are reported. To extract information from map attributes, see the obimatrix command.
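
For readers unfamiliar with the measure, the GC fraction is the number of G and C bases divided by the sequence length. The Go sketch below only illustrates that definition. The function name gcFraction is hypothetical, and the choice of the full sequence length as denominator is an assumption; the gc function of the OBITools language may handle ambiguous bases differently.

```go
package main

import "fmt"

// gcFraction returns the fraction of G and C bases in seq, ignoring case.
// Assumption: the denominator is the full sequence length, so ambiguous
// or unknown bases count as non-GC positions.
func gcFraction(seq string) float64 {
	if len(seq) == 0 {
		return 0
	}
	gc := 0
	for i := 0; i < len(seq); i++ {
		switch seq[i] {
		case 'g', 'G', 'c', 'C':
			gc++
		}
	}
	return float64(gc) / float64(len(seq))
}

func main() {
	// Four of the eight bases are G or C.
	fmt.Println(gcFraction("acgtacgt")) // prints 0.5
}
```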

Enhancements

  • A new, completely rewritten Go version of the fastq and fasta parsers is now used instead of the original C version.
  • A new file format guesser is now implemented. This is a first step towards allowing new formats to be managed by OBITools.
  • A new way of handling header definitions in fasta and fastq files with JSON headers. The sequence definition is now written in new files as an attribute of the JSON header named "definition". This makes parsers for the sequence headers easier to write.
  • The -D (--delta) option has been added to obipcr. It allows extracting flanking sequences of the barcode.
    • If -D is not set, the output sequence is the barcode itself without the priming sites.
    • If -D is set to 0, the output sequence is the barcode with the priming sites.
    • When -D is set to ### (where ### is an integer), the output sequence is the barcode with the priming sites,
      and ### base pairs of flanking sequences.
  • A new JSON output format is available through the --json-output option. The sequence file is printed as a JSON array, where each element is a map corresponding to a sequence. The map has at most four elements:
    • "id" : the only mandatory element (string)
    • "sequence" : if sequence data is present in the record (string)
    • "qualities" : if quality data is associated with the record (string)
    • "annotations" : if annotations are associated with the record (a map of annotations).
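
As an illustration of the JSON layout described above, here is a minimal Go sketch that decodes such an array. The type SeqRecord, the function parseRecords, and the sample data are hypothetical and written for this note; only the four key names come from the description above. Annotation values are decoded into a generic map because their types are not specified here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SeqRecord is a hypothetical type mirroring the four documented keys.
// Only "id" is mandatory; the other fields stay empty when absent.
type SeqRecord struct {
	ID          string                 `json:"id"`
	Sequence    string                 `json:"sequence,omitempty"`
	Qualities   string                 `json:"qualities,omitempty"`
	Annotations map[string]interface{} `json:"annotations,omitempty"`
}

// parseRecords decodes a JSON array of sequence records.
func parseRecords(data []byte) ([]SeqRecord, error) {
	var records []SeqRecord
	if err := json.Unmarshal(data, &records); err != nil {
		return nil, err
	}
	return records, nil
}

func main() {
	// Made-up sample data, not actual OBITools output.
	sample := []byte(`[
	  {"id": "seq1", "sequence": "acgt", "annotations": {"count": 3}},
	  {"id": "seq2"}
	]`)
	records, err := parseRecords(sample)
	if err != nil {
		panic(err)
	}
	for _, r := range records {
		fmt.Println(r.ID, len(r.Sequence), r.Annotations)
	}
}
```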

Bug fixes

  • In the OBITools language, the composition function now returns a map indexed by the lowercase strings "a", "c", "g", "t", and "o" (for other), instead of being indexed by the ASCII codes of the corresponding letters.
  • Correction of the reverse-complement operation. Every reverse complement of a DNA sequence now follows these rules:
    • Nucleotide codes are complemented to their lowercase complementary base
    • . and - characters are returned without change
    • [ is complemented to ] and vice versa
    • all other characters are complemented as n
  • Correction of a bug in the Subsequence method of the BioSequence class, which duplicated the quality values. This caused obimultiplex to produce fastq files whose sequences had duplicated quality values.
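
The reverse-complement rules listed above can be sketched in Go as follows. This is an illustration only, not the OBITools4 implementation: the names complements and reverseComplement are hypothetical, and the set of IUPAC ambiguity codes treated as nucleotide codes is an assumption.

```go
package main

import "fmt"

// complements maps each handled character to its complement.
// Assumption: nucleotide codes are a, c, g, t plus the IUPAC ambiguity codes.
var complements = map[byte]byte{
	'a': 't', 't': 'a', 'c': 'g', 'g': 'c',
	'r': 'y', 'y': 'r', 'k': 'm', 'm': 'k',
	's': 's', 'w': 'w', 'b': 'v', 'v': 'b',
	'd': 'h', 'h': 'd', 'n': 'n',
	'.': '.', '-': '-', // returned without change
	'[': ']', ']': '[', // brackets are swapped
}

// reverseComplement applies the rules listed above to seq.
func reverseComplement(seq string) string {
	out := make([]byte, len(seq))
	for i := 0; i < len(seq); i++ {
		b := seq[i]
		if b >= 'A' && b <= 'Z' {
			b += 'a' - 'A' // fold to lowercase so the result is lowercase
		}
		c, ok := complements[b]
		if !ok {
			c = 'n' // every other character is complemented as n
		}
		out[len(seq)-1-i] = c
	}
	return string(out)
}

func main() {
	fmt.Println(reverseComplement("aacg-[x]")) // prints [n]-cgtt
}
```

Swapping the brackets keeps a bracketed region well-formed once the sequence has been reversed, as the example output shows.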

Be careful

Go 1.21.0 is out, and it includes new functionalities that are used in the OBITools4 code. If you use the recommended method for compiling OBITools on your computer, there is no problem, as the script always loads the latest Go version. If you rely on your personal Go installation, please remember to update it.