An important goal of the BioHaskell effort is to organize libraries. Below is the list of the various libraries and their current state.
Shared definitions library: biocore
This contains (very) basic functionality and data types intended to be shared among other libraries. Using it ensures that libraries are compatible, and that the same types are used to represent the same things.
Sequence format parsers
Reading and writing Fasta-formatted sequences (although biocore can generate Fasta-formatted ByteStrings directly).
Reading and writing FastQ-formatted sequences (but see above).
Functionality for dealing with SFF files, as produced by Roche 454 and ABI Ion Torrent sequencers. Includes the flower executable, which can convert SFF files into a variety of formats.
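As a rough illustration of what reading and writing Fasta involves, here is a minimal sketch that uses only the Haskell base library. It is not the API of biocore or of any package listed here: the names FastaRecord, parseFasta and renderFasta are hypothetical, and it works on String where the real libraries use ByteString for efficiency.

```haskell
-- Hypothetical sketch, not the API of any BioHaskell package.

-- | One Fasta record: the header line (without the leading '>')
--   and the sequence data with line breaks removed.
data FastaRecord = FastaRecord
  { recHeader :: String
  , recSeq    :: String
  } deriving (Eq, Show)

-- | Parse Fasta text into records. Any lines before the first
--   header are ignored. Assumes Unix line endings.
parseFasta :: String -> [FastaRecord]
parseFasta = go . dropWhile (not . isHeader) . lines
  where
    isHeader ('>' : _) = True
    isHeader _         = False

    go []         = []
    go (h : rest) =
      let (body, more) = break isHeader rest
      in  FastaRecord (drop 1 h) (concat body) : go more

-- | Render records, wrapping sequence lines at the given width
--   (widths below 1 are treated as 1).
renderFasta :: Int -> [FastaRecord] -> String
renderFasta width = concatMap render
  where
    w = max 1 width
    render (FastaRecord h s) = unlines (('>' : h) : chunks s)
    chunks [] = []
    chunks xs = let (a, b) = splitAt w xs in a : chunks b
```

Loading this in GHCi and evaluating parseFasta on a small input is enough to try it out. A production parser would additionally handle Windows line endings, validate the alphabet, and stream large files rather than holding them in memory.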
Alignment format parsers
Parsing the BLAST XML output format.
Parsing Multiple EM for Motif Elicitation (MEME) XML output.
The ACE alignment output format.
The PHD sequence format, as output by Phred.
A small library enabling reading and writing of PSL files, as output by e.g. BLAT. It also contains some example programs for extracting and manipulating PSL data.
This package supports parsing and rendering of files in Stockholm 1.0 format, which is used by Pfam and Rfam for multiple sequence alignments. The library offers both a streaming interface that runs in constant memory and a convenient document interface that uses as much memory as the largest family in the Stockholm file. Both interfaces are accessed through conduit, but a lazy version of the document interface is provided for one-off scripts.
The library by Nick Ingolia provides facilities for working with sequence locations, for instance to describe and manipulate genome annotations.
Searches for a provided nucleotide or protein sequence using the NCBI BLAST REST service and returns the result, which the service delivers in XML format, as a BlastResult datatype.
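To give a feel for what "working with sequence locations" can mean in practice, the following is a minimal sketch built on the Haskell base library only. It does not reproduce the API of Nick Ingolia's library: the types Strand and Loc and the functions locEnd, overlaps, extract and complement are hypothetical names chosen for illustration, and the 0-based, half-open coordinate convention is an assumption.

```haskell
-- Hypothetical sketch, not the API of any BioHaskell package.

-- | Strand of a double-stranded sequence.
data Strand = Plus | Minus
  deriving (Eq, Show)

-- | A contiguous location: 0-based start offset, length, and strand.
data Loc = Loc
  { locStart  :: Int
  , locLength :: Int
  , locStrand :: Strand
  } deriving (Eq, Show)

-- | Exclusive end offset of a location.
locEnd :: Loc -> Int
locEnd l = locStart l + locLength l

-- | Do two locations share at least one position?
--   Strand is deliberately ignored here.
overlaps :: Loc -> Loc -> Bool
overlaps a b = locStart a < locEnd b && locStart b < locEnd a

-- | Extract the subsequence at a location, reverse-complemented
--   for the Minus strand. Returns Nothing if out of bounds.
extract :: Loc -> String -> Maybe String
extract l s
  | locStart l < 0 || locLength l < 0 || locEnd l > length s = Nothing
  | otherwise = Just (orient (take (locLength l) (drop (locStart l) s)))
  where
    orient = case locStrand l of
      Plus  -> id
      Minus -> reverse . map complement

-- | Complement of an upper-case DNA base; other characters pass through.
complement :: Char -> Char
complement c = case c of
  'A' -> 'T'
  'C' -> 'G'
  'G' -> 'C'
  'T' -> 'A'
  _   -> c
```

A genome annotation can then be modelled as a named Loc, and queries such as "which annotations overlap this region" reduce to filtering with overlaps. Spliced locations made of several contiguous pieces would need a further type on top of this.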
RNA secondary structure
Parsing parameter files
Works with RNA secondary structure parameter files, transforms strings into a highly efficient internal format, and provides some functions for dealing with Infernal covariance models. The library will be extended with several “DataSource”s soon, which will allow users to import typical data easily.
Note: the other Biobase libraries are deprecated; their functionality is now included in Biobase.*
- import/export secondary structures based on some form of the Vienna dot-bracket notation (((...(((...)))..)))
- import/export extended secondary structures as used by RNAwolf
- FR3D contains already parsed PDB RNA structures
- this library extracts basepairs and sequence from FR3D data
- including complete directories full of entries
- verbose hits
- tabulated hits
- stockholm files
- covariance models
- currently being converted to iteratee
- reading of MAF files
- based on iteratee
- TrainingData to be used for training RNAwolf
- imports from FR3D and DotP
- exports trainingdata elements
- imports trainingdata
- import Turner 2004 energy parameter files
- rna primary and secondary structure
- tree-based representations
- some datasources reading and writing dot-bracket and similar notations (e.g. rnastrand data)
- named -xna so that it supports both -dna and -rna. The internals are a bit rough, which is acceptable because the library targets high-performance use
- Importer and Exporter for Vienna energy files. Allows converting Turner parameter files to Vienna parameter files.
- Provides an algebraic ring class and instances for Gibbs free energy, partition function probabilities, and scores. Conversion between different entities is provided by a convert function. All entities are ready for the vector library.
- Enumeratees for FASTA-handling and convenience functions. In a typical application, the user should write an enumeratee to extract information to allow for efficient low-memory handling of queries.
- Vienna RNAfold v2.0
- import and export of Turner and Vienna tables
- asymptotically fast reimplementation of MC-Fold (Parisien, Major, 2008)
- importing of mcfold-db
- extended rna secondary structure folding
- version 0.3 includes full stacking
- folding is reasonably fast due to the use of additional arrays (expect to fold 300-500 nt in seconds)
- 2-diagrams will be back soon, if no bugs show up
- complete 2-diagrams for multibranched loops will follow later due to the large constant overhead
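The Vienna dot-bracket notation mentioned in the list above encodes a secondary structure as a string in which matching parentheses mark paired positions and dots mark unpaired ones. The following minimal sketch, using only the Haskell base library, shows how such a string can be turned into a list of base pairs and back. The function names parseDotBracket and renderDotBracket are hypothetical and are not taken from the Biobase libraries; only the plain notation with round brackets is handled, not the extended notation used by RNAwolf.

```haskell
-- Hypothetical sketch, not the API of any Biobase package.
import Data.List (sort)

-- | Parse plain dot-bracket notation into 0-based base pairs (i, j)
--   with i < j, sorted by opening position. Returns Left with a
--   message on unbalanced brackets or unexpected characters.
parseDotBracket :: String -> Either String [(Int, Int)]
parseDotBracket = go [] [] . zip [0 ..]
  where
    -- First argument: stack of still-open positions.
    -- Second argument: pairs collected so far.
    go [] acc []      = Right (sort acc)
    go (i : _) _ []   = Left ("unmatched '(' at position " ++ show i)
    go stack acc ((i, c) : rest) = case c of
      '(' -> go (i : stack) acc rest
      '.' -> go stack acc rest
      ')' -> case stack of
        []       -> Left ("unmatched ')' at position " ++ show i)
        (j : js) -> go js ((j, i) : acc) rest
      _   -> Left ("unexpected character " ++ show c
                   ++ " at position " ++ show i)

-- | Render base pairs over a sequence of the given length back
--   into dot-bracket notation. Quadratic, which is fine for a sketch.
renderDotBracket :: Int -> [(Int, Int)] -> String
renderDotBracket n pairs = map symbol [0 .. n - 1]
  where
    symbol k
      | k `elem` map fst pairs = '('
      | k `elem` map snd pairs = ')'
      | otherwise              = '.'
```

The stack-based scan is linear in the length of the string. A performance-oriented library would more likely store the result in an unboxed vector than in a list of tuples.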
Tertiary structures (3D)
For reading and analyzing Protein Data Bank (PDB) format files.
Nuclear magnetic resonance
Parsing STAR* format files from the Biological Magnetic Resonance Data Bank (BMRB).
Parsing the output of the TALOS+ program, which predicts protein backbone torsion angles from chemical shifts.
This library contains data types for sequences and various kinds of alignments, as well as functionality for reading and writing many different file formats. Development is driven by the needs of applications, so while large parts of the library are solid and efficient, other parts are less mature or less feature-complete.
Planned new libraries
The following libraries are “kind of” new. Mostly, they have been part of some package but are sufficiently different that they can stand apart from bioinformatics in general.
An optimization scheme that has been used successfully for NLP, RNA secondary structure and other tasks (I guess ;-).
Depending on time constraints, a variant of the Haskell GLPK library, but for convex optimizers.