FlowSim - a simulation pipeline for pyrosequencing data

FlowSim

FlowSim is a suite of tools for simulating the 454 pyrosequencing process. It is based on the characteristics of real 454 data, and attempts to model the known aspects of the process. If you use it, please cite Bioinformatics, 26(18):i420-i425..

Installation

Please see the installation instructions, with the following caveat: the old version of flower (i.e. version 0.2.8) requires bio >0.4, but there was an incompatible change in 0.5.

If you have ghc and cabal installed, a simple

cabal install flowsim

should work. Or to build it from source, you can do:

darcs get http://malde.org/~ketil/biohaskell/flowsim
cd flowsim
cabal install

Overview

The various tools are designed to use the standard Fasta format for input and output, and although they support reading from and writing to files, the default is to write to stdin and stdout, and to link them together using Unix pipes. In any case, it is easy to add your own programs, for instance to support mate-pair libraries, or to replace individual steps with your own programs.

Most of the tools provide a –help option, which will list the available configuration options.

Components

The following sections describe the utilities that comprise FlowSim, more or less in the sequence they should be applied:

clonesim

Clonesim simulates the shearing step, which typically breaks an input genome into random fragments. It supports user selected distribution of lengths (e.g. uniform, normal, or log-normal distributions), but only uniformly distributed positions at this point.

Usage:

clonesim -c count -l dist input.fasta

gelfilter

Often, a gel is used for selecting a certain range of clone lengths. This (embarassingly simple) utility takes a set of input sequences, and removes sequences that are either shorter than the minimum or longer than the maximum sizes. It will currently only read from stdin and write to stdout.

Usage:

gelfilter min max

kitsim

As a preliminary step to sequencing, synthetic sequences are attached to the ends of each clone. For 454, the A-adapter is attached to the 5’ end, and the B-adapter is attached to the 3’ end. These adapters contain the primers for the emulsion PCR amplification, that copies up each clone in sufficient quantity for the light signal from luciferase to be detectable during sequencing.

The A-adaptor is found at the beginning of each sequences as the TCAG “key”, while the B-adaptor is sometimes found at the end of sequences when the clone is short enough for it to be fully sequenced.

By default, kitsim uses Titanium adapters, but this is user selectable.

Usage:

kitsim -k key -a adapter input.fasta -o output.fasta

mutator

In theory, these sequences are then sequenced, and any error is introduced in the final stage, due to overlapping distributions of the light intensity generated by the different homopolymer lengths (see below). In practice, we find evidence for mutations in the sequences (Balzer et al, submitted), and we provide a general utility for introducing random substitutions and indels in the sequences.

Usage:

mutator -i indel_rate -s subst_rate input.fasta -o output.fasta

duplicator

It has been reported by a wide variety of authors that second generation sequencing (and not only 454) generates artificial duplicates of many clones. This may be caused by too low amounts of input DNA. The duplicator is an attempt to simulate this. Currently, it only supports stdin/stdout.

Usage:

duplicator dup_prob

flowsim

The final stage is the actual pyrosequencing process, where each input clone is converted to a series of light signals, the intensity of which corresponding to homopolymer lengths. Each homopolymer length is converted to a flow value which is adjusted according to its flow distribution. Then the resulting flows are base called, and quality filters applied (including adapter masking), and the resulting reads are output in an SFF file.

Simplified usage (see –help for more information):

flowsim -G generation input.fasta -o output.sff

While most options for these utilities are optional and have (hopefully sensible) defaults, -o is required for flowsim.

Examples

Given an input genome in input.fasta, a simple simulation could be

clonesim -c 10000 input.fasta | kitsim | flowsim -o output.sff

This will make 10K random fragments from the input sequences, tack on adapters (this is necessary because of flowsim’s filter that removes all reads that don’t start with the key sequence), and simulate pyrosequencing, using the default generation.

Further information

Publications

The first incarnation of FlowSim (which was a monolithic executable) is described in Balzer et al. 2010, and presented at ECCB’10. This describes the pre-0.2.6 version.

A follow-up article was presented at ISMB/ECCB 2011 Balzer et al. 2011

HackageDB page http://hackage.haskell.org/package/flowsim

Darcs source code repository is here http://malde.org/~ketil/biohaskell/flowsim