Flower - extracting information from SFF files

Flower - extracting information from SFF files

One of the second generation sequencing technologies is 454 pyrosequencing, which produces output in the form of binary SFF files. Although sequence and quality information can be extracted from these files in the traditional Fasta format, the SFF files also contain more detailed information from the sequencing, and easy access to this information can be important, e.g. to accurately assess sequence errors.

Flower is an alternative to 454’s proprietary tools for dealing with SFF files, and it is now distributed as part of the biosff library. To install it, use

cabal install biosff

or see the installation page for details.

Flower now comes with a small test script (“test.sh”) which downloads an SFF file and verifies a few of the output formats against a manually checked reference.

If you use this, please cite Bioinformatics, 27(7):1041-1042.

Output formats

Flower reads one or more SFF files, and outputs information in one or more formats.

Summarizing

Flower can summarize information from its input in two ways, either output header information (-i) or output a summary of each read (-s). The header information just tells you the location of the index in the SFF file (probably not terribly useful?), the number of reads, the flow length, and the key that each read is supposed to start with. Read summaries look like this:

# name........  date......      time....        reg     trim_l  trim_r  x_loc   y_loc   len     K2      trimK2  ncount  avgQ    travgQ
FQNS9YO01C29CP  2009-02-09      13:04:36        01      5       60      1150    647     300     0.67    0.77    0       25.15   35.64
FQNS9YO01AINLO  2009-02-09      13:04:36        01      5       340     95      1386    384     0.65    0.66    0       26.63   28.08
FQNS9YO01DTOTU  2009-02-09      13:04:36        01      5       202     1451    864     302     0.59    0.81    1       26.13   39.01

The fields are, from left to right: the read name, the date, time, and plate region (as encoded in the read name, question marks if the read name won’t parse), left and right clipping position, x and y location on the plate, read length, K² (a measure of how much the flow values deviate from integrals), K² for the trimmed read, the number of N base calls (a bad omen for 454 reads), average base quality, and finally, average base quality for the trimmed read.

Fasta (and .qual)

The most common format for any sequence data is probably Fasta. The SFF file embeds the sequence and quality information, and this can be extracted using

flower input.sff -f=output.fasta -q=output.qual

For any output format you can also output to stdout by omitting the file name, so to dump the reads as Fasta to stdout, you can just do:

flower input.sff -f

but obviously, this only works for one output type, if you want more than one, one of them needs to go to a file:

flower input.sff -f -q=output.qual

FastQ

A format that is gaining ground is the FastQ format. Unfortunately, there is (at least) two different interpretations of the quality values, so we have two options supporting this. To output FastQ with Sanger/Phred style qualities:

flower input.sff -Q=output.fastq

Alternatively, you can output with Illumina-style qualities:

flower input.sff -I=output.fastq

Textual format

So much for raw sequence information. The usefulness of SFFs is that they contain the raw flowgram information, and each base (or really homopolymer run) is derived from a flow value. To access these flow values, the simplest format is likely the text format, it looks like this:

>FQNS9YO01C29CP
  Info:         2009-02-09 13:04:36 R1 (1150,647)
  Clip:         5 60
  Flows:        1.15 0.01 1.04 0.02 0.02 0.82 0.32 1.24 0.92 0.29 1.07 0.33 1.09 0.02 1.06 0.02 0.98 0.01 1.05 0.01 0.94 0.01 1.04 0.01 0.93 0.01 1.06 0.01 ....
  Index:        1 3 6 8 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 ....
  Bases:        tcagTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCtctcgtcgtcgtcgtcgtcgtcgtcgtcgtacgtacgtacgtacgtacgtacgtacgtacgtcgtcgtcgtcgtcgtcgt....
  Quals:        37 35 25 26 26 28 28 37 37 37 37 37 37 37 37 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 39 ....

This presents the information from each read in an SFF “read block” in an easily parsed textual format. In brief, the Info field is stuff that’s decoded from the read name: date, time, region (the plate is usually divided in two regions), and coordinates on the plate. If the name doesn’t conform to the standard (e.g. it was generated with FlowSim, this field will be omitted. Next is the clipping information, identifying the high quality segment of the read (using base coordinates, not flows). Then the Flows, the values that are converted to homopolymer lengths (approximately) by rounding to the nearest integer. These are generated by TACG cycles, so 1.15 corresponds to the first T, then no A (0.01), then one C (1.04) and so on. Index links each called base to a position in Flows, so we see that Flow 1, 3, 6, and 8 correspond to the first four bases, the TCAG key.

Finally, Bases is the called sequence, and Quals the associated base qualities.

This is the default output format from Flower, so you just need to do:

flower input.sff

If you wish to write it to a file, you can achieve this with

flower input.sff -T=output.txt

Trimming

Looking at the text output example, you’ll notice a field called “Clip”. This shows the clipping information, typically the first four bases are the “key” (the end of the A-adapter), and then there is low quality and sometimes a B-adapter at the far end. Flower provides three options to trim the output, either you can trim just the key:

flower --trimkey input.sff [output options]

Or you can trim both ends:

flower -t input.sff [output options]

Or, as of version 0.3.7, you can trim adapters:

flower -a input.sff [ouptut options]

In any case, for all output formats, the part of the read that is between quality trimpoints is output as upper case, and the trimmed-away bits are output in lower case. (I.e. trimming with -t should result in all-uppercase output). Note that you can combine trimming with -E to remove reads where no sequence remains (also as of 0.3.7). If you want to modify the SFF file instead, Fclip provides similar trimming capabilities for SFF output.

Other formats

Flower also supports outputting a histogram of flow values for each nucleotide (-h), a histogram of flow values by flow value and flow position, and a verbose (and slow) tabular flow value dump (-F). These are mostly useful for debugging and statistical analysis of SFF files.

If you get a recent version (as of September 2012, you need to go to the darcs repository), you can plot the distribution of flow values by using -h in combination with -p, as in:

flower -h -p /home/ketil/work/biohaskell/flower/F5A5JTJ01.MID5.sff | gnuplot -persist

This might be useful to check for artifacts in the sequencing.

Performance

program fasta fasta+qual fastq
flower 21.4 146.5 23.1
sffinfo 23.0 77.0 N/A
sff_extract 234.2 233.1 163.2
sff2fastq N/A N/A 10.4

Performance of alternatives for extracting sequences from SFF files. All times are in seconds, using a 3.4GHz CPU, using a 2.2Gb SFF file (approx. 700K Titanium reads) as the input. The sffinfo program is Roche’s proprietary utility, sff_extract is a Python script by Jose Blanca, and sff2fastq is a C utlitity by Indraniel Das.

Further information

The old home page is here: http://blog.malde.org/index.php/flower

The biosff library on HackageDB: http://hackage.haskell.org/package/biosff and darcs source repository: http://malde.org/~ketil/biohaskell/biosff