Flower - extracting information from SFF files
One of the second generation sequencing technologies is 454 pyrosequencing, which produces output in the form of binary SFF files. Although sequence and quality information can be extracted from these files in the traditional Fasta format, the SFF files also contain more detailed information from the sequencing, and easy access to this information can be important, e.g. to accurately assess sequence errors.
Flower is an alternative to 454’s proprietary tools for dealing with SFF files, and it is now distributed as part of the biosff library. To install it, use
cabal install biosff
or see the installation page for details.
Flower now comes with a small test script (“test.sh”) which downloads an SFF file and verifies a few of the output formats against a manually checked reference.
If you use this, please cite Bioinformatics, 27(7):1041–1042.
Flower reads one or more SFF files, and outputs information in one or more formats.
Flower can summarize information from its input in two ways, either output header information (-i) or output a summary of each read (-s). The header information just tells you the location of the index in the SFF file (probably not terribly useful?), the number of reads, the flow length, and the key that each read is supposed to start with. Read summaries look like this:
# name........ date...... time.... reg trim_l trim_r x_loc y_loc len K2 trimK2 ncount avgQ travgQ FQNS9YO01C29CP 2009-02-09 13:04:36 01 5 60 1150 647 300 0.67 0.77 0 25.15 35.64 FQNS9YO01AINLO 2009-02-09 13:04:36 01 5 340 95 1386 384 0.65 0.66 0 26.63 28.08 FQNS9YO01DTOTU 2009-02-09 13:04:36 01 5 202 1451 864 302 0.59 0.81 1 26.13 39.01
The fields are, from left to right: the read name, the date, time, and plate region (as encoded in the read name, question marks if the read name won’t parse), left and right clipping position, x and y location on the plate, read length, K² (a measure of how much the flow values deviate from integrals), K² for the trimmed read, the number of N base calls (a bad omen for 454 reads), average base quality, and finally, average base quality for the trimmed read.
Fasta (and .qual)
The most common format for any sequence data is probably Fasta. The SFF file embeds the sequence and quality information, and this can be extracted using
flower input.sff -f=output.fasta -q=output.qual
For any output format you can also output to stdout by omitting the file name, so to dump the reads as Fasta to stdout, you can just do:
flower input.sff -f
but obviously, this only works for one output type, if you want more than one, one of them needs to go to a file:
flower input.sff -f -q=output.qual
A format that is gaining ground is the FastQ format. Unfortunately, there is (at least) two different interpretations of the quality values, so we have two options supporting this. To output FastQ with Sanger/Phred style qualities:
flower input.sff -Q=output.fastq
Alternatively, you can output with Illumina-style qualities:
flower input.sff -I=output.fastq
So much for raw sequence information. The usefulness of SFFs is that they contain the raw flowgram information, and each base (or really homopolymer run) is derived from a flow value. To access these flow values, the simplest format is likely the text format, it looks like this:
>FQNS9YO01C29CP Info: 2009-02-09 13:04:36 R1 (1150,647) Clip: 5 60 Flows: 1.15 0.01 1.04 0.02 0.02 0.82 0.32 1.24 0.92 0.29 1.07 0.33 1.09 0.02 1.06 0.02 0.98 0.01 1.05 0.01 0.94 0.01 1.04 0.01 0.93 0.01 1.06 0.01 .... Index: 1 3 6 8 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 .... Bases: tcagTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCtctcgtcgtcgtcgtcgtcgtcgtcgtcgtacgtacgtacgtacgtacgtacgtacgtacgtcgtcgtcgtcgtcgtcgt.... Quals: 37 35 25 26 26 28 28 37 37 37 37 37 37 37 37 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 39 ....
This presents the information from each read in an SFF “read block” in an easily parsed textual format. In brief, the Info field is stuff that’s decoded from the read name: date, time, region (the plate is usually divided in two regions), and coordinates on the plate. If the name doesn’t conform to the standard (e.g. it was generated with FlowSim, this field will be omitted. Next is the clipping information, identifying the high quality segment of the read (using base coordinates, not flows). Then the Flows, the values that are converted to homopolymer lengths (approximately) by rounding to the nearest integer. These are generated by TACG cycles, so 1.15 corresponds to the first T, then no A (0.01), then one C (1.04) and so on. Index links each called base to a position in Flows, so we see that Flow 1, 3, 6, and 8 correspond to the first four bases, the TCAG key.
Finally, Bases is the called sequence, and Quals the associated base qualities.
This is the default output format from Flower, so you just need to do:
If you wish to write it to a file, you can achieve this with
flower input.sff -T=output.txt
Looking at the text output example, you’ll notice a field called “Clip”. This shows the clipping information, typically the first four bases are the “key” (the end of the A-adapter), and then there is low quality and sometimes a B-adapter at the far end. Flower provides two options to trim the output, either you can trim just the key:
flower --trimkey input.sff [output options]
Or you can trim both ends:
flower -t input.sff [output options]
In any case, for all output formats, the part of the read that is between trimpoints is output as upper case, and the trimmed-away bits are output in lower case.
Flower also supports outputting a histogram of flow values for each nucleotide (-h), a histogram of flow values by flow value and flow position, and a verbose (and slow) tabular flow value dump (-F). These are mostly useful for debugging and statistical analysis of SFF files.
If you get a recent version (as of September 2012, you need to go to the darcs repository), you can plot the distribution of flow values by using -h in combination with -p, as in:
flower -h -p /home/ketil/work/biohaskell/flower/F5A5JTJ01.MID5.sff | gnuplot -persist
This might be useful to check for artifacts in the sequencing.
Performance of alternatives for extracting sequences from SFF files. All times are in seconds, using a 3.4GHz CPU, using a 2.2Gb SFF file (approx. 700K Titanium reads) as the input. The sffinfo program is Roche’s proprietary utility, sff_extract is a Python script by Jose Blanca, and sff2fastq is a C utlitity by Indraniel Das.
The old home page is here: http://blog.malde.org/index.php/flower