GSoC 2014 - Transalign

(Image source: http://learnyouahaskell.com, cc Miran Lipovača)

The Google Summer of Code is a great opportunity for students to work on a project of their own and, at the same time, gain deeper insight into the Open Source community. Students gain new experience not only in programming but also in working with people from all over the world, since the Open Source community has members in many different places and countries. I like this idea and way of thinking, so I decided to participate and write a proposal.
I wrote my proposal for the transalign project, mentored by Ketil Malde, with the Open Bioinformatics Foundation as the mentoring organization.

This page serves as a blog during GSoC to document my work on the project. In the following sections I describe the project, its related work and the aim I want to reach. After that I describe the project plan: I copied the timeline from my proposal, and I will write a weekly post about what I did in the past week and what I plan to do in the following week.

The Project

Transitive alignments are a new and challenging idea, as they provide a completely new approach to a relatively old problem in bioinformatics. Alignments are mostly used to find homologous sequences. Depending on the reference sequences found in the sequence databases, the scores may be quite low even for correct results. This is because large sequence databases are barely curated and therefore contain inaccurate, error-prone data. Transitive alignments try to avoid this problem by scoring more than one alignment, namely at least two alignments that use different databases. In this way, the strengths of large uncurated and small curated sequence databases can be combined: the uncurated databases are used to explore a large number of sequences, whereas the small curated databases serve as a good reference and reduce the number of errors. As a result, the sensitivity for finding homologous sequences increases, and the use of different databases yields longer, less error-prone alignments than those found by conventional alignment programs.
The program is written in Haskell, a functional programming language that is a good choice for problems with a mathematical basis. Additionally, Haskell offers helpful tools for high performance (e.g. the vector and ADPfusion libraries), parallelism (e.g. the repa and parallel packages) and concepts such as software transactional memory (STM).

The wiki page for the transalign project can be seen here and the corresponding paper can be found here.

Description

(Image source: http://learnyouahaskell.com, cc Miran Lipovača)

Transalign is a new approach to finding homologous sequences for a given query sequence using pairwise alignments. To combine the advantages of a large database with those of a small but curated one, the query sequence is aligned in two steps. First, the query sequence Q is aligned against a large intermediate database B using blastx. Second, blastp is used to align the large intermediate database B against a small but curated database, the target database T. Transalign then takes the resulting alignments as input and searches for the sequences in the intermediate database that occur in both alignments. Based on those parts, an alignment between the query sequence Q and the target database T can be found. In this way, a higher sensitivity can be reached when aligning against the target database.
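
The two-step idea described above can be sketched in Haskell as a join over the shared intermediate sequences. This is only a simplified illustration, not the transalign implementation: the `Hit` record, the `transitiveHits` function and the rule of taking the weaker of the two scores are assumptions made for this sketch, and it works on whole-sequence hits only, ignoring alignment positions.

```haskell
import qualified Data.Map.Strict as Map

-- A pairwise hit between two sequences, as it might be parsed from
-- BLAST output. The record and its field names are hypothetical.
data Hit = Hit
  { hitFrom  :: String  -- sequence that was aligned
  , hitTo    :: String  -- sequence it was aligned against
  , hitScore :: Double  -- alignment score
  } deriving (Show, Eq)

-- Join query->intermediate hits (Q against B) with intermediate->target
-- hits (B against T) on the shared intermediate sequence.
-- Assumption for this sketch: the combined score is the weaker of the
-- two scores, and only the best score per (query, target) pair is kept.
transitiveHits :: [Hit] -> [Hit] -> Map.Map (String, String) Double
transitiveHits qb bt = Map.fromListWith max
  [ ((hitFrom q, hitTo t), min (hitScore q) (hitScore t))
  | q <- qb
  , t <- Map.findWithDefault [] (hitTo q) byIntermediate
  ]
  where
    -- Index the second alignment step by intermediate sequence name.
    byIntermediate = Map.fromListWith (++) [ (hitFrom t, [t]) | t <- bt ]

main :: IO ()
main = do
  let qb = [Hit "Q1" "B1" 50, Hit "Q1" "B2" 80]
      bt = [Hit "B1" "T1" 90, Hit "B2" "T1" 60, Hit "B2" "T2" 30]
  mapM_ print (Map.toList (transitiveHits qb bt))
```

In the example data, Q1 reaches T1 through both B1 and B2, and the better of the two paths is kept. Intermediate sequences that appear in only one of the two alignment steps contribute nothing.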

Aim

The aim is to analyze the performance and space consumption of the transalign program and to improve both by converting the program into high-performance Haskell code. Depending on the time available, I will also discuss how transalign could be made available as a web service.

The Student :)

My name is Sarah Berkemer and I am a Master's student in Bioinformatics, so I am interested in contributing to the OBF by providing Open Source bioinformatics programs. I also like the idea of functional programming, especially Haskell, which I plan to use for my Master's thesis as well.

Project Plan

In my proposal I wrote a detailed timeline of what I plan to do during the weeks of GSoC. This timeline can be seen in the next subsection. During GSoC I will keep adding to the Work in Progress subsection: each week I will write a few sentences about what I did in the past week and what I plan to do in the next one.

Proposal Timeline

The timeline I wrote in my proposal can be seen here.

Work in Progress

(Image source: http://learnyouahaskell.com, cc Miran Lipovača)

In this subsection I plan to post (bi-)weekly updates to let you know what I did in the past week(s) and what I plan to do in the following week(s). This might differ from the timeline I originally wrote. ;)

References

Interesting and useful!

Paper

The transalign paper: Malde K, Furmanek T (2013) Increasing Sequence Search Sensitivity with Transitive Alignments. PLoS ONE 8(2).

Code

The original code can be found here.
This is my GitHub account, which contains the code I am currently working on.

Haskell

Learn You a Haskell for Great Good!
Real World Haskell
Hoogle