GSOC/kmerindex

K-mer index

  • Mentor: Ketil
  • Umbrella: haskell.org

I am currently working on a program that counts k-mers (and does various analyses on the result). Currently, I am using a Judy array to store the k-mers, and while this is an order of magnitude more compact than other Haskell data structures - and thus also fast - it has a couple of drawbacks. It lives in IO, which I think is an okay compromise, but it also:

  • is limited to k<=32
  • is single threaded

Approach

To remedy this, a more advanced data structure is needed. The details can be fleshed out later, but basically, I think a good approach might be to split the k-mer into three (an l-, m-, and n-mer, say, where l+n+m=k). We could then have the l-mer select a Judy array (with each array having its own thread), the m-mer is an index in the Judy array, and the value is a pointer to a hash table storing the hased n-mer and the count. Or something like that.

Strenghts

Although I need it to store k-mers, the resulting data structure is by no means limited to this purpose. It would basically be a multi-threaded, compact, and fast finite map implementation, storing arbitrary - or at least fairly long integer keys and fixed-size values. Since this would be a useful addition to Haskell in general, and not only this application, I think it might work as a haskell.org project.