kmx

Counting k-mers

kmx is a program that reads input sequences in Fasta or FastQ format and builds a k-mer index. Currently, k-mers are counted together with their reverse complement, which is usually what you want. In contrast to most other tools of its kind, kmx uses Judy arrays to store k-mers instead of hash tables. In practice, this doesn’t matter a whole lot, but I think that in many cases kmx may use less memory.
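To illustrate what counting k-mers together with their reverse complement means, here is a minimal Python sketch. It is not how kmx is implemented (it uses a plain Counter rather than Judy arrays), and the function names are made up for the example. It assumes one common convention: each k-mer is stored under the lexicographically smaller of itself and its reverse complement, so both strands end up under the same key. Skipping k-mers that contain anything other than A, C, G or T is also an assumption.

```python
from collections import Counter

# Translation table for complementing a DNA string.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement.

    Assumption: the lexicographically smaller of the two is used.
    """
    return min(kmer, reverse_complement(kmer))


def count_canonical_kmers(seq, k):
    """Count the k-mers of seq, merging each with its reverse complement."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: k-mers with ambiguous bases (e.g. N) are skipped.
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts
```

For the sequence ACGT and k=2, the 2-mers AC and GT are reverse complements of each other and are counted under the same key, while CG is its own reverse complement.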

Basic functionality

Currently, kmx has much of the same functionality as similar tools like Khmer or jellyfish. It can count k-mers up to a size of 32 (Judy array keys are limited to 64 bits, which holds 32 bases at two bits each), and it can now also build partial indexes. The latter means that the counting process can be broken up into smaller sub-processes (there’s a script included for doing that) and the outputs merged afterwards. Indexing 200GB of FastQ data split across 32 processes takes between 1:40 and 3:00 hours per process (using 32 2.7GHz Xeon cores), and then 47 minutes for merging. This is in the same ballpark as the other tools mentioned in the Turtle paper, but with kmx, each process takes from 700MB to 6GB of RAM, meaning it could easily be distributed over a cluster of normal workstations, or you could run the processes sequentially on your laptop. (Running it in a single-threaded process takes 21 hours.)
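The two ideas in this paragraph, the 64-bit limit and the merging of partial indexes, can be sketched in a few lines of Python. This is an illustration only, not kmx’s code: the base-to-bits mapping (A=0, C=1, G=2, T=3), the function names, and the representation of a partial index as a sorted stream of (key, count) pairs are all assumptions.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Assumed 2-bit code for each base; kmx may order the bases differently.
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Pack a k-mer into an integer, two bits per base.

    32 bases use exactly 64 bits, which is why k cannot exceed 32
    when keys are limited to one 64-bit word.
    """
    if len(kmer) > 32:
        raise ValueError("k-mer does not fit in 64 bits")
    value = 0
    for base in kmer:
        value = (value << 2) | _CODE[base]
    return value


def merge_partial(*parts):
    """Merge sorted (key, count) streams, summing counts of equal keys.

    Each input must already be sorted by key. The merge is lazy, so
    only one pair per input stream is held in memory at a time.
    """
    merged = heapq.merge(*parts, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, sum(count for _, count in group)
```

Because each partial index is sorted, the merge step only needs to stream through the inputs once, which fits with merging being much cheaper than counting.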

File format

kmx uses a compressed file format based on differential coding of the k-mers. This makes it compact as well as simple to parse and generate, and it allows for streaming analysis (e.g. histograms and regression). One would wish others would adopt it, but I’m not getting my hopes up. The 200GB test data results in a 20GB index.
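As a rough idea of how differential coding keeps a sorted k-mer index small, here is a Python sketch. It is not kmx’s actual file format, whose byte layout is not described here. It assumes k-mers are stored as sorted integers, that each record holds the difference from the previous key followed by the count, and that both numbers are written as variable-length integers (7 bits per byte, high bit set when more bytes follow). All names are hypothetical.

```python
def write_varint(n, out):
    """Append a non-negative integer to out, 7 bits per byte."""
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit: more bytes follow
        else:
            out.append(byte)
            return


def read_varint(data, pos):
    """Read one varint from data at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_index(pairs):
    """Encode (key, count) pairs, sorted by key, as key deltas and counts."""
    out = bytearray()
    prev = 0
    for key, count in pairs:
        write_varint(key - prev, out)  # small when keys are close together
        write_varint(count, out)
        prev = key
    return bytes(out)


def decode_index(data):
    """Stream (key, count) pairs back out of an encoded index."""
    pos = 0
    prev = 0
    while pos < len(data):
        delta, pos = read_varint(data, pos)
        count, pos = read_varint(data, pos)
        prev += delta
        yield prev, count
```

Decoding is a single forward pass that yields pairs in sorted order, which is what makes streaming analyses such as histograms possible without loading the whole index.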

Downscaling k-mer size

A k-mer index built for a certain size k can also be downscaled during analysis, so you can, for instance, generate a set of histograms for k-mer sizes 22-32 without recounting for each k-mer size. (To be honest, this is approximate: it only counts prefixes of the counted k-mers, so you lose a few at the end of each input sequence.)
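The prefix approach, and why it is only approximate, can be shown with a small Python sketch. It works on plain strings and leaves out reverse-complement handling entirely, so it illustrates the idea rather than what kmx does internally; the function names are made up.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count all k-mers of a single sequence (no reverse complements)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def downscale(counts, j):
    """Derive approximate j-mer counts from k-mer counts by summing
    the counts of all k-mers that share the same prefix of length j."""
    smaller = Counter()
    for kmer, n in counts.items():
        smaller[kmer[:j]] += n
    return smaller


seq = "ACGTA"
approx = downscale(count_kmers(seq, 4), 3)  # 3-mers derived from 4-mers
exact = count_kmers(seq, 3)                 # 3-mers counted directly
missing = exact - approx                    # Counter({'GTA': 1})
```

The last 3-mer, GTA, is never the prefix of a 4-mer, so it is lost. In general, going from k to j drops the final k - j positions of each input sequence.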

Usage

The --help option is guaranteed to give you more detailed, accurate, and up-to-date information than this, but here’s a quick overview of the different modes:

  • kmx count - builds the k-mer index

  • kmx dump - dumps k-mers and their associated counts

  • kmx hist - outputs a histogram: each frequency and the number of k-mers with that frequency

  • kmx corr - produces correlations and regression parameters between k-mer indices

  • kmx heatmap - produces a heatmap of k-mer frequencies across two indices
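To make the histogram mode concrete, here is what such a histogram contains, computed in Python from an in-memory table of counts. This only illustrates the definition; it does not reproduce kmx’s output format, and the function name is hypothetical.

```python
from collections import Counter


def frequency_histogram(counts):
    """Map each frequency to the number of distinct k-mers that occur
    with that frequency, returned as a list sorted by frequency."""
    return sorted(Counter(counts.values()).items())


# Example: two k-mers occur twice, one occurs once, one occurs five times.
example = {"AC": 2, "CG": 1, "GT": 2, "TT": 5}
histogram = frequency_histogram(example)  # [(1, 1), (2, 2), (5, 1)]
```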

Get it

…from the darcs repository at http://malde.org/~ketil/biohaskell/kmx.

…or use its new home at GitHub, http://github.com/ketil-malde/kmx.