The ADPfusion library is missing a number of low-level optimization tweaks, some of which are easier to implement than others.

I have a soon-to-be-uploaded bit-fiddling library, Lib-OrderedBits, which has not yet been written with best performance in mind. Optimized implementations of its operations would improve the performance of a number of programs.
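To illustrate the kind of bit-fiddling meant here, the following is a minimal sketch of one such operation: enumerating all subsets of an n-element set as bitmasks, ordered by population count, using Gosper's hack to step to the next mask with the same number of set bits. This is not the Lib-OrderedBits API, which is not described here; the names nextSamePop and popOrdered are hypothetical, and the code assumes a 64-bit Int with 0 <= n <= 62.

```haskell
import Data.Bits (countTrailingZeros, shiftL, shiftR, xor, (.&.), (.|.))

-- | Hypothetical helper: next larger Int with the same number of set bits
-- (Gosper's hack). Assumes x > 0 and that the result fits in an Int.
-- Shifting by countTrailingZeros x replaces the division by the lowest
-- set bit found in the textbook version of this trick.
nextSamePop :: Int -> Int
nextSamePop x = (((r `xor` x) `shiftR` 2) `shiftR` countTrailingZeros x) .|. r
  where
    lowBit = x .&. negate x  -- lowest set bit of x
    r      = x + lowBit      -- carry clears the lowest run of ones and sets the bit above it

-- | Hypothetical helper: all subsets of {0 .. n-1} as bitmasks, ordered
-- first by population count, then numerically within each count.
-- Assumes 0 <= n <= 62 on a 64-bit Int.
popOrdered :: Int -> [Int]
popOrdered n = concatMap masksWith [0 .. n]
  where
    limit = 1 `shiftL` n
    -- The empty set is handled separately, since nextSamePop needs x > 0.
    masksWith 0 = [0]
    masksWith k = takeWhile (< limit) (iterate nextSamePop ((1 `shiftL` k) - 1))
```

For example, popOrdered 3 evaluates to [0,1,2,4,3,5,6,7]. Replacing the division with a shift by the trailing-zero count is the sort of low-level tweak the paragraph above refers to; whether GHC turns it into a single instruction depends on the target and backend.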

One major missing piece is SSE (SIMD) support in ADPfusion. Work here will probably end up in vector territory quickly and should be coordinated with the GHC/Haskell organization.