strobealign icon indicating copy to clipboard operation
strobealign copied to clipboard

A potential way to save space for genomes smaller than 2^32nt.

Open ksahlin opened this issue 2 years ago • 6 comments

Consider the following genome of three 20nt long chromosomes:

>chr1
AGACGCAGCGAGACGCAGCG
>chr2
TGCGTTAGTACGATTGCATC
>chr3
CCTTTAGTACGATTGCATCT

Instead of representing the seed location as e.g. (chr3, 10) for position 10 on chr3, we could concatenate the strings and represent it as position 50. The only time this needs to be 'converted back' to (chr3, 10) is for the sam alignments. Such an offset vector could be implemented [0,20,40], storing the offset positions of each chromosome in the concatenated string. One can do binary search over the very small vector to find the right chr_id. The index in the offset vector is the index of our chromosome id (and we already have an index-> chromosome name vector.

How about our hash table (call it H)? Currently, H contains seed hash value as keys and two different values:

If unique seed: stores reference chr_id and packed seed offset in 4 bytes and chromosome position in 4bytes. If repetitive: store position in a vector with repetitive seeds (4 bytes), and the count of the seed (4bytes).

I think we can use only 4-bytes in H for both cases.

If unique seed: If one is willing to discard the packed strobe 2 position (taking 8 bits) we could go from 8 bytes down to 4 bytes. Here, the 4 bytes would simply be the position in the concatenated string. The strobe 2 position is, from what I remember, only used for checking if distance between the two strobes are the same on query and reference. If yes, first try hamming distance. otherwise go directly to alignment. We could instead always try hamming distance first as nearly no cost.

If repetitive seed: We could store only the offset position in the vector in 4 bytes in H. The count could instead be fetched by having a (new) hash map M : {hash_value -> hash_value_next} which is obtained after sorting the vector on hash values. M should be fairly succinct as is only need to store one key-value pair for each repetitive seed. Getting the count (i.e., number of positions to iterate over in the vector) for a seed with the i:th smallest hash value h_i would require four operations:

  1. Obtain offset o_i of h_i from querying H
  2. Obtain the next hash value h_{i+1} throughM
  3. Obtain offset o_{i+1}of h_i+1 in H
  4. count = o_{i+1} - o_i

The operations 1-3 of course comes at a cost in runtime compared to only have one lookup. I wonder how much if e.g. computation of prefetched data sa suggested in #203.

Edit: The packed strobe 2 position is also used for merging hits into 'merged hits' (called NAMs in the code). Two seed hits are merged if they overlap on both the query and the reference in the same order. To detect overlap, we use the length of the seed, where the position of the second strobe is needed. The extent to which the exact length of seeds is needed (or whether an approximate seed-length can be used) remains to be explored.

ksahlin avatar Feb 05 '23 19:02 ksahlin

Just documenting an idea down, not a priority. But would be interesting to hear you opinion about this @marcelm.

ksahlin avatar Feb 05 '23 19:02 ksahlin

(Note: I try to use the term "contig" to describe a single record in the input reference FASTA because I find "reference" to be ambigous – it can also be used to describe the entire input FASTA, i.e. the set of contigs).

Concatenating the contigs and using a single integer as position is a good idea. I think BWA does this, and it is how it was done in an experimental read mapper I worked on a long time ago. It makes a lot of sense, but may require quite a bit of changes. (Or maybe not?)

I think the actual contig index is needed earlier than at the time when writing out the SAM record, but that’s not a problem because the binary search is probably very fast. For example, we would need to avoid that an alignment extends from one contig into another. That could be done at the seed or extend stage, but either way, one would need to know where the boundary to the next contig is.

Maybe we could add a couple of N characters between the contigs to at least prevent randstrobes from spanning across contigs.

BTW, I think a better way to think about the "reference chr_id and packed seed offset in 4 bytes and chromosome position in 4bytes." data structure (currently called RefRandstrobe) is to say that it consists of three fielsds, a 24-bit chromosome id, an 8-bit seed offset, and a 32-bit chromosome position. The fact that this is actually somehow packed into two 32-bit fields is an implementation detail. There’s even C syntax for bitfields which would make this even clearer in the code. (But I haven’t gotten around to send in a PR for that.)

The extra hash table for the "next hash offset" sounds good.

marcelm avatar Feb 07 '23 12:02 marcelm

Great. Thanks for feedback. Will note it down as a potential thing to try.

ksahlin avatar Feb 07 '23 20:02 ksahlin

It's funny, I came to ask how strobealign will work with collections of many genomes (we have an immediate application). That would completely not work with this optimization!

ekg avatar Feb 08 '23 15:02 ekg

Good point. This optimization would definitely need to be made optional. We regulaly test on rye, for example, which is already larger than $2^{32}$ bp (it has 6 Gbp).

marcelm avatar Feb 08 '23 15:02 marcelm

Agreed, perhaps the same idea/approach could be used with 64bit ints, or template depending on reference database size?

ksahlin avatar Feb 08 '23 21:02 ksahlin