
De-duplicate `seq_to_randstrobes2` and `seq_to_randstrobes2_read`

Open: marcelm opened this issue 1 year ago • 1 comment

marcelm, Sep 19 '22 18:09

Some comments on why I did it in two separate functions (see also the comments under the function header in the code, lines 535-538).

The reference only needs seeds created in the forward direction. The queries need seeds extracted from both directions, since the seeds are not necessarily the same in both directions(!), as opposed to canonical k-mers, which are the same in both directions.

However, by using canonical syncmers we can at least avoid computing them in both directions, but we still have to do the 'linking' in both directions.

What is needed 'roughly' is:

  1. create canonical syncmers
  2. create randstrobes for the fw strand (the linking of two syncmers)
  3. create randstrobes for the rc strand (the linking of two syncmers)

Therefore, I think one way is to break it up into seq_to_canonical_syncmers(..) and syncmers_to_seeds(..), where syncmers_to_seeds(..) would be called twice for the reads: once with the original list of syncmers (fw direction) and once with the list of syncmers reversed (RC direction). Note that the vector pos_to_seq_choord must not only be reversed but also re-coordinated, as in the code:

    for (unsigned int i = 0; i < nr_hashes; i++) {
        pos_to_seq_choord[i] = read_length - pos_to_seq_choord[i] - k;
    }
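
To make the "reverse and re-coordinate" step concrete, here is a minimal sketch of how it could be isolated into its own helper. The function name `to_rc_coordinates` and its signature are hypothetical and not part of strobealign; it assumes that each position is the 0-based start of a syncmer of length `k` on the forward strand and that the positions are in ascending order.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not in strobealign): converts forward-strand syncmer
// start positions into start positions on the reverse-complemented read.
std::vector<unsigned int> to_rc_coordinates(
    std::vector<unsigned int> positions,  // forward-strand starts, ascending
    unsigned int read_length,
    unsigned int k                        // syncmer length
) {
    // The last syncmer on the forward strand is the first one on the RC
    // strand, so reverse the order first.
    std::reverse(positions.begin(), positions.end());
    for (auto& pos : positions) {
        assert(pos + k <= read_length);  // the syncmer must fit in the read
        // Same formula as in the loop above: flip the start coordinate.
        pos = read_length - pos - k;
    }
    return positions;
}
```

For a read of length 20 with k = 5, forward positions 0, 3, 15 become 0, 12, 15 on the RC strand. Applying the helper a second time gives back the original positions, which is a cheap sanity check for the coordinate flip.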

ksahlin, Sep 20 '22 07:09