WFA
Feature requests: R bindings and early stopping
Nice work!
Would it be possible to get R bindings for this? I put together a minimal example here: https://github.com/traversc/WavefrontAlignR (feel free to do whatever with it)
I'd also like to request an "early stopping" feature: if the best possible alignment distance exceeds a user-defined threshold, stop the alignment and return a flag value (like INT_MAX). Assuming this doesn't add too much overhead, this would be useful because I'm mostly interested in finding only highly similar sequences between two sets.
Last, I ran a quick benchmark against an existing R package. Is this a fair comparison? Code used to run WFA2 here: https://github.com/traversc/WavefrontAlignR/blob/main/src/WFA_bindings.cpp
# Benchmark for a 10,000 x 10,000 alignment
# "seqs" is a vector of DNA sequences on average 43 bp long
library(WavefrontAlignR)
library(stringdist)
library(tictoc)
# WFA2 levenshtein
tic()
y1 <- WavefrontAlignR::edit_dist_matrix(seqs, seqs)
toc()
# 191.452 sec elapsed, 522324 alignments / sec
# stringdist levenshtein
tic()
y2 <- stringdist::stringdistmatrix(seqs, seqs, method = "lv", nthread=1)
toc()
# 677.356 sec elapsed, 147633 alignments / sec
Sorry for the late reply (I was about to send this message, and then it slipped my mind...).
(1) R bindings
Yes, sure, that would be awesome. At the moment, I don't have the bandwidth to implement this feature, but it is definitely something I would like to have. Thanks for the example and the request.
If you feel like it, you could wrap your example under bindings/r (linked to the current version) and make a pull request. I would be very happy if you took this over and took the credit for it. Only if you want to.
(2) Early stop
There is actually one already. The function wavefront_aligner_set_max_alignment_steps lets you set the maximum number of steps (i.e., the maximum alignment score) to reach before quitting. Have a look and let me know if that is what you are looking for.
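To make the requested behaviour concrete, here is a minimal C++ sketch of the cutoff idea, independent of WFA2. It is not WFA2 code and does not use wavefronts: `bounded_edit_distance` is a hypothetical helper name, and it uses plain row-by-row dynamic programming for Levenshtein distance. It gives up and returns `INT_MAX` as soon as every cell in the current row exceeds the threshold, since the final distance can never be smaller than the minimum of any row.

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper (not part of WFA2): Levenshtein distance between a and b,
// or INT_MAX if the distance is guaranteed to exceed max_dist.
int bounded_edit_distance(const std::string& a, const std::string& b, int max_dist) {
    if (max_dist < 0) return INT_MAX;
    const std::size_t n = a.size();
    const std::size_t m = b.size();

    // The length difference is a lower bound on the edit distance.
    const std::size_t len_diff = n > m ? n - m : m - n;
    if (len_diff > static_cast<std::size_t>(max_dist)) return INT_MAX;

    std::vector<int> prev(m + 1), curr(m + 1);
    std::iota(prev.begin(), prev.end(), 0);  // row 0: distance from empty prefix

    for (std::size_t i = 1; i <= n; ++i) {
        curr[0] = static_cast<int>(i);
        int row_min = curr[0];
        for (std::size_t j = 1; j <= m; ++j) {
            const int substitution = prev[j - 1] + (a[i - 1] != b[j - 1] ? 1 : 0);
            const int deletion = prev[j] + 1;
            const int insertion = curr[j - 1] + 1;
            curr[j] = std::min({substitution, deletion, insertion});
            row_min = std::min(row_min, curr[j]);
        }
        // Early stop: every alignment path crosses this row, and costs only grow.
        if (row_min > max_dist) return INT_MAX;
        std::swap(prev, curr);
    }
    return prev[m] <= max_dist ? prev[m] : INT_MAX;
}
```

For example, `bounded_edit_distance("kitten", "sitting", 3)` returns 3, while the same pair with a threshold of 2 returns `INT_MAX`. In WFA2 the score grows one step at a time, so a cap on the number of steps should play the same role as the row check here.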
Let me know, Thanks.
(3) (NxN) benchmark
In principle, seems fair to me (edit, score only, ...).