dozeu

synchronization with vgteam's fork

Open ekg opened this issue 5 years ago • 5 comments

There is a lot going on here, but my understanding is that a number of issues have been resolved.

I think it's important to synchronize things. Upstream dozeu continues to get improvements and other fixes.

@jeizenga @mr-c @nemequ and @adamnovak might be able to describe relevant things.

ekg avatar Aug 30 '20 14:08 ekg

For my part, I've done:

  • Max gap length determined on a per-alignment basis rather than per-aligner
  • Bug fix to dynamic programming
  • Bug fix to the full length bonus implementation
  • Include-time option to use VG's quality-adjusted scoring (this is the largest amount of code)

As I recall, the reason we were keeping the vgteam fork slightly upstream of the master branch here is that the option to have different insertion and deletion scores did not appear to be complete. Moreover, this is not a functionality that we had particular need for in VG.

jeizenga avatar Aug 30 '20 22:08 jeizenga

The commits from @nemequ and me enable building dozeu on non-x86 architectures and on x86 CPUs that lack SSE4.1. This will let VG build on arm64/aarch64 platforms such as the upcoming Apple Silicon, the well-priced Amazon AWS Graviton servers, and the number 1 ranked supercomputer.

mr-c avatar Aug 31 '20 07:08 mr-c

@ekg @adamnovak @jeizenga Looks like this PR needs a rebase/merge to deal with the conflict?

mr-c avatar Aug 31 '20 07:08 mr-c

Thank you all for fixing bugs, adding extremely useful features, and organizing the code for merging into this repository.

I took a quick look at the code and there are some conflicts. I have some overlapping bug fixes on my devel branch too, so I decided to merge this PR into master first and then make my devel branch the new master. Sorry, but it will take some time.

(I noticed that the current API documentation isn't enough. It would also be nice to have documentation for the new features. The quality-adjusted score matrix needs to be properly understood to be used appropriately, so I will write the documentation once the merge is done. I may ask about parts of the specification I don't understand yet.)

Thanks,

ocxtal avatar Sep 06 '20 14:09 ocxtal

I agree, that's probably true about documenting the quality-adjusted scores. I'll try to summarize here. Basically, you have a separate score matrix for each Phred quality score; these quality scores are typically included in FASTQ files. There's more detail to how I compute the score matrices, but I think dozeu can be agnostic about the theory underlying them. From memory, the main differences in dozeu are:

  • Record all of the score matrices with the aligner
  • Package the quality scores in the query
  • Extract DP scores using both sequence and quality scores
  • Do traceback using both sequence and quality scores
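
To make the idea of "a separate score matrix for each Phred score" concrete, here is a minimal sketch in C. It is not taken from dozeu or VG: the type `qual_matrices_t`, the functions `qual_score` and `fill_toy`, the 2-bit base encoding, and the cap of 64 quality values are all assumptions for illustration, and the toy fill rule is a stand-in for however the real matrices are computed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { N_BASES = 4, MAX_QUAL = 64 };  /* assumed sizes, not dozeu's */

/* Hypothetical layout: one 4x4 substitution matrix per Phred quality value. */
typedef struct {
    int8_t score[MAX_QUAL][N_BASES][N_BASES];
} qual_matrices_t;

/* Look up the score for aligning a reference base against a query base
 * whose base call has the given Phred quality. Bases are assumed to be
 * 2-bit encoded (0..3); out-of-range qualities are clamped. */
static int8_t qual_score(const qual_matrices_t *m,
                         uint8_t ref, uint8_t query, uint8_t qual)
{
    assert(m != NULL);
    if (qual >= MAX_QUAL) qual = MAX_QUAL - 1;
    return m->score[qual][ref & 3][query & 3];
}

/* Toy fill rule (an assumption, NOT VG's model): scale the match and
 * mismatch scores linearly with quality, so that a low-confidence base
 * contributes less in either direction. */
static void fill_toy(qual_matrices_t *m, int8_t match, int8_t mismatch)
{
    assert(m != NULL);
    for (int q = 0; q < MAX_QUAL; q++)
        for (int r = 0; r < N_BASES; r++)
            for (int c = 0; c < N_BASES; c++) {
                int s = (r == c) ? match : mismatch;
                m->score[q][r][c] = (int8_t)(s * q / (MAX_QUAL - 1));
            }
}
```

Under these assumptions, both the DP fill and the traceback would call something like `qual_score` with the query base and its quality together, which is why the quality string has to travel with the query.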

Unfortunately, I had to de-vectorize one of the DP steps to use the quality scores, because the score matrices no longer fit contiguously in 16 bytes, which means you can't use _mm_shuffle_epi8. This seems to lead to a pretty substantial slowdown (a factor of ~2).
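
The contrast can be sketched roughly as follows. This is an illustration under assumptions, not dozeu's actual code: the function names, the flat table layout, and the index encoding `(ref << 2) | query` are hypothetical. With a single 4x4 int8 matrix, the whole table fits in one 128-bit register and `_mm_shuffle_epi8` performs 16 lookups at once; with one matrix per quality value, each lane may need a different 16-byte table, so the lookup falls back to a scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSSE3__)
#include <tmmintrin.h>  /* _mm_shuffle_epi8 */
#endif

/* Single matrix: 16 int8 scores. Each idx[k] must be in 0..15. */
static void lookup16_single(const int8_t table[16],
                            const uint8_t idx[16], int8_t out[16])
{
#if defined(__SSSE3__)
    /* One shuffle selects table[idx[k]] for all 16 lanes. */
    __m128i t = _mm_loadu_si128((const __m128i *)table);
    __m128i i = _mm_loadu_si128((const __m128i *)idx);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(t, i));
#else
    /* Portable fallback with the same result for idx in 0..15. */
    for (int k = 0; k < 16; k++) out[k] = table[idx[k] & 15];
#endif
}

/* Per-quality matrices stored back to back (n_qual tables of 16 bytes).
 * Each lane picks its table by quality, so no single 16-byte register
 * holds all candidate scores and the lookup is done one lane at a time. */
static void lookup16_qual(const int8_t *tables, size_t n_qual,
                          const uint8_t qual[16],
                          const uint8_t idx[16], int8_t out[16])
{
    for (int k = 0; k < 16; k++) {
        assert(qual[k] < n_qual);
        out[k] = tables[(size_t)qual[k] * 16 + (idx[k] & 15)];
    }
}
```

The scalar loop does 16 dependent loads where the vector path does one shuffle, which is at least consistent with the slowdown described above, though the real cost depends on the rest of the DP kernel.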

jeizenga avatar Sep 06 '20 19:09 jeizenga