gatk
(DO NOT MERGE) Draft PR for reimplementation of annotation-based filtering tools.
Wow, thanks for the detailed comments so far, @davidbenjamin! But perhaps let's quickly chat before you go any further?
There are a lot of things you commented on---temporary integration tests using local files, lots of code/arguments/etc. intentionally copied verbatim over from VQSR/tranches, and entire tools (the "monolithic" GMMVariantTrain and ScikitLearnVariantTrain)---that are rather in flux or will be scrapped/cleaned up shortly. That said, the comments on the code inherited from VQSR will certainly be useful in this process!
But it might save you some time if we could chat so I can give you a rough orientation and perhaps point out where the vestigial VQSR code remains. I think focusing discussion on the high level design of the tools that are likely to stay would also be most useful at this stage. Feel free to throw something on my calendar!
In the end, I think we will probably just retain the BGMM backend + the versions of the tools in the "scalable" package. I left the "monolithic" GMMVariantTrain and ScikitLearnVariantTrain tools in this branch so I could do one round of tieout. That tieout came out OK, so I think we'll abandon the monolithic tools, along with all the associated code outside of the scalable package. If it helps, I can go ahead and remove that stuff from this draft PR.
@samuelklee I will put something on your calendar, but I will finish reviewing the last four files for my own benefit. I need to get my hands dirty to prepare for a big picture orientation.
@davidbenjamin I've significantly refactored the production code; see the last commit. Most of this refactoring was done to make the code for the accounting of different modes (SNP/INDEL/both x BGMM/python x non/allele-specific) more minimal and straightforward. I've also combined the score/apply steps using the TwoPassVariantWalker.
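For orientation, here is a minimal, self-contained sketch of the two-pass score/apply pattern being described. The method names mirror GATK's TwoPassVariantWalker hooks (firstPassApply/afterFirstPass/secondPassApply), but everything else, including the trivial stand-in scorer and mean-based threshold, is purely illustrative and not the PR's actual logic:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the combined score/apply flow. Pass one scores
// every site and collects the scores; between passes a filter threshold
// is derived; pass two emits a PASS/fail status per site.
final class TwoPassScoreApplySketch {
    private final List<Double> firstPassScores = new ArrayList<>();
    private double threshold;

    // pass one: score each site and remember the score
    void firstPassApply(final double annotationValue) {
        firstPassScores.add(score(annotationValue));
    }

    // between passes: derive the threshold from the collected scores
    // (here, simply their mean; the real tool would use the model)
    void afterFirstPass() {
        threshold = firstPassScores.stream()
                .mapToDouble(Double::doubleValue).average().orElse(0.);
    }

    // pass two: apply the threshold to each site; true = PASS
    boolean secondPassApply(final double annotationValue) {
        return score(annotationValue) >= threshold;
    }

    // stand-in for the BGMM/python backend score of a site's annotations
    private static double score(final double annotationValue) {
        return annotationValue;
    }
}
```

The appeal of the two-pass shape is that the threshold is fixed before any record is written, so scoring and filtering stay in one tool without buffering output.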
There's still lots of documentation, cleanup, and hardening/validation to be done, but most of the key methods and design choices have been documented, so I think it could be worth a quick review at this stage. Again, no need to nitpick code-style details, etc. (unless you really want to!) In the meantime, I'm going to do some more testing/tieout to make sure the refactor didn't break anything.
This covers ~1800 LOC, which is roughly 50% of the equivalent VQSR code. Even modulo the remaining work just mentioned, which may add a few hundred LOC, I think this is a decent improvement---additional functionality, stability, etc. notwithstanding!
There are stubs for adding the truth-sensitivity conversion you proposed; it should be pretty straightforward. I think it should also still be pretty easy for future pushes to add features like extraction/downsampling of unlabeled data, etc., but please do keep an eye out for design choices that may ultimately be constraining.
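As a sketch of what that conversion might look like: given the scores of known truth-set variants, find the score cutoff at which a desired fraction of the truth set is retained. The class and method names below are hypothetical, not the PR's actual API:

```java
import java.util.Arrays;

// Hypothetical sketch of a score -> truth-sensitivity conversion:
// the returned cutoff is the largest score such that at least
// `sensitivity` (e.g., 0.99) of the truth scores are >= cutoff.
final class TruthSensitivityConverter {
    static double cutoffForTruthSensitivity(final double[] truthScores,
                                            final double sensitivity) {
        final double[] sorted = truthScores.clone();
        Arrays.sort(sorted);                      // ascending order
        final int nToPass = (int) Math.ceil(sensitivity * sorted.length);
        return sorted[sorted.length - nToPass];   // nToPass highest scores pass
    }
}
```

Filtering at that cutoff then gives thresholds expressed in terms of truth sensitivity rather than raw score, in the spirit of VQSR's tranches.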
@ldgauthier might be interested in taking a quick glance, as well.
One more note: I did try to make the score/apply single pass when using the Java BGMM backend (since we can just cheaply score as we go, in contrast to using a file interface with python), but this seemed to slow down variant writing a bit more than expected and it came out as a wash with the two-pass approach. Didn’t seem worth the extra code at this point, but maybe we can get it working better in the future.
I think I've resolved most of the non-BGMM comments in #7954, but will leave this open and address any remaining BGMM comments when that stuff goes in.