gatk
gatk copied to clipboard
Simplified genotype likelihood calculation (no change in output)
@jamesemery Could you review this? I think you may appreciate it. It took several tries, but I was finally able to write a stripped-down version of the code that actually slightly outperforms the old version. What I realized after a lot of profiling the old code and various failed rewrites was that cache-friendliness is the critical thing here. It turns out that this can be achieved without too many buffers, without precomputing the log frequencies, and without storing 2D and 3D arrays as flattened 1D arrays.
Do you have numbers for the performance here? How big is this code in the profiler before or after? I'm curious.
The unit tests are about 5% faster. In practice, this won't affect HaplotyeCaller because the overwhelming majority of the CPU cost of those tests comes from the ploidy = 20, allele count = 6 cases. For anything else the genotyping likelihoods calculation is not only not a bottleneck, it's completely negligible.
Do we want to try to get this in at some point? @davidbenjamin if you'd like to push a quick rebase, I can try to take a stab at a review, although I'd guess @jamesemery would be better qualified.
@samuelklee you are my hero. Let me see how hard the rebase is.
@samuelklee It wasn't just a rebase, it was a complete rewrite because the old code had since become completely entangled with DRAGEN code. But I did it! Everything is passing, the code is dramatically simpler, and it's even a bit faster.
I have done my best to make a coherent commit history. I would recommend reviewing one commit at a time in side-by-side diff mode. Note that some commits rip out old code and replace it with pseudocode, deferring the new code to a later commit. Other commits tell a story of what all the different caches meant in order to motivate the simpification of later commits.
The baroqueness of the old code was motivated by three considerations:
- cache-friendliness -- traversing all arrays by incrementing the innermost index, reads. This is absolutely essential.
- flattening 3D arrays into 1D arrays. This was a premature optimization.
- Precomputing addition operations -- this was misguided.
The DRAGEN code relied on these caches in a rather complex way, which fortunately turned out not to be necessary and which could be dramatically simplified. My notes on tracking all the variables from the parent genotype calculator down to the DRAGEN calculator are in this google doc: https://docs.google.com/document/d/1v6s57mUAwfj38nL3VdktjA059kYBkJfokq18IDy79E8/edit?usp=sharing
Good luck and don't hesitate to ask me to explain anything.
Things left for later:
-
GenotypeIndexCalculator
sometimes interacts with primitive arrays, sometimes withGenotypeAlleleCounts
-
GenotypeLikelihoodCalculator
has extraneous responsibilities and doesn't interact withGenotypeAlleleCounts
as well as it should. -
alleleCountsToIndex(final GenotypeAlleleCounts newGAC, final int[] newToOldAlleleMap)
inGenotypeIndexCalculator
needs refactoring. -
GenotypeLikelihoodCalculators
is really just a cache ofGenotypeAlleleCounts
. -
GenotypeAlleleCounts
has some unused and barely-used methods, and it precomputes a lot of quantities that are not often needed and could be computed on-the-fly without difficulty or expense.
@samuelklee I wrote some benchmarks for the exact combinatorics and you were right, my optimization was pointless. Although the CombinatoricsUtils
method does explicitly multiply out instead of using cached factorials 1) the number of multiplications is only min(ploidy, (allele count - 1)), and 2) it actually takes quite a while (much larger than reasonable ploidy and allele count) for multiplication to take longer than the memory access of stored factorials.
I have removed this error in judgment.
Travis reported job failures from build 38091 Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
unit | openjdk11 | 38091.13 | logs |
integration | openjdk11 | 38091.12 | logs |
unit | openjdk8 | 38091.3 | logs |
Travis reported job failures from build 38110 Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
cloud | openjdk11 | 38110.14 | logs |
cloud | openjdk8 | 38110.1 | logs |
unit | openjdk11 | 38110.13 | logs |
integration | openjdk11 | 38110.12 | logs |
unit | openjdk8 | 38110.3 | logs |
variantcalling | openjdk8 | 38110.4 | logs |
integration | openjdk8 | 38110.2 | logs |
You know, I think I will clean up all the entangled genotype allele count caching and iterating.
I have benchmarked pretty thoroughly and discovered that caching GenotypeAlleleCounts
for the sake of iterating in sequence is totally pointless. The GenotypeAlleleCounts::increase
method is already so efficient that it makes no difference. In fact, caching is slower than using increase
when the allele count and ploidy yield more than a few hundred genotypes. Caching is a bit faster for the commonest cases of 2 or 3 alleles in a diploid genotype, but the savings is less than a tenth of a second over an entire WGS run.
@samuelklee Now it's back to you. I agreed with and implemented all of your suggestions. GenotypeIndexCalculator
, GenotypeAlleleCounts
, GenotypeLikelihoodsCalculator
and GenotypeLikelihoodsCalculator
(renamed to GenotypesCache
) now have clearly-defined roles. A lot of premature optimization is gone.
Thanks @davidbenjamin for essentially refactoring this code three times now!
Looking forward to reviewing your latest changes, but it may have to wait until early next week. Apologies for the delay. In the meantime, I see that there is a minor rebase conflict—up to you if you want to address it now, no biggie if you want to wait until after review.
@davidbenjamin just letting you know---I will try my best to get this back to you before I head out on vacation next week! Hopefully still well before the other delaying factor.
@davidbenjamin as we just discussed, back from vacation now and modulo other circumstances (which we also just discussed) should be able to get this back to you in a week. If you could squash the commits after the last review, that might actually be a little cleaner for me to review---perhaps comment a copy of the squashed commit messages here if you'd like to keep them for posterity. Thanks!
@davidbenjamin any bandwidth for keeping up momentum on this one? Maybe we can try to get a rebase and activate GitHub Actions so we can see where we're at?
Github actions tests reported job failures from actions build 2799863346 Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
cloud | 11 | 2799863346.11 | logs |
cloud | 8 | 2799863346.10 | logs |
unit | 11 | 2799863346.13 | logs |
integration | 11 | 2799863346.12 | logs |
unit | 8 | 2799863346.1 | logs |
integration | 8 | 2799863346.0 | logs |
Github actions tests reported job failures from actions build 2811164603 Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
unit | 11 | 2811164603.13 | logs |
integration | 11 | 2811164603.12 | logs |
unit | 8 | 2811164603.1 | logs |
integration | 8 | 2811164603.0 | logs |
Codecov Report
Merging #6351 (28549c9) into master (b370c06) will increase coverage by
0.361%
. The diff coverage is93.182%
.
:exclamation: Current head 28549c9 differs from pull request most recent head 440130f. Consider uploading reports for the commit 440130f to get more accurate results
Additional details and impacted files
@@ Coverage Diff @@
## master #6351 +/- ##
===============================================
+ Coverage 86.297% 86.658% +0.361%
+ Complexity 39445 38662 -783
===============================================
Files 2351 2329 -22
Lines 185118 181309 -3809
Branches 20397 19882 -515
===============================================
- Hits 159752 157119 -2633
+ Misses 18185 17210 -975
+ Partials 7181 6980 -201
Impacted Files | Coverage Δ | |
---|---|---|
...tools/copynumber/ModelSegmentsIntegrationTest.java | 96.078% <ø> (ø) |
|
...dinstitute/hellbender/utils/MathUtilsUnitTest.java | 92.606% <ø> (+0.842%) |
:arrow_up: |
...s/help/DocumentationGenerationIntegrationTest.java | 23.529% <66.667%> (+9.244%) |
:arrow_up: |
.../broadinstitute/hellbender/utils/MannWhitneyU.java | 94.118% <75.000%> (-0.026%) |
:arrow_down: |
...ender/utils/genotyper/GenotypePriorCalculator.java | 88.462% <80.000%> (+0.962%) |
:arrow_up: |
...bender/utils/recalibration/RecalDatumUnitTest.java | 95.238% <80.000%> (ø) |
|
...ols/walkers/genotyper/GenotypeIndexCalculator.java | 83.019% <83.019%> (ø) |
|
...bender/tools/walkers/genotyper/GenotypesCache.java | 86.957% <86.957%> (ø) |
|
...rs/genotyper/afcalc/AlleleFrequencyCalculator.java | 84.328% <86.957%> (+2.106%) |
:arrow_up: |
...ute/hellbender/utils/recalibration/RecalDatum.java | 87.500% <90.909%> (-0.735%) |
:arrow_down: |
... and 25 more |
Github actions tests reported job failures from actions build 2812723684 Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
integration | 11 | 2812723684.12 | logs |
integration | 8 | 2812723684.0 | logs |
Github actions tests reported job failures from actions build 2814070537 Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
integration | 8 | 2814070537.0 | logs |
integration | 8 | 2814070537.0 | logs |
Github actions tests reported job failures from actions build 2822355346 Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
integration | 8 | 2822355346.0 | logs |
Github actions tests reported job failures from actions build 2837191529 Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
integration | 8 | 2837191529.0 | logs |
Github actions tests reported job failures from actions build 2837453255 Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
integration | 8 | 2837453255.0 | logs |
Github actions tests reported job failures from actions build 2842681984 Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
integration | 8 | 2842681984.0 | logs |
Github actions tests reported job failures from actions build 2856248753 Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
integration | 8 | 2856248753.0 | logs |
Github actions tests reported job failures from actions build 2863005179 Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
integration | 8 | 2863005179.0 | logs |
Github actions tests reported job failures from actions build 2863935609 Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
integration | 8 | 2863935609.0 | logs |
Github actions tests reported job failures from actions build 2864043518 Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
integration | 8 | 2864043518.0 | logs |
@davidbenjamin Sorry I've been out and just got around to looking at this. Given that this test appears to run just fine in the Java 11 job (which is not run on our docker), I suspect the failure may have something to do with the jar file we use to test on the docker (which is not the same jar we use on the non-docker tests).
I pulled your branch and all of the generation tasks (gatk doc, wdl gen, javadoc) seem to work fine, so given how much time it looks like this has taken up, I think it would make sense to either disable this test (on the docker only - see below - since we want it to still run in the other CI integration test job), or else remove the variantcalling package from the test package list (if thats the one thats causing the failure ?). And then create a ticket for me with whatever data you have, which I'll follow up on.
If you restore everything to its natural state, you should be able to add this to the DocumentationGenerationIntegrationTest.documentationSmokeTest
method and then it will be skipped only when running on the docker:
final DocumentationGenerationIntegrationTest dt = new DocumentationGenerationIntegrationTest();
if (dt.isGATKDockerContainer()) {
throw new SkipException("See gatk issue #...");
}
@cmnbroad I am grateful. I have been trying to huint down the source of this error without success for the last week.
I created the ticket for you. Good luck.
@samuelklee After disabling the offending test with Chris Norman's blessing, this is ready for your review again. I butchered the commit history during the rebase but in the conversation above you can see everything I did. It's only about five things since the last time.