gatk icon indicating copy to clipboard operation
gatk copied to clipboard

Simplified genotype likelihood calculation (no change in output)

Open davidbenjamin opened this issue 5 years ago • 32 comments

@jamesemery Could you review this? I think you may appreciate it. It took several tries, but I was finally able to write a stripped-down version of the code that actually slightly outperforms the old version. What I realized after a lot of profiling the old code and various failed rewrites was that cache-friendliness is the critical thing here. It turns out that this can be achieved without too many buffers, without precomputing the log frequencies, and without storing 2D and 3D arrays as flattened 1D arrays.

davidbenjamin avatar Jan 07 '20 16:01 davidbenjamin

Do you have numbers for the performance here? How big is this code in the profiler before or after? I'm curious.

The unit tests are about 5% faster. In practice, this won't affect HaplotyeCaller because the overwhelming majority of the CPU cost of those tests comes from the ploidy = 20, allele count = 6 cases. For anything else the genotyping likelihoods calculation is not only not a bottleneck, it's completely negligible.

davidbenjamin avatar Jan 30 '20 16:01 davidbenjamin

Do we want to try to get this in at some point? @davidbenjamin if you'd like to push a quick rebase, I can try to take a stab at a review, although I'd guess @jamesemery would be better qualified.

samuelklee avatar Jan 21 '22 16:01 samuelklee

@samuelklee you are my hero. Let me see how hard the rebase is.

davidbenjamin avatar Jan 21 '22 20:01 davidbenjamin

@samuelklee It wasn't just a rebase, it was a complete rewrite because the old code had since become completely entangled with DRAGEN code. But I did it! Everything is passing, the code is dramatically simpler, and it's even a bit faster.

I have done my best to make a coherent commit history. I would recommend reviewing one commit at a time in side-by-side diff mode. Note that some commits rip out old code and replace it with pseudocode, deferring the new code to a later commit. Other commits tell a story of what all the different caches meant in order to motivate the simpification of later commits.

The baroqueness of the old code was motivated by three considerations:

  • cache-friendliness -- traversing all arrays by incrementing the innermost index, reads. This is absolutely essential.
  • flattening 3D arrays into 1D arrays. This was a premature optimization.
  • Precomputing addition operations -- this was misguided.

The DRAGEN code relied on these caches in a rather complex way, which fortunately turned out not to be necessary and which could be dramatically simplified. My notes on tracking all the variables from the parent genotype calculator down to the DRAGEN calculator are in this google doc: https://docs.google.com/document/d/1v6s57mUAwfj38nL3VdktjA059kYBkJfokq18IDy79E8/edit?usp=sharing

Good luck and don't hesitate to ask me to explain anything.

davidbenjamin avatar Jan 27 '22 21:01 davidbenjamin

Things left for later:

  • GenotypeIndexCalculator sometimes interacts with primitive arrays, sometimes with GenotypeAlleleCounts
  • GenotypeLikelihoodCalculator has extraneous responsibilities and doesn't interact with GenotypeAlleleCounts as well as it should.
  • alleleCountsToIndex(final GenotypeAlleleCounts newGAC, final int[] newToOldAlleleMap) in GenotypeIndexCalculator needs refactoring.
  • GenotypeLikelihoodCalculators is really just a cache of GenotypeAlleleCounts.
  • GenotypeAlleleCounts has some unused and barely-used methods, and it precomputes a lot of quantities that are not often needed and could be computed on-the-fly without difficulty or expense.

davidbenjamin avatar Mar 14 '22 05:03 davidbenjamin

@samuelklee I wrote some benchmarks for the exact combinatorics and you were right, my optimization was pointless. Although the CombinatoricsUtils method does explicitly multiply out instead of using cached factorials 1) the number of multiplications is only min(ploidy, (allele count - 1)), and 2) it actually takes quite a while (much larger than reasonable ploidy and allele count) for multiplication to take longer than the memory access of stored factorials.

I have removed this error in judgment.

davidbenjamin avatar Mar 14 '22 14:03 davidbenjamin

Travis reported job failures from build 38091 Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 38091.13 logs
integration openjdk11 38091.12 logs
unit openjdk8 38091.3 logs

gatk-bot avatar Mar 14 '22 15:03 gatk-bot

Travis reported job failures from build 38110 Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk11 38110.14 logs
cloud openjdk8 38110.1 logs
unit openjdk11 38110.13 logs
integration openjdk11 38110.12 logs
unit openjdk8 38110.3 logs
variantcalling openjdk8 38110.4 logs
integration openjdk8 38110.2 logs

gatk-bot avatar Mar 15 '22 07:03 gatk-bot

You know, I think I will clean up all the entangled genotype allele count caching and iterating.

I have benchmarked pretty thoroughly and discovered that caching GenotypeAlleleCounts for the sake of iterating in sequence is totally pointless. The GenotypeAlleleCounts::increase method is already so efficient that it makes no difference. In fact, caching is slower than using increase when the allele count and ploidy yield more than a few hundred genotypes. Caching is a bit faster for the commonest cases of 2 or 3 alleles in a diploid genotype, but the savings is less than a tenth of a second over an entire WGS run.

davidbenjamin avatar Mar 15 '22 13:03 davidbenjamin

@samuelklee Now it's back to you. I agreed with and implemented all of your suggestions. GenotypeIndexCalculator, GenotypeAlleleCounts, GenotypeLikelihoodsCalculator and GenotypeLikelihoodsCalculator (renamed to GenotypesCache) now have clearly-defined roles. A lot of premature optimization is gone.

davidbenjamin avatar Mar 16 '22 03:03 davidbenjamin

Thanks @davidbenjamin for essentially refactoring this code three times now!

Looking forward to reviewing your latest changes, but it may have to wait until early next week. Apologies for the delay. In the meantime, I see that there is a minor rebase conflict—up to you if you want to address it now, no biggie if you want to wait until after review.

samuelklee avatar Mar 17 '22 14:03 samuelklee

@davidbenjamin just letting you know---I will try my best to get this back to you before I head out on vacation next week! Hopefully still well before the other delaying factor.

samuelklee avatar Apr 12 '22 20:04 samuelklee

@davidbenjamin as we just discussed, back from vacation now and modulo other circumstances (which we also just discussed) should be able to get this back to you in a week. If you could squash the commits after the last review, that might actually be a little cleaner for me to review---perhaps comment a copy of the squashed commit messages here if you'd like to keep them for posterity. Thanks!

samuelklee avatar Apr 28 '22 14:04 samuelklee

@davidbenjamin any bandwidth for keeping up momentum on this one? Maybe we can try to get a rebase and activate GitHub Actions so we can see where we're at?

samuelklee avatar Jun 27 '22 15:06 samuelklee

Github actions tests reported job failures from actions build 2799863346 Failures in the following jobs:

Test Type JDK Job ID Logs
cloud 11 2799863346.11 logs
cloud 8 2799863346.10 logs
unit 11 2799863346.13 logs
integration 11 2799863346.12 logs
unit 8 2799863346.1 logs
integration 8 2799863346.0 logs

gatk-bot avatar Aug 04 '22 22:08 gatk-bot

Github actions tests reported job failures from actions build 2811164603 Failures in the following jobs:

Test Type JDK Job ID Logs
unit 11 2811164603.13 logs
integration 11 2811164603.12 logs
unit 8 2811164603.1 logs
integration 8 2811164603.0 logs

gatk-bot avatar Aug 07 '22 03:08 gatk-bot

Codecov Report

Merging #6351 (28549c9) into master (b370c06) will increase coverage by 0.361%. The diff coverage is 93.182%.

:exclamation: Current head 28549c9 differs from pull request most recent head 440130f. Consider uploading reports for the commit 440130f to get more accurate results

Additional details and impacted files
@@               Coverage Diff               @@
##              master     #6351       +/-   ##
===============================================
+ Coverage     86.297%   86.658%   +0.361%     
+ Complexity     39445     38662      -783     
===============================================
  Files           2351      2329       -22     
  Lines         185118    181309     -3809     
  Branches       20397     19882      -515     
===============================================
- Hits          159752    157119     -2633     
+ Misses         18185     17210      -975     
+ Partials        7181      6980      -201     
Impacted Files Coverage Δ
...tools/copynumber/ModelSegmentsIntegrationTest.java 96.078% <ø> (ø)
...dinstitute/hellbender/utils/MathUtilsUnitTest.java 92.606% <ø> (+0.842%) :arrow_up:
...s/help/DocumentationGenerationIntegrationTest.java 23.529% <66.667%> (+9.244%) :arrow_up:
.../broadinstitute/hellbender/utils/MannWhitneyU.java 94.118% <75.000%> (-0.026%) :arrow_down:
...ender/utils/genotyper/GenotypePriorCalculator.java 88.462% <80.000%> (+0.962%) :arrow_up:
...bender/utils/recalibration/RecalDatumUnitTest.java 95.238% <80.000%> (ø)
...ols/walkers/genotyper/GenotypeIndexCalculator.java 83.019% <83.019%> (ø)
...bender/tools/walkers/genotyper/GenotypesCache.java 86.957% <86.957%> (ø)
...rs/genotyper/afcalc/AlleleFrequencyCalculator.java 84.328% <86.957%> (+2.106%) :arrow_up:
...ute/hellbender/utils/recalibration/RecalDatum.java 87.500% <90.909%> (-0.735%) :arrow_down:
... and 25 more

... and 287 files with indirect coverage changes

codecov[bot] avatar Aug 07 '22 13:08 codecov[bot]

Github actions tests reported job failures from actions build 2812723684 Failures in the following jobs:

Test Type JDK Job ID Logs
integration 11 2812723684.12 logs
integration 8 2812723684.0 logs

gatk-bot avatar Aug 07 '22 13:08 gatk-bot

Github actions tests reported job failures from actions build 2814070537 Failures in the following jobs:

Test Type JDK Job ID Logs
integration 8 2814070537.0 logs
integration 8 2814070537.0 logs

gatk-bot avatar Aug 07 '22 22:08 gatk-bot

Github actions tests reported job failures from actions build 2822355346 Failures in the following jobs:

Test Type JDK Job ID Logs
integration 8 2822355346.0 logs

gatk-bot avatar Aug 09 '22 04:08 gatk-bot

Github actions tests reported job failures from actions build 2837191529 Failures in the following jobs:

Test Type JDK Job ID Logs
integration 8 2837191529.0 logs

gatk-bot avatar Aug 11 '22 04:08 gatk-bot

Github actions tests reported job failures from actions build 2837453255 Failures in the following jobs:

Test Type JDK Job ID Logs
integration 8 2837453255.0 logs

gatk-bot avatar Aug 11 '22 06:08 gatk-bot

Github actions tests reported job failures from actions build 2842681984 Failures in the following jobs:

Test Type JDK Job ID Logs
integration 8 2842681984.0 logs

gatk-bot avatar Aug 11 '22 22:08 gatk-bot

Github actions tests reported job failures from actions build 2856248753 Failures in the following jobs:

Test Type JDK Job ID Logs
integration 8 2856248753.0 logs

gatk-bot avatar Aug 14 '22 16:08 gatk-bot

Github actions tests reported job failures from actions build 2863005179 Failures in the following jobs:

Test Type JDK Job ID Logs
integration 8 2863005179.0 logs

gatk-bot avatar Aug 15 '22 21:08 gatk-bot

Github actions tests reported job failures from actions build 2863935609 Failures in the following jobs:

Test Type JDK Job ID Logs
integration 8 2863935609.0 logs

gatk-bot avatar Aug 15 '22 23:08 gatk-bot

Github actions tests reported job failures from actions build 2864043518 Failures in the following jobs:

Test Type JDK Job ID Logs
integration 8 2864043518.0 logs

gatk-bot avatar Aug 16 '22 00:08 gatk-bot

@davidbenjamin Sorry I've been out and just got around to looking at this. Given that this test appears to run just fine in the Java 11 job (which is not run on our docker), I suspect the failure may have something to do with the jar file we use to test on the docker (which is not the same jar we use on the non-docker tests).

I pulled your branch and all of the generation tasks (gatk doc, wdl gen, javadoc) seem to work fine, so given how much time it looks like this has taken up, I think it would make sense to either disable this test (on the docker only - see below - since we want it to still run in the other CI integration test job), or else remove the variantcalling package from the test package list (if thats the one thats causing the failure ?). And then create a ticket for me with whatever data you have, which I'll follow up on.

If you restore everything to its natural state, you should be able to add this to the DocumentationGenerationIntegrationTest.documentationSmokeTest method and then it will be skipped only when running on the docker:

final DocumentationGenerationIntegrationTest dt = new DocumentationGenerationIntegrationTest();
if (dt.isGATKDockerContainer()) {
        throw new SkipException("See gatk issue #...");
}

cmnbroad avatar Aug 16 '22 13:08 cmnbroad

@cmnbroad I am grateful. I have been trying to huint down the source of this error without success for the last week.

I created the ticket for you. Good luck.

davidbenjamin avatar Aug 16 '22 22:08 davidbenjamin

@samuelklee After disabling the offending test with Chris Norman's blessing, this is ready for your review again. I butchered the commit history during the rebase but in the conversation above you can see everything I did. It's only about five things since the last time.

davidbenjamin avatar Aug 16 '22 22:08 davidbenjamin