
new PasteDepthEvidence tool

Open · tedsharpe opened this pull request 2 years ago

Bincov generation in one step. Need to think about whether we need to sample to discover the full bin size (like in the current SetBins task), or whether we'll always supply it as an explicit parameter of the workflow, or whether the approach I took here (set it from the first row, unless explicitly provided) is adequate. If not, I'll write code to buffer some rows to figure out the bin size before writing anything to the output file.
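
For reference, a minimal sketch of the "set it from the first row, unless explicitly provided" idea; the class and method names are hypothetical, not the tool's actual code, and only htsjdk's Feature interface (1-based, closed coordinates) is assumed.

```java
import htsjdk.tribble.Feature;

/** Hypothetical sketch: use an explicit bin size if one was supplied, otherwise infer it from the first record. */
public final class BinSizeSketch {
    private int binSize; // 0 (or less) means "not yet known"

    public BinSizeSketch(final int explicitBinSize) {
        this.binSize = explicitBinSize; // pass 0 to infer the bin size from the data
    }

    /** Resolve the bin size from the first feature seen, unless it was explicitly provided. */
    public int resolveBinSize(final Feature firstFeature) {
        if (binSize <= 0) {
            // htsjdk Features use 1-based, closed coordinates, so a bin spans end - start + 1 bases.
            binSize = firstFeature.getEnd() - firstFeature.getStart() + 1;
        }
        return binSize;
    }
}
```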

tedsharpe · Sep 22 '22 18:09

Codecov Report

Merging #8031 (8bf70b4) into master (0261d43) will decrease coverage by 0.014%. The diff coverage is 81.051%.

:exclamation: Current head 8bf70b4 differs from pull request most recent head 190ff8d. Consider uploading reports for the commit 190ff8d to get more accurate results.

Additional details and impacted files
@@               Coverage Diff               @@
##              master     #8031       +/-   ##
===============================================
- Coverage     86.648%   86.635%   -0.013%     
- Complexity     38974     38992       +18     
===============================================
  Files           2337      2341        +4     
  Lines         182791    182958      +167     
  Branches       20067     20114       +47     
===============================================
+ Hits          158385    158505      +120     
- Misses         17364     17398       +34     
- Partials        7042      7055       +13     
Impacted Files | Coverage Δ
...e/hellbender/engine/FeatureDataSourceUnitTest.java | 95.146% <ø> (+0.612%) :arrow_up:
...tute/hellbender/engine/FeatureManagerUnitTest.java | 90.826% <ø> (ø)
...llbender/engine/FeatureSupportIntegrationTest.java | 100.000% <ø> (ø)
...ute/hellbender/utils/codecs/AbstractTextCodec.java | 28.571% <28.571%> (ø)
...adinstitute/hellbender/tools/IndexFeatureFile.java | 61.333% <38.298%> (-38.667%) :arrow_down:
...lbender/utils/codecs/FeatureOutputCodecFinder.java | 73.077% <50.000%> (-6.090%) :arrow_down:
...roadinstitute/hellbender/engine/FeatureWalker.java | 92.308% <66.667%> (ø)
...stitute/hellbender/utils/io/TextFeatureReader.java | 78.125% <78.125%> (ø)
...titute/hellbender/tools/sv/PasteDepthEvidence.java | 78.788% <78.788%> (ø)
...e/hellbender/engine/TabularMultiFeatureWalker.java | 80.000% <80.000%> (ø)
... and 76 more

codecov[bot] · Sep 22 '22 19:09

GitHub Actions tests reported job failures from actions build 3255713647. Failures occurred in the following jobs:

Test Type JDK Job ID Logs
cloud 8 3255713647.10 logs
cloud 11 3255713647.11 logs
unit 11 3255713647.13 logs
integration 11 3255713647.12 logs
unit 8 3255713647.1 logs
integration 8 3255713647.0 logs

gatk-bot · Oct 15 '22 12:10

GitHub Actions tests reported job failures from actions build 3255830437. Failures occurred in the following jobs:

Test Type JDK Job ID Logs
unit 11 3255830437.13 logs
integration 11 3255830437.12 logs
unit 8 3255830437.1 logs
integration 8 3255830437.0 logs

gatk-bot · Oct 15 '22 13:10

GitHub Actions tests reported job failures from actions build 3257265858. Failures occurred in the following jobs:

Test Type JDK Job ID Logs
unit 11 3257265858.13 logs
integration 11 3257265858.12 logs
unit 8 3257265858.1 logs
integration 8 3257265858.0 logs

gatk-bot · Oct 15 '22 22:10

GitHub Actions tests reported job failures from actions build 3259842595. Failures occurred in the following jobs:

Test Type JDK Job ID Logs
integration 11 3259842595.12 logs
integration 8 3259842595.0 logs

gatk-bot · Oct 16 '22 15:10

Rationale for engine changes: This tool opens a large number of feature files (TSVs, not VariantContexts) and iterates over them simultaneously. There is no querying, just a single pass through each file.

Issue 1: When a feature file lives in the cloud, it takes unacceptably long (typically several seconds) to initialize it. A few seconds doesn't seem like a long time, but when there are large numbers of feature files to open, it adds up. The cause is that a large number of codecs (mostly the VCF-processing codecs) open the file and read its first few bytes in their canDecode method. To avoid this, I've reversed the order in which we test each codec: first check whether it produces the correct subtype of Feature, and only then call canDecode. If you don't know what specific subtype you need, you can ask for any Feature by passing Feature.class. It's much faster that way. (A sketch of this check ordering appears below.)

Issue 2: Each open feature source soaks up a huge amount of memory. Text-based feature reading is optimized for VCFs, which can have enormously long lines, so huge buffers are allocated. The problem is compounded for cloud-based feature files, for which we allocate a large cloud prefetch buffer (though that feature can be turned off, which helps a little). The biggest memory hog, however, is the TabixReader, which always reads in the index, whether or not it is used, and tabix indices are very large. To avoid this, I've created a smaller, simpler FeatureReader subclass called TextFeatureReader that loads the index only when necessary. These revisions let the new tool run with an order of magnitude less memory, and faster, too.

Issue 3: The code in FeatureDataSource that creates a FeatureReader is brittle and tests for various subclasses. To allow use of the new TextFeatureReader, I added a FeatureReaderFactory interface that lets one ask the codec for an appropriate FeatureReader.
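
For illustration, here is a minimal sketch of the reordered check from Issue 1, not the actual engine code: the cheap Feature-subtype test runs before the potentially expensive canDecode() probe. Only htsjdk's FeatureCodec API (getFeatureType() and canDecode()) is assumed; the class and method names below are hypothetical.

```java
import htsjdk.tribble.Feature;
import htsjdk.tribble.FeatureCodec;
import java.util.List;

/** Hypothetical sketch: filter codecs by Feature subtype before the expensive canDecode() probe. */
public final class CodecSelectionSketch {

    /**
     * Returns the first candidate codec that both produces the requested Feature subtype
     * and can decode the file at the given path, or null if none qualifies.
     */
    public static FeatureCodec<? extends Feature, ?> findCodec(
            final List<FeatureCodec<? extends Feature, ?>> candidateCodecs,
            final Class<? extends Feature> requiredFeatureType,
            final String featurePath) {
        for (final FeatureCodec<? extends Feature, ?> codec : candidateCodecs) {
            // Cheap check first: does this codec even emit the Feature subtype we asked for?
            // Passing Feature.class as requiredFeatureType accepts any codec.
            if (!requiredFeatureType.isAssignableFrom(codec.getFeatureType())) {
                continue;
            }
            // Only then pay for canDecode(), which may open the file and read its first bytes
            // (a multi-second operation for a cloud-resident file).
            if (codec.canDecode(featurePath)) {
                return codec;
            }
        }
        return null;
    }
}
```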

tedsharpe · Oct 19 '22 17:10

Reassigning to @lbergelson to review the engine changes in this PR in my absence.

droazen · Dec 06 '22 06:12

Results of merging 3 files with ~10,000,000 features each from a GCS bucket, using my home Linux box: CPB 1 is not quite twice as fast as CPB 0, so you were right. CPB > 1 just slows things down.

(Earlier comments erased, because they were stupid: I forgot that I hadn't hooked up prefetch.)

tedsharpe · Dec 19 '22 20:12

@tedsharpe Thanks for checking. In general I've seen that CPB tends to help a lot when reading through long contiguous stretches of a BAM file and less when doing anything on smaller or fragmented data. I'm surprised it didn't make any difference here, but it seems like it doesn't, so that's fine.

I've seen catastrophic interactions with insufficiently buffered index inputs when prefetching is disabled, where the reader ended up performing an HTTP request for every byte, but hopefully that's avoided here just by using a buffered reader for the index.

I have a plan to someday enable the smarter -L-aware prefetcher, which would use the list of actual positions of interest to decide what to buffer, but that's not happening on any specific schedule.

lbergelson · Dec 19 '22 20:12

I believe all your comments have been addressed. Thanks very much for the review. It was helpful.

tedsharpe · Dec 22 '22 20:12

Oh: the main reason for reimplementing a text-based feature reader is that the existing Tabix reader reads the entire index (which is huge) into memory right in the constructor. Also, the abstract tabix parsing code doesn't work on some of the weird files people have invented that have odd combinations of header lines. I think there were other reasons, too, that I've forgotten. I studied it for a while to see if I could fix it, but eventually came to the conclusion that it was hopeless.
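
As a contrast with the eager constructor described above, here is a minimal sketch of the lazy approach using htsjdk's IndexFactory; the class name and structure are assumptions, not the actual TextFeatureReader code.

```java
import htsjdk.tribble.index.Index;
import htsjdk.tribble.index.IndexFactory;

/** Hypothetical sketch: defer reading the (potentially huge) index until a query actually needs it. */
public final class LazyIndexSketch {
    private final String indexPath;
    private Index index; // stays null until the first query

    public LazyIndexSketch(final String indexPath) {
        this.indexPath = indexPath; // unlike an eager reader, nothing is read here
    }

    /** Load the index on first use; a pure streaming pass over the file never pays this cost. */
    public Index getIndex() {
        if (index == null) {
            index = IndexFactory.loadIndex(indexPath);
        }
        return index;
    }
}
```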

tedsharpe · Dec 22 '22 20:12

@tedsharpe So we've addressed a bunch of the issues you've mentioned in the new htsjdk.beta reader APIs. In particular, we now optimize the decoding checks so that we scan the beginning of each file exactly once and then give the necessary bytes to all available codecs. That should work in combination with your optimization to push down the type filtering, which makes a lot of sense as well.

Lazily loading the indexes when you need to query them is a good idea. I think the thought behind aggressively loading them was probably to handle potential errors up front instead of waiting until later, and it's confounded by the bad choice of automatically discovering indexes by default, so the engine can't tell whether there SHOULD be an index until it tries and fails to load it. We've also addressed that in the new APIs by giving the client more control over how indexes are discovered, which should allow for cleaner lazy loading and things like that.

lbergelson · Dec 23 '22 20:12