gatk
new PasteDepthEvidence tool
Bincov generation in one step. Need to think about whether we need to sample rows to discover the bin size (as in the current SetBins task), whether we'll always supply it as an explicit parameter of the workflow, or whether the approach I took here (set it from the first row, unless explicitly provided) is adequate. If not, I'll write code to buffer some rows to determine the bin size before writing anything to the output file.
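For illustration, here is a minimal sketch (not the actual PasteDepthEvidence code) of the bin-size logic described above: use the explicit argument when supplied, otherwise take the width of the first record (modeled here with half-open coordinates). All names are illustrative.

```java
// Hypothetical sketch of the "explicit parameter, else infer from first row" strategy.
// BIN_SIZE_UNSET, resolveBinSize, and the coordinate convention are assumptions for
// illustration, not GATK API.
public final class BinSizeSketch {
    static final int BIN_SIZE_UNSET = -1;

    static int resolveBinSize( final int explicitBinSize,
                               final int firstRecordStart,
                               final int firstRecordEnd ) {
        if ( explicitBinSize != BIN_SIZE_UNSET ) {
            return explicitBinSize;               // supplied as a workflow parameter
        }
        return firstRecordEnd - firstRecordStart; // inferred from the first row
    }

    public static void main( final String[] args ) {
        // first record spans [1000, 1100) => inferred bin size of 100
        System.out.println(resolveBinSize(BIN_SIZE_UNSET, 1000, 1100));
    }
}
```

If inferring from the first row turns out to be inadequate, the buffering approach mentioned above would replace the second branch.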
Codecov Report
Merging #8031 (8bf70b4) into master (0261d43) will decrease coverage by 0.014%. The diff coverage is 81.051%.
:exclamation: Current head 8bf70b4 differs from pull request most recent head 190ff8d. Consider uploading reports for the commit 190ff8d to get more accurate results.
Additional details and impacted files
@@ Coverage Diff @@
## master #8031 +/- ##
===============================================
- Coverage 86.648% 86.635% -0.013%
- Complexity 38974 38992 +18
===============================================
Files 2337 2341 +4
Lines 182791 182958 +167
Branches 20067 20114 +47
===============================================
+ Hits 158385 158505 +120
- Misses 17364 17398 +34
- Partials 7042 7055 +13
Impacted Files | Coverage Δ | |
---|---|---|
...e/hellbender/engine/FeatureDataSourceUnitTest.java | 95.146% <ø> (+0.612%) | :arrow_up: |
...tute/hellbender/engine/FeatureManagerUnitTest.java | 90.826% <ø> (ø) | |
...llbender/engine/FeatureSupportIntegrationTest.java | 100.000% <ø> (ø) | |
...ute/hellbender/utils/codecs/AbstractTextCodec.java | 28.571% <28.571%> (ø) | |
...adinstitute/hellbender/tools/IndexFeatureFile.java | 61.333% <38.298%> (-38.667%) | :arrow_down: |
...lbender/utils/codecs/FeatureOutputCodecFinder.java | 73.077% <50.000%> (-6.090%) | :arrow_down: |
...roadinstitute/hellbender/engine/FeatureWalker.java | 92.308% <66.667%> (ø) | |
...stitute/hellbender/utils/io/TextFeatureReader.java | 78.125% <78.125%> (ø) | |
...titute/hellbender/tools/sv/PasteDepthEvidence.java | 78.788% <78.788%> (ø) | |
...e/hellbender/engine/TabularMultiFeatureWalker.java | 80.000% <80.000%> (ø) | |
... and 76 more | | |
GitHub Actions tests reported job failures from actions build 3255713647. Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
cloud | 8 | 3255713647.10 | logs |
cloud | 11 | 3255713647.11 | logs |
unit | 11 | 3255713647.13 | logs |
integration | 11 | 3255713647.12 | logs |
unit | 8 | 3255713647.1 | logs |
integration | 8 | 3255713647.0 | logs |
GitHub Actions tests reported job failures from actions build 3255830437. Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
unit | 11 | 3255830437.13 | logs |
integration | 11 | 3255830437.12 | logs |
unit | 8 | 3255830437.1 | logs |
integration | 8 | 3255830437.0 | logs |
GitHub Actions tests reported job failures from actions build 3257265858. Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
unit | 11 | 3257265858.13 | logs |
integration | 11 | 3257265858.12 | logs |
unit | 8 | 3257265858.1 | logs |
integration | 8 | 3257265858.0 | logs |
GitHub Actions tests reported job failures from actions build 3259842595. Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
integration | 11 | 3259842595.12 | logs |
integration | 8 | 3259842595.0 | logs |
Rationale for engine changes: This tool opens a large number of feature files (TSVs, not VariantContexts) and iterates over them simultaneously. There is no querying, just a single pass through each file.

Issue 1: When a feature file lives in the cloud, it takes unacceptably long (several seconds, typically) to initialize it. A few seconds doesn't seem like a long time, but when there are large numbers of feature files to open, it adds up. The delay is caused by a large number of codecs (mostly the VCF-processing codecs) opening and reading the first few bytes of the file in their canDecode methods. To avoid this, I've reversed the order in which we test each codec: we check first whether it produces the correct subtype of Feature, and only then call canDecode. If you don't know what specific subtype you need, you can just ask for any Feature by passing Feature.class. It's much faster that way.

Issue 2: Each open feature source soaks up a huge amount of memory. That's because text-based feature reading is optimized for VCFs, which can have enormously long lines, so huge buffers are allocated. The problem is compounded for cloud-based feature files, for which we allocate a large cloud prefetch buffer. (Though that feature can be turned off, which helps a little.) But the biggest memory hog is the TabixReader, which always reads in the index, regardless of whether it's used or not, and tabix indices are very large. To avoid this, I've created a smaller, simpler FeatureReader subclass called TextFeatureReader that loads the index only when necessary. The revisions allow the new tool to run using an order of magnitude less memory. It's faster, too.

Issue 3: The code in FeatureDataSource that creates a FeatureReader is brittle and tests for various subclasses. To allow use of the new TextFeatureReader, I added a FeatureReaderFactory interface that lets one ask the codec for an appropriate FeatureReader.
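For illustration, a minimal sketch of the Issue 1 reordering: the cheap, in-memory type check (getFeatureType) runs before the expensive canDecode call that opens and reads the file. Feature, FeatureCodec, getFeatureType, and canDecode are real htsjdk concepts; the surrounding loop, class, and method names are hypothetical, not the actual GATK engine code.

```java
import htsjdk.tribble.Feature;
import htsjdk.tribble.FeatureCodec;

import java.util.List;

// Illustrative sketch only: selects a codec by checking the produced Feature subtype
// before paying for canDecode, which may open and read the (possibly cloud-based) file.
public final class CodecSelectionSketch {
    static FeatureCodec<? extends Feature, ?> findCodec(
            final List<FeatureCodec<? extends Feature, ?>> candidateCodecs,
            final Class<? extends Feature> requiredFeatureType,
            final String path ) {
        for ( final FeatureCodec<? extends Feature, ?> codec : candidateCodecs ) {
            // cheap check first: does this codec even produce the Feature subtype we need?
            // (passing Feature.class accepts any codec's feature type)
            if ( !requiredFeatureType.isAssignableFrom(codec.getFeatureType()) ) {
                continue;
            }
            // only then pay for the expensive check that peeks at the file contents
            if ( codec.canDecode(path) ) {
                return codec;
            }
        }
        return null;
    }
}
```

The ordering matters because the type check touches no I/O, while canDecode on a cloud path can take seconds per codec.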
Reassigning to @lbergelson to review the engine changes in this PR in my absence.
Results of merging 3 files with ~10,000,000 features each from a GCS bucket using my home Linux box: CPB 1 is not quite twice as fast as CPB 0, so you were right. CPB > 1 just slows things down.
(Earlier comments erased, because they were stupid: I forgot that I hadn't hooked up prefetch.)
@tedsharpe Thanks for checking. In general I've seen that CPB tends to help a lot when reading through long contiguous stretches of a BAM file and less when doing anything on smaller or fragmented data. I'm surprised it didn't make much difference here, but it seems like it doesn't, so that's fine.
I've seen catastrophic interactions with insufficiently buffered index inputs when prefetch is disabled, where it ended up performing an HTTP request for every byte, but hopefully that's avoided just by using a buffered reader for the index.
I have a plan to someday enable a smarter -L aware prefetcher that uses the list of actual positions of interest to decide what to buffer, but that's not happening on any specific schedule.
I believe all your comments have been addressed. Thanks very much for the review. It was helpful.
Oh: The main reason for reimplementing a text-based feature reader is that the existing Tabix reader reads the entire index (which is huge) into memory right in the constructor. Also, the abstract tabix parsing code doesn't work on some of the weird files people have invented that have odd combinations of header lines. I think there were other reasons, too, that I've forgotten. I studied it a while to see if I could fix it, but eventually came to the conclusion that it was hopeless.
@tedsharpe So we've addressed a bunch of the issues you've mentioned in the new htsjdk.beta reader APIs. In particular, we now optimize the decoding checks so that we scan the beginning of the files exactly once and then give the necessary bytes to all available codecs. That should work in combination with your optimization to push down the filtering, which makes a lot of sense as well.
Lazily loading the indexes when you need to query for them is a good idea. I think the thought behind aggressively loading them was probably to handle potential errors up front instead of waiting until later, and it's confounded by the bad choice of automatically discovering indexes by default, so the engine can't tell if there SHOULD be an index until it tries and fails to load it. We've also addressed that in the new APIs to give the client more control over how indexes are discovered, which should allow for cleaner lazy loading and things like that.
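For illustration, a minimal, generic sketch of the lazy index loading pattern discussed above: the constructor does no index I/O, and the index is only read the first time a query needs it. IndexLike, the class name, and the method names are illustrative, not htsjdk or GATK types.

```java
import java.util.function.Supplier;

// Illustrative sketch only: a reader that defers loading its (potentially very large)
// index until a query actually needs it, rather than eagerly in the constructor.
public final class LazyIndexReaderSketch {
    interface IndexLike { long[] blocksOverlapping(String contig, int start, int end); }

    private final Supplier<IndexLike> indexLoader; // how to read the index, if ever asked
    private IndexLike index;                       // null until the first query

    LazyIndexReaderSketch( final Supplier<IndexLike> indexLoader ) {
        this.indexLoader = indexLoader;            // constructor does no index I/O
    }

    long[] query( final String contig, final int start, final int end ) {
        if ( index == null ) {
            index = indexLoader.get();             // load the index on first use only
        }
        return index.blocksOverlapping(contig, start, end);
    }

    // a single pass over the whole file never touches the index at all
    void iterateAll() { /* stream records without loading the index */ }
}
```

Loading the index eagerly in the constructor forces every caller to pay for the index read even when, as with this tool, each file is consumed in a single unindexed pass.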