gatk icon indicating copy to clipboard operation
gatk copied to clipboard

Initial support for spanning deletions and refactoring for testability

Open kcibul opened this issue 2 years ago • 9 comments

Spanning Deletion Changes

  1. Created a new SpanningDeletionRecord as a subclass of ReferenceRecord but allows us to store GT and GQ
  2. in handlePotentialSpanningDeletion when processing a deletion, we create a new SpanningDeletionRecord with the correct GT and GQ based on the deletion
  3. When processing reference data at a variant site, return ReferenceRecords/SpanningDeletionRecord instead of just a string "state" since we need more than just state now
  4. Because of the above, we are now returning an object for the inferred state instead of just a string. Since the inferred state is so, so common a singleton InferredReferenceRecord was created
  5. Processing of spanning deletions beyond

Ugly

  1. The construction of the singleton is ugly because it requires a location even though we don't for this case. We could go to an tagging interface (like Cloneable) these all implement, but that seems ugly also

Refactoring Changes One of the challenges with this PR was testing as the work is really done in the lower-level methods and it would be nice to have this as a unit test rather than an integration/end-to-end test. This motivated the following changes:

  1. don't write to VCF directly, instead have take a Consumer<VariantContext> to emit VariantContexts. This let's us provide a different consumer in unit tests to collect our result.
  2. we previously had a chain of calls createVariantsFromSortedRanges -> processSampleRecordsForLocation -> finalizeCurrentVariant that returned void and as a side effect wrote to VCF. These deeper methods now return a VariantContext and the writing (via consumer) is done higher up in the call stack
  3. made some private methods package-private so we could call them from tests

Thinking Out Loud

We have three different sets of datastructures for the same data, some of this is history, some is performance/memory, but could use some rethinking

  1. GenericRecord (pulling from BQ)
  2. ReferenceRecord/SpanningDeletionRecord (in memory data structure for reference data)
  3. ExtractCohortRecord (pre-variant context record)

kcibul avatar May 18 '22 17:05 kcibul

Codecov Report

:exclamation: No coverage uploaded for pull request base (ah_var_store@8781b56). Click here to learn what that means. The diff coverage is n/a.

@@               Coverage Diff                @@
##             ah_var_store     #7857   +/-   ##
================================================
  Coverage                ?   86.240%           
  Complexity              ?     35196           
================================================
  Files                   ?      2173           
  Lines                   ?    165018           
  Branches                ?     17793           
================================================
  Hits                    ?    142311           
  Misses                  ?     16380           
  Partials                ?      6327           

codecov[bot] avatar May 18 '22 18:05 codecov[bot]

Github actions tests reported job failures from actions build 2347277957 Failures in the following jobs:

Test Type JDK Job ID Logs
unit 11 2347277957.13 logs
unit 8 2347277957.1 logs

gatk-bot avatar May 18 '22 18:05 gatk-bot

Github actions tests reported job failures from actions build 2347281121 Failures in the following jobs:

Test Type JDK Job ID Logs
unit 11 2347281121.13 logs
unit 8 2347281121.1 logs

gatk-bot avatar May 18 '22 18:05 gatk-bot

Github actions tests reported job failures from actions build 2360686024 Failures in the following jobs:

Test Type JDK Job ID Logs
unit 11 2360686024.13 logs
unit 8 2360686024.1 logs

gatk-bot avatar May 20 '22 21:05 gatk-bot

PR at https://github.com/broadinstitute/gatk/pull/7857

kcibul avatar May 23 '22 14:05 kcibul

I grok the refactoring changes and those are 👍, but I think I'm going to need a walkthrough for the core spanning deletion work...

mcovarr avatar May 23 '22 23:05 mcovarr

Github actions tests reported job failures from actions build 2488811363 Failures in the following jobs:

Test Type JDK Job ID Logs
unit 11 2488811363.13 logs
unit 8 2488811363.1 logs

gatk-bot avatar Jun 13 '22 14:06 gatk-bot

Github actions tests reported job failures from actions build 2488817096 Failures in the following jobs:

Test Type JDK Job ID Logs
unit 11 2488817096.13 logs
unit 8 2488817096.1 logs

gatk-bot avatar Jun 13 '22 14:06 gatk-bot

Github actions tests reported job failures from actions build 2545363458 Failures in the following jobs:

Test Type JDK Job ID Logs
unit 11 2545363458.13 logs
unit 8 2545363458.1 logs

gatk-bot avatar Jun 22 '22 22:06 gatk-bot