
Add XGBoostGqVariantFilter, a tool to recalibrate GQ

Open TedBrookings opened this issue 2 years ago • 4 comments

  • Added MinGqVariantFilterBase
    • loads VCF, pedigree, UCSC genome tract, and truth data
    • calculates variant overlap with genome tracts
    • forms matrices, tensors, and other helper data for machine learning
    • provides for TRAIN and FILTER modes
    • provides functions for calculating loss given assigned min GQ values
    • computes best estimate of truth data used for training xgboost model
  • Added XGBoostMinGqVariantFilter
    • calculates new GQ based on gradient boosting
  • Added PropertiesTable for loading VCF properties into tensors
  • Added TractOverlapDetector for computing overlap properties with UCSC genome tracts
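
The overlap calculation listed above can be illustrated with a minimal sketch. This is not the code from this PR: the class name `TractOverlapSketch`, the method `overlapFraction`, and the half-open `[start, end)` coordinate convention are assumptions made for illustration, and the real `TractOverlapDetector` may compute different or additional properties.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: fraction of a variant interval covered by genome tracts.
// Intervals are int[]{start, end}, half-open, all on the same contig (assumed).
class TractOverlapSketch {

    static double overlapFraction(int variantStart, int variantEnd, List<int[]> tracts) {
        if (variantEnd <= variantStart) {
            throw new IllegalArgumentException("variant interval must be non-empty");
        }
        // Sort a copy by start so one left-to-right sweep suffices.
        List<int[]> sorted = new ArrayList<>(tracts);
        sorted.sort(Comparator.comparingInt(t -> t[0]));

        long covered = 0;
        int cursor = variantStart; // bases left of cursor have already been accounted for
        for (int[] tract : sorted) {
            int lo = Math.max(tract[0], cursor);     // clip to the uncounted region
            int hi = Math.min(tract[1], variantEnd); // clip to the variant
            if (hi > lo) {
                covered += hi - lo;
                cursor = hi; // prevents double counting where tracts overlap each other
            }
        }
        return (double) covered / (variantEnd - variantStart);
    }

    public static void main(String[] args) {
        List<int[]> tracts = new ArrayList<>();
        tracts.add(new int[] {50, 120});
        tracts.add(new int[] {110, 150});
        tracts.add(new int[] {190, 300});
        System.out.println(overlapFraction(100, 200, tracts));
    }
}
```

The sweep clips each tract to the part of the variant not yet counted, so tracts that overlap one another do not inflate the fraction.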

Training loss is based on a weighted combination of heredity and truth data, broken down by variant category.
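
As a rough illustration of what a weighted, per-category loss could look like, here is a minimal sketch. It is an assumption, not the PR's actual formula: the class `CombinedLossSketch`, the parameter `truthWeight`, the category names `DEL` and `DUP`, and the choice to average equally over categories are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: blend a heredity-based loss and a truth-based loss
// within each variant category, then average across categories.
class CombinedLossSketch {

    /**
     * @param lossesByCategory category name -> {heredityLoss, truthLoss} (assumed layout)
     * @param truthWeight      weight in [0, 1] given to the truth loss (assumed parameter)
     */
    static double combinedLoss(Map<String, double[]> lossesByCategory, double truthWeight) {
        if (lossesByCategory.isEmpty()) {
            throw new IllegalArgumentException("need at least one variant category");
        }
        if (truthWeight < 0.0 || truthWeight > 1.0) {
            throw new IllegalArgumentException("truthWeight must be in [0, 1]");
        }
        double total = 0.0;
        for (double[] losses : lossesByCategory.values()) {
            double heredityLoss = losses[0];
            double truthLoss = losses[1];
            total += (1.0 - truthWeight) * heredityLoss + truthWeight * truthLoss;
        }
        // Equal weight per category, so common categories do not swamp rare ones.
        return total / lossesByCategory.size();
    }

    public static void main(String[] args) {
        Map<String, double[]> losses = new LinkedHashMap<>();
        losses.put("DEL", new double[] {0.2, 0.4}); // illustrative numbers only
        losses.put("DUP", new double[] {0.6, 0.0});
        System.out.println(combinedLoss(losses, 0.25));
    }
}
```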

TedBrookings, Mar 02 '22 17:03

There's a lot of stuff that I know is wrong here:

  1. This is based on a master that's super out of date (I don't want to rebase at this juncture, because I'd need to stop and verify that behavior didn't change due to something else changing in GATK)
  2. No unit tests. Up to this point, the basic structure has been changing a lot. It should be pretty well settled now though.
  3. Probably the main classes should be renamed to indicate that they are recalibrating GQ, not just filtering.
  4. I should probably put in a soft-filter option (just recalibrate GQ, don't set GT to no-call)
  5. Probably the output should be called something other than GQ. Phred-scaling is a bad match to probabilities near 50%, but people expect GQ to be Phred-scaled.
  6. Many of the default values are set at non-optimal values. I didn't want to rebuild the docker image each time I tweaked values, so those were tweaked from WDL settings instead. They should be set to something resembling "optimal" before final merge.
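
To make point 5 concrete: under the standard Phred convention, a quality is -10·log10 of the error probability. The sketch below applies that convention to a probability that a genotype is correct. Whether this tool uses exactly this mapping is an assumption, and the class and method names are hypothetical.

```java
// Hypothetical sketch: standard Phred scaling applied to a probability of being correct.
class PhredSketch {

    /** Returns -10 * log10(1 - pCorrect); pCorrect must be in [0, 1). */
    static double probabilityToPhred(double pCorrect) {
        if (pCorrect < 0.0 || pCorrect >= 1.0) {
            throw new IllegalArgumentException("pCorrect must be in [0, 1)");
        }
        return -10.0 * Math.log10(1.0 - pCorrect);
    }

    public static void main(String[] args) {
        // Compare resolution near 50% with resolution near certainty.
        for (double p : new double[] {0.4, 0.5, 0.6, 0.9, 0.99, 0.999}) {
            double phred = probabilityToPhred(p);
            System.out.printf("p=%.3f -> phred=%.2f (rounded %d)%n", p, phred, Math.round(phred));
        }
    }
}
```

A probability of 50% maps to a Phred value of about 3, and the whole range from 40% to 60% spans less than two Phred units, so an integer-valued GQ has very little resolution exactly where a recalibrated probability is most uncertain.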

TedBrookings, Mar 02 '22 17:03

GitHub Actions tests reported job failures from actions build 2616727886. Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --- | --- | --- | --- |
| cloud | 8 | 2616727886.10 | logs |
| unit | 8 | 2616727886.1 | logs |
| conda | 8 | 2616727886.3 | logs |
| variantcalling | 8 | 2616727886.2 | logs |
| integration | 8 | 2616727886.0 | logs |

gatk-bot, Jul 05 '22 14:07

GitHub Actions tests reported job failures from actions build 3024497902. Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --- | --- | --- | --- |
| cloud | 8 | 3024497902.10 | logs |
| unit | 8 | 3024497902.1 | logs |
| conda | 8 | 3024497902.3 | logs |
| integration | 8 | 3024497902.0 | logs |
| variantcalling | 8 | 3024497902.2 | logs |

gatk-bot, Sep 09 '22 18:09

GitHub Actions tests reported job failures from actions build 3024517679. Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --- | --- | --- | --- |
| cloud | 8 | 3024517679.10 | logs |
| unit | 8 | 3024517679.1 | logs |
| conda | 8 | 3024517679.3 | logs |
| variantcalling | 8 | 3024517679.2 | logs |
| integration | 8 | 3024517679.0 | logs |

gatk-bot, Sep 09 '22 18:09