Add XGBoostGqVariantFilter, a tool to recalibrate GQ
- Added MinGqVariantFilterBase
  - loads VCF, pedigree, UCSC genome tract, and truth data
  - calculates variant overlap with genome tracts
  - forms matrices, tensors, and other helper data for machine learning
  - provides TRAIN and FILTER modes
  - provides functions for calculating loss given assigned min GQ values
  - computes the best estimate of truth data used for training the xgboost model
- Added XGBoostMinGqVariantFilter
  - calculates new GQ based on gradient boosting
- Added PropertiesTable for loading VCF properties into tensors
- Added TractOverlapDetector for computing overlap properties with UCSC genome tracts
Training loss is based on weighted combination of heredity and truth data, broken down by variant category.
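To make the tract-overlap step listed above concrete, here is a minimal Python sketch of one plausible overlap property: the fraction of a variant's span covered by a set of tract intervals. This is not the code in TractOverlapDetector (which is Java, and whose exact properties are not described here). The names `merge_intervals` and `overlap_fraction`, the half-open coordinate convention, and the single-contig assumption are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Assumption: intervals are half-open [start, end) on a single contig.
Interval = Tuple[int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so covered bases are counted once."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_fraction(variant: Interval, tract_intervals: List[Interval]) -> float:
    """Fraction of the variant's span covered by any tract interval (hypothetical property)."""
    v_start, v_end = variant
    length = v_end - v_start
    if length <= 0:
        return 0.0
    covered = 0
    for t_start, t_end in merge_intervals(tract_intervals):
        # Length of the intersection, clamped at zero when the intervals are disjoint.
        covered += max(0, min(v_end, t_end) - max(v_start, t_start))
    return covered / length
```

For example, `overlap_fraction((100, 200), [(50, 120), (110, 150), (190, 300)])` merges the first two tract intervals before counting, so the doubly covered bases are not counted twice.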
There's a lot of stuff that I know is wrong here:
- This is based on a master that's super out of date (I don't want to rebase at this juncture, because I'd need to stop and verify that behavior didn't change due to something else changing in GATK)
- No unit tests. Up to this point, the basic structure has been changing a lot. It should be pretty well settled now though.
- Probably the main classes should be renamed to indicate that they are recalibrating GQ, not just filtering.
- I should probably put in a soft-filter option (just recalibrate GQ, don't set GT to no-call)
- Probably the output should be called something other than GQ. Phred-scaling is a bad match to probabilities near 50%, but people expect GQ to be Phred-scaled.
- Many of the default values are set at non-optimal values. I didn't want to rebuild the docker image each time I tweaked values, so those were tweaked from WDL settings instead. They should be set to something resembling "optimal" before final merge.
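The Phred-scaling concern above can be illustrated with a short Python sketch. It assumes the standard Phred convention, GQ = -10 * log10(P(genotype is wrong)), rounded to an integer and capped at 99. The helper names `prob_to_phred` and `phred_to_prob` are hypothetical and are not part of this PR.

```python
import math


def prob_to_phred(p_correct: float, max_gq: int = 99) -> int:
    """Convert P(genotype correct) to an integer Phred-scaled quality (assumed convention)."""
    p_error = 1.0 - p_correct
    if p_error <= 0.0:
        # A perfect call would be infinite on the Phred scale, so cap it.
        return max_gq
    return min(max_gq, round(-10.0 * math.log10(p_error)))


def phred_to_prob(gq: int) -> float:
    """Convert an integer Phred-scaled quality back to P(genotype correct)."""
    return 1.0 - 10.0 ** (-gq / 10.0)


if __name__ == "__main__":
    # Print how a range of probabilities maps onto integer Phred values.
    for p in (0.0, 0.25, 0.45, 0.5, 0.55, 0.9, 0.99, 0.999):
        print(f"P(correct)={p:<6} -> GQ={prob_to_phred(p)}")
```

Under this convention, every probability from 0 up to roughly 0.55 lands on one of only four integer values (0 through 3), while the interval from 0.9 to 1 is spread over 10 through 99. A model whose outputs cluster near 50% therefore loses most of its resolution when the result is stored as an integer Phred-scaled GQ.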
GitHub Actions tests reported job failures from actions build 2616727886. Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
cloud | 8 | 2616727886.10 | logs |
unit | 8 | 2616727886.1 | logs |
conda | 8 | 2616727886.3 | logs |
variantcalling | 8 | 2616727886.2 | logs |
integration | 8 | 2616727886.0 | logs |
GitHub Actions tests reported job failures from actions build 3024497902. Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
cloud | 8 | 3024497902.10 | logs |
unit | 8 | 3024497902.1 | logs |
conda | 8 | 3024497902.3 | logs |
integration | 8 | 3024497902.0 | logs |
variantcalling | 8 | 3024497902.2 | logs |
GitHub Actions tests reported job failures from actions build 3024517679. Failures in the following jobs:
Test Type | JDK | Job ID | Logs |
---|---|---|---|
cloud | 8 | 3024517679.10 | logs |
unit | 8 | 3024517679.1 | logs |
conda | 8 | 3024517679.3 | logs |
variantcalling | 8 | 3024517679.2 | logs |
integration | 8 | 3024517679.0 | logs |