hail
[query] Feature Request: hl.vds.sample_qc_agg
What happened?
gnomAD team asks:
We would like to get these same sample stats broken down by different variant stratifications. Essentially, this is like what we do for the frequencies, but for samples. With frequencies we can use hl.agg.filter with hl.agg.call_stats to get what we want, but there isn’t really an equivalent for samples, since hl.sample_qc and hl.vds.sample_qc take MTs and return HTs rather than taking expressions and returning aggregation expressions. Would the Hail team have the bandwidth in the next couple of weeks to modify hl.vds.sample_qc so that it functions similarly to hl.agg.call_stats?
My thoughts on it:
It's a reasonable ask. You'll have four parts (first three are aggregators): rmt_sq, vmt_sq, ac_and_atype, and combine. The user must ensure the same filters, groups, etc. are applied to each aggregator. If you group, you'll need to ensure the right grouped AC is passed around. It might look like this on the variant matrix table. The reference stuff looks similar.
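# Convert local-allele genotypes (LGT, LA) to global genotypes.
vmt = vmt.annotate_entries(GT=hl.vds.lgt_to_gt(vmt.LGT, vmt.LA))
# Per-variant allele counts and allele types, grouped by the same key (foo) used below.
vmt = vmt.annotate_rows(ac_atype=hl.agg.group_by(foo, ac_and_atype(vmt.GT, vmt.alleles)))
vmt = vmt.annotate_cols(
    qc=hl.agg.group_by(
        foo,
        vmt_sample_qc(
            global_gt=vmt.GT,
            gq=vmt.GQ,
            ac=vmt.ac_atype[foo].ac,
            atype=vmt.ac_atype[foo].atype,
            dp=vmt.DP,
        ),
    )
)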
The atype needn't really be grouped. That should maybe be its own aggregator and then you can use stats directly to get a valid AC.
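To make the "right grouped AC" point concrete, here is a plain-Python toy model (not Hail code, and not the implementation in the linked branch). It assumes each entry carries its own stratum label, standing in for `foo`, and that a singleton is an alt call at a site whose allele count within the same stratum is 1. The data, the `hi`/`lo` labels, and the function names are all made up for illustration.

```python
from collections import defaultdict

# Toy entries: (variant, sample, stratum, number of alt alleles carried).
# The stratum stands in for `foo`; every value here is hypothetical.
ENTRIES = [
    ("v1", "s1", "hi", 1),
    ("v1", "s2", "lo", 1),
    ("v2", "s1", "hi", 1),
    ("v2", "s2", "hi", 1),
    ("v3", "s1", "lo", 0),
    ("v3", "s2", "lo", 1),
]


def grouped_ac(entries):
    """Row-side aggregation: alt allele count per (variant, stratum)."""
    ac = defaultdict(int)
    for variant, _sample, stratum, n_alt in entries:
        ac[(variant, stratum)] += n_alt
    return dict(ac)


def grouped_n_singleton(entries, ac):
    """Column-side aggregation: singleton count per (sample, stratum).

    The AC lookup uses the same stratum as the entry being counted,
    which is the invariant the user has to maintain by hand.
    """
    n_singleton = defaultdict(int)
    for variant, sample, stratum, n_alt in entries:
        if n_alt > 0 and ac[(variant, stratum)] == 1:
            n_singleton[(sample, stratum)] += 1
    return dict(n_singleton)


ac = grouped_ac(ENTRIES)
n_singleton = grouped_n_singleton(ENTRIES, ac)
print(n_singleton)
```

In this toy data, v1 has an ungrouped allele count of 2, so neither sample would get a singleton from it if the ungrouped AC were used. Within each stratum its AC is 1, so both carriers do. That difference is why the grouped AC, not the global one, has to be handed to the per-sample aggregator.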
I sketched this here: https://github.com/hail-is/hail/compare/main...danking:hail:agg-sample-qc but there are probably bugs in that.
This issue is complete when we merge a PR that:
- Exposes one or more aggregators which compute all the statistics from hl.vds.sample_qc on the component matrix tables of a VDS: the variant data and the reference data.
- Exposes a combination function which combines reference and variant stats to produce bases_over_dp_threshold and bases_over_gq_threshold.
- Includes extensive tests comparing these aggregators to the original hl.vds.sample_qc.
- Includes clear documentation with several examples of how to use these new aggregators and the combination function.
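As a rough illustration of what the combination function might do, here is a plain-Python sketch. It assumes, without confirmation from this issue, that each side reports one count per threshold and that the combined statistic is the elementwise sum of the reference-side and variant-side counts. The function names, dictionary layout, and numbers are hypothetical and are not the API merged in the eventual PR.

```python
def combine_threshold_counts(ref_counts, var_counts):
    """Elementwise sum of per-threshold counts from the two components.

    Assumption: both sequences use the same thresholds in the same order.
    """
    if len(ref_counts) != len(var_counts):
        raise ValueError("reference and variant stats use different threshold lists")
    return tuple(r + v for r, v in zip(ref_counts, var_counts))


def combine(ref_stats, var_stats):
    """Combine the two threshold statistics for a single sample."""
    fields = ("bases_over_gq_threshold", "bases_over_dp_threshold")
    return {f: combine_threshold_counts(ref_stats[f], var_stats[f]) for f in fields}


# Made-up per-sample stats: three GQ thresholds and five DP thresholds.
ref_stats = {
    "bases_over_gq_threshold": (1000, 800, 500),
    "bases_over_dp_threshold": (1000, 950, 700, 400, 100),
}
var_stats = {
    "bases_over_gq_threshold": (20, 15, 5),
    "bases_over_dp_threshold": (20, 20, 18, 10, 3),
}

combined = combine(ref_stats, var_stats)
print(combined)
```

When the stats are grouped, the same combination would be applied once per group key, which again requires the reference and variant aggregations to have used identical filters and groupings.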
Version
0.2.127
Relevant log output
No response
Hail team suggests this is a good opportunity to work together with gnomAD team to transfer knowledge on how to build these kinds of aggregators. I will talk to gnomAD team and see if there's interest.
Daniel Marten will collaborate with Chris Vittal on this project. Daniel Marten will reach out to find a time to meet with Chris. Goal is to pair program to implement an aggregator-version of hl.vds.sample_qc. Secondary goal is to transmit some understanding of how to build these kinds of aggregators in general so that gnomAD team can feel confident writing custom aggregators too!
Will be fun! Planning on meeting with Chris Vittal on this project today, Tuesday Feb 13th. Let me know if you'd like me to drop my contact info in the ticket or assign me or anything.
Implemented in #14297