
[query] Feature Request: hl.vds.sample_qc_agg

Open danking opened this issue 4 months ago • 3 comments

What happened?

gnomAD team asks:

We would like to get these same sample stats broken down by different variant stratifications. This is essentially what we do for the frequencies, but for samples. With frequencies we can use hl.agg.filter with hl.agg.call_stats to get what we want, but there isn’t really an equivalent for samples, since hl.sample_qc and hl.vds.sample_qc take MTs and return HTs rather than taking expressions and returning aggregation expressions. Would the Hail team have the bandwidth in the next couple of weeks to modify hl.vds.sample_qc so that it functions similarly to hl.agg.call_stats?

My thoughts on it:

It's a reasonable ask. There would be four parts, the first three of which are aggregators: rmt_sq, vmt_sq, ac_and_atype, and combine. The user must ensure that the same filters, groups, etc. are applied to each aggregator. If you group, you'll also need to make sure the AC from the matching group is passed around. It might look like this on the variant matrix table (the reference data side looks similar):

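import hail as hl

# Sketch only: ac_and_atype and vmt_sample_qc are the proposed aggregators.
# vmt is the variant data matrix table of a VDS; foo is the grouping expression.
vmt = vmt.annotate_entries(GT=hl.vds.lgt_to_gt(vmt.LGT, vmt.LA))
# Row aggregation: AC and allele type per group.
vmt = vmt.annotate_rows(ac_atype=hl.agg.group_by(foo, ac_and_atype(vmt.GT, vmt.alleles)))
# Column aggregation: per-sample QC per group, reading the AC of the same group.
vmt = vmt.annotate_cols(
    qc=hl.agg.group_by(
        foo,
        vmt_sample_qc(
            global_gt=vmt.GT,
            gq=vmt.GQ,
            ac=vmt.ac_atype[foo].ac,
            atype=vmt.ac_atype[foo].atype,
            dp=vmt.DP)
    )
)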

The atype needn't really be grouped. That should maybe be its own aggregator and then you can use stats directly to get a valid AC.
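
To make the grouped-AC point concrete, here is a toy in plain Python (no Hail involved; the entries, group labels, and variable names are all made up for illustration). It assumes a singleton means a non-reference call at a variant whose AC is 1, and it looks the AC up within the same group the entry belongs to.

```python
from collections import defaultdict

# Toy genotype entries: (variant, sample, group, number of alt alleles).
# The group label plays the role of `foo` in the sketch above.
entries = [
    ("v1", "s1", "high_gq", 1),
    ("v1", "s2", "low_gq", 1),
    ("v2", "s1", "high_gq", 1),
    ("v2", "s2", "high_gq", 2),
    ("v3", "s2", "low_gq", 1),
]

# Pass 1 (rows): allele count per variant, grouped.
ac = defaultdict(int)
for variant, _sample, group, n_alt in entries:
    ac[(variant, group)] += n_alt

# Pass 2 (columns): singleton count per sample and group, using the AC
# from the same group as the entry being counted.
n_singleton = defaultdict(int)
for variant, sample, group, n_alt in entries:
    if n_alt > 0 and ac[(variant, group)] == 1:
        n_singleton[(sample, group)] += 1

print(dict(n_singleton))
```

With an ungrouped AC, v1 would have AC 2 and neither sample would be credited with a singleton there, which is why the row-side and column-side groupings have to agree.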

I sketched this here: https://github.com/hail-is/hail/compare/main...danking:hail:agg-sample-qc but that sketch probably has bugs. This issue is complete when we merge a PR that:

  1. Exposes one or more aggregators which compute all the statistics from hl.vds.sample_qc on the component matrix tables of a VDS: the variant data and the reference data.
  2. Exposes a combination function which combines reference and variant stats to produce bases_over_dp_threshold and bases_over_gq_threshold.
  3. Includes extensive tests comparing these aggregators to the original hl.vds.sample_qc.
  4. Includes clear documentation with several examples of how to use these new aggregators and the combination function.
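
For item 2, a minimal sketch of what the combination step might compute, written in plain Python rather than Hail expressions. It assumes the combined value is the elementwise sum of the reference and variant counts, and that both sides use the same thresholds in the same order. ThresholdCounts and combine_threshold_counts are hypothetical names, not existing Hail API.

```python
from typing import NamedTuple, Tuple


class ThresholdCounts(NamedTuple):
    """Hypothetical per-sample container; not a Hail type."""
    bases_over_gq_threshold: Tuple[int, ...]
    bases_over_dp_threshold: Tuple[int, ...]


def _add_elementwise(a: Tuple[int, ...], b: Tuple[int, ...]) -> Tuple[int, ...]:
    # Both sides must have been computed with the same thresholds.
    if len(a) != len(b):
        raise ValueError("reference and variant stats use different thresholds")
    return tuple(x + y for x, y in zip(a, b))


def combine_threshold_counts(ref: ThresholdCounts, var: ThresholdCounts) -> ThresholdCounts:
    """Combine reference-side and variant-side counts (assumed: elementwise sum)."""
    return ThresholdCounts(
        bases_over_gq_threshold=_add_elementwise(
            ref.bases_over_gq_threshold, var.bases_over_gq_threshold
        ),
        bases_over_dp_threshold=_add_elementwise(
            ref.bases_over_dp_threshold, var.bases_over_dp_threshold
        ),
    )


# Made-up counts for one sample in one group.
ref_stats = ThresholdCounts((100, 80), (90, 70, 50))
var_stats = ThresholdCounts((5, 4), (3, 2, 1))
combined = combine_threshold_counts(ref_stats, var_stats)
```

In a grouped setting, the combination would be applied per group, pairing the reference and variant stats that share a group key.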

Version

0.2.127

Relevant log output

No response

danking, Feb 07 '24 19:02