Document schema/fields in browser release and gene model HTs

Open ch-kr opened this issue 1 year ago • 3 comments

Hi all, just following up on today's browser meeting: would it be possible for someone on the team to document the fields present in the browser release HT (not sure what you call this on your team, but the combined version of the exomes and genomes sites HTs our team sends for release) and the gene model HT? Here is an example of how we documented the schema for the v4 exomes release and the v4 HT Help page.

ch-kr, Jan 04 '24 20:01

Here's a link to a Google Doc containing the marked-up schema, which documents the shape of the Hail Tables and the meaning of each field.

https://docs.google.com/document/d/1zP5yErlmoNHOL3HhdUVuBbNCZskjaCysZFrZE7uAqbs/edit?usp=sharing

rileyhgrant, Feb 08 '24 21:02

thank you for creating this document! I've added comments and suggestions.

One higher-level question for this schema: the site quality metrics histograms shown on the variant pages display adj metrics, right? (That is, metrics calculated using only high-quality genotypes; the frequencies we display in the browser are all adj-filtered.) If so, then you shouldn't need the raw qual hists in the browser table, since they don't get loaded.

ch-kr, Feb 14 '24 18:02

thanks to Riley for sharing the code used to create these tables (also sharing here to track for future reference)

  • https://github.com/broadinstitute/gnomad-browser/blob/main/data-pipeline/src/data_pipeline/pipelines/gnomad_v4_variants.py

  • https://github.com/broadinstitute/gnomad-browser/blob/main/data-pipeline/src/data_pipeline/datasets/gnomad_v4/gnomad_v4_variants.py

  • https://github.com/broadinstitute/gnomad-browser/blob/main/data-pipeline/src/data_pipeline/pipelines/genes.py

  • https://github.com/broadinstitute/gnomad-browser/blob/main/data-pipeline/src/data_pipeline/data_types/gene.py

I have a couple questions about these two tables:

  • It looks like there are some checks put in place to validate the gnomad variants table. From a quick glance, the checks look largely like formatting checks (schema, checking for malformed data) -- is that correct? Are there any other checks that I've missed?
  • Are there any checks for the gene model table?
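
To make the distinction in the first question concrete, here is a minimal sketch of what a formatting-level check (schema and malformed data, as opposed to a check on the content of the values) could look like in plain Python. The field names, `EXPECTED_SCHEMA`, and `validate_row` are hypothetical and chosen for illustration only; they are not taken from the browser's data pipeline or from the actual table schema.

```python
from typing import Any, Dict, List

# Hypothetical expected schema: field name -> required Python type.
# These names are illustrative only, not the real browser table fields.
EXPECTED_SCHEMA: Dict[str, type] = {
    "variant_id": str,
    "chrom": str,
    "pos": int,
}


def validate_row(row: Dict[str, Any]) -> List[str]:
    """Return a list of formatting problems found in one row."""
    problems: List[str] = []
    # Every expected field must be present and have the expected type.
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], expected_type):
            actual = type(row[field]).__name__
            problems.append(f"wrong type for {field}: {actual}")
    # Flag fields that the schema does not know about.
    for field in row:
        if field not in EXPECTED_SCHEMA:
            problems.append(f"unexpected field: {field}")
    return problems
```

A check of this kind only confirms that a row has the expected shape. It says nothing about whether the values are plausible, which is the sort of additional check the question is asking about.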

I also have one comment about the gene model table (cc @mattsolo1): it seems like the GRCh37 version of this table should be stable (we shouldn't be updating it), so releasing that one sounds good to me. Given that GRCh38 constraint is experimental, however, I vote we remove all constraint annotations from this table prior to public release. We can add them to the table in the future after we've made more updates and simply overwrite the existing resource. What do you think?

ch-kr, Feb 23 '24 15:02