gnomad_methods
Block gzipped GENCODE files
The GENCODE GTF files associated with gnomAD annotations are occasionally useful. For example, they are needed to get the gene and transcript version numbers for VEP annotations of Ensembl transcripts. They can also be used to get the interval for a particular gene or transcript, which can then be used to efficiently filter variant Hail Tables.
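As a rough illustration of the two uses above, the following standard-library sketch pulls the versioned gene and transcript IDs and the interval out of a single GTF line. The function name `parse_gtf_line` and the sample line are illustrative assumptions, not part of gnomAD or Hail. It assumes the GENCODE convention of `key "value";` attributes and IDs of the form `ENSG….N`.

```python
import re

# GENCODE quotes string-valued attributes (gene_id "..."); unquoted numeric
# attributes such as `level 2` are deliberately ignored by this pattern.
_ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def parse_gtf_line(line):
    """Return a dict describing one GTF feature, or None for comments/malformed lines."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(fields) != 9:
        return None
    contig, _source, feature, start, end, _score, strand, _frame, raw_attrs = fields

    attributes = {}
    for key, value in _ATTRIBUTE.findall(raw_attrs):
        # Keys such as `tag` can repeat; this sketch keeps only the first value.
        attributes.setdefault(key, value)

    record = {
        "feature": feature,
        "contig": contig,
        "start": int(start),  # GTF coordinates are 1-based and inclusive
        "end": int(end),
        "strand": strand,
        "gene_name": attributes.get("gene_name"),
    }
    for key in ("gene_id", "transcript_id"):
        full_id = attributes.get(key)
        if full_id:
            # Split "ENSG00000223972.5" into the stable ID and its version.
            # Suffixes such as "_PAR_Y" would remain attached to the version.
            stable_id, _, version = full_id.partition(".")
            record[key] = stable_id
            record[key + "_version"] = version
    return record


# Illustrative line in GENCODE layout (not read from any particular release file).
example_line = (
    "chr1\tHAVANA\ttranscript\t11869\t14409\t.\t+\t.\t"
    'gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; '
    'gene_name "DDX11L1"; level 2;'
)
example_record = parse_gtf_line(example_line)
```

This line-by-line approach is only meant to show what information the GTF carries. For whole files, importing into Hail is the more practical route.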
However, the files hosted by GENCODE are not block gzipped. Thus, they are slow to import into Hail because the import cannot be parallelized. To make working with this data in Hail easier, it would be nice if block gzipped versions of the relevant GENCODE files were available in the gnomAD public buckets.
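To show why block gzip allows a parallel import while plain gzip does not, here is a minimal sketch of the BGZF layout. The data is cut into chunks of under 64 KiB, and each chunk becomes its own gzip member whose header records the member's size, so a reader can start decompressing at any block boundary. The function names are hypothetical. In practice a tool such as htslib's `bgzip` would be used for the conversion; this block only illustrates the format.

```python
import gzip
import struct
import zlib

# Uncompressed bytes per block; kept below 64 KiB so the compressed block
# (header + data + trailer) always fits the 16-bit block-size field.
MAX_BLOCK_INPUT = 0xFF00


def bgzf_block(data):
    """Compress `data` (at most MAX_BLOCK_INPUT bytes) into one BGZF block."""
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no zlib wrapper
    compressed = compressor.compress(data) + compressor.flush()
    # BSIZE is the total block length minus 1: 18-byte header + data + 8-byte trailer.
    bsize = len(compressed) + 25
    header = struct.pack(
        "<BBBBIBBHBBHH",
        31, 139,   # gzip magic bytes
        8,         # compression method: deflate
        4,         # flags: extra field present
        0,         # modification time
        0,         # extra flags
        255,       # operating system: unknown
        6,         # length of the extra field
        66, 67,    # subfield identifier "BC"
        2,         # subfield payload length
        bsize,
    )
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
    return header + compressed + trailer


def gzip_to_bgzf(gz_bytes):
    """Recompress ordinary gzip bytes as a sequence of BGZF blocks."""
    raw = gzip.decompress(gz_bytes)
    blocks = [
        bgzf_block(raw[offset:offset + MAX_BLOCK_INPUT])
        for offset in range(0, len(raw), MAX_BLOCK_INPUT)
    ]
    blocks.append(bgzf_block(b""))  # an empty block serves as the end-of-file marker
    return b"".join(blocks)


# Round trip on synthetic data: BGZF output is still valid multi-member gzip.
original = b"chr1\tHAVANA\tgene\t11869\t14409\n" * 5000
converted = gzip_to_bgzf(gzip.compress(original))
round_tripped = gzip.decompress(converted)
```

Because every block is a valid gzip member, the output remains readable by ordinary gzip tools, while block-aware readers can split the work across blocks.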
https://www.gencodegenes.org/human/releases.html https://gnomad.broadinstitute.org/help/what-version-of-gencode-was-used-to-annotate-variants
Hail has block gzipped versions of GENCODE v19 and v29 in gs://hail-common/references/gencode/. Those versions match the versions of VEP that hailctl dataproc provides. They are also used for hail.experimental.get_gene_intervals.
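A minimal sketch of how a block gzipped GTF might be used from Hail, assuming Hail is installed and initialized. The helper names `load_gencode` and `filter_to_gene` are hypothetical, and `gtf_path` is left as a parameter because the exact file names in the bucket are not listed here. `force_bgz=True` is only needed when a block gzipped file carries a `.gz` extension.

```python
def load_gencode(gtf_path, reference_genome):
    """Import a block gzipped GENCODE GTF as a Hail Table."""
    import hail as hl  # imported lazily so this module loads without Hail present

    return hl.experimental.import_gtf(
        gtf_path,
        reference_genome=reference_genome,
        skip_invalid_contigs=True,
        force_bgz=True,  # treat a .gz file as block gzipped so the import can be split
    )


def filter_to_gene(ht, gene_symbol, gtf_path, reference_genome):
    """Keep only the rows of a locus-keyed Table that fall within one gene."""
    import hail as hl

    intervals = hl.experimental.get_gene_intervals(
        gene_symbols=[gene_symbol],
        reference_genome=reference_genome,
        gtf_file=gtf_path,
    )
    return hl.filter_intervals(ht, intervals)
```

The reference genome has to match the GENCODE release in use: v19 annotates GRCh37 and v29 annotates GRCh38.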
Looks like Hail also has a few versions of GENCODE (v19 and v31) in their datasets library. Maybe they could add v35.