
Update ClinVar pipeline

rileyhgrant opened this issue on Jun 14, 2024 · 2 comments

ClinVar is an external data source that aggregates and curates information about the clinical impact of variants. Their dataset updates frequently, and the gnomAD browser aims to keep the ClinVar information it displays up to date -- currently we aim for monthly updates, though ClinVar has weekly releases.

Currently, we have two pipelines, each of which uses a Dataproc cluster that we create with a different version of VEP (the Variant Effect Predictor): v85 for GRCh37 and v105 for GRCh38. Each of these clusters gets spun up with 32 secondary workers, as VEP is computationally expensive and the additional workers help complete these tasks in a reasonable amount of time.
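For context, the cluster creation looks roughly like the sketch below. This is illustrative, not our actual deploy script -- the cluster name, region, and exact flags should be checked against the version of hailctl we pin.

```python
# Illustrative only -- not the actual deploy script. The cluster name and flag
# values are placeholders; flag names follow hailctl's Dataproc wrapper and may
# differ slightly between hailctl versions.
import subprocess

subprocess.run(
    [
        "hailctl", "dataproc", "start", "clinvar-grch38",
        "--vep", "GRCh38",                # Hail's built-in VEP install helper
        "--num-secondary-workers", "32",  # extra workers for the expensive VEP step
        "--region", "us-east1",
    ],
    check=True,
)
```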

There are a few things about these pipelines that need to change, and a few that could change.

Need to change:

  1. ClinVar has updated the format of its XML release and is continuing to support the old format through roughly the end of this month (June 2024). After this month, our pipeline step that parses the XML input will no longer work. This must be updated (see the parsing sketch after this list).

    ClinVar also has a regular VCF release; as best I can tell, it does not fit our needs, because that release does not include all of the variants and does not include information about the individual ClinVar submissions. The XML, in contrast, contains all of this information.

  2. Hail no longer supports our region for hailctl's built-in --vep argument. Due to multi-region bucket egress costs, the Hail team moved to supporting specific regions, most relevantly us-central1, for their built-in helper. All of our infrastructure is in us-east1, which is not supported. This is only a problem for GRCh37, as our GRCh38 pipeline uses a custom install script. We could either move towards a custom install script for GRCh37 (see the hl.vep sketch after this list), and/or talk with the Hail team about supporting us-east1.
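On the first point above, the parsing itself should not need anything beyond the standard library. Here is a minimal sketch of streaming the release, assuming the new-format file still delivers one top-level VariationArchive element per variant -- the element and attribute names below need to be verified against the new schema, and the file path is a placeholder:

```python
# Minimal sketch of streaming a ClinVar XML release with the standard library.
# Assumes each variant is still wrapped in a top-level <VariationArchive>
# element; the attributes pulled out below are placeholders to verify against
# the new schema, and the file path is a placeholder.
import gzip
import xml.etree.ElementTree as ET


def stream_variation_archives(path):
    """Yield one dict per VariationArchive record without loading the whole file."""
    with gzip.open(path, "rb") as f:
        for _event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag != "VariationArchive":
                continue
            yield {
                "variation_id": elem.get("VariationID"),
                "name": elem.get("VariationName"),
            }
            elem.clear()  # drop the parsed subtree to keep memory flat


if __name__ == "__main__":
    for i, record in enumerate(stream_variation_archives("clinvar_release.xml.gz")):
        print(record)
        if i >= 4:
            break
```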
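On the second point, if we go the custom-install route for GRCh37 as we already do for GRCh38, the Hail-side change is small: point hl.vep at our own config JSON instead of relying on the one the --vep helper sets up. The paths below are hypothetical, and the VEP v85 install itself would be handled by a cluster init script:

```python
# Sketch of VEP-annotating the ClinVar Hail Table with a custom VEP install
# rather than hailctl's built-in --vep helper. Both paths are hypothetical; the
# config JSON would be written by whatever init script installs VEP v85.
import hail as hl

hl.init()

ht = hl.read_table("gs://example-bucket/ClinVar/clinvar_grch37.ht")
ht = hl.vep(ht, config="file:///vep_data/vep85-GRCh37-config.json")
ht.write("gs://example-bucket/ClinVar/clinvar_grch37_vepped.ht", overwrite=True)
```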

Could change:

  1. The first step, which parses the XML, takes a long time because it only leverages a single worker, and it is not fault tolerant.

    The first step of this pipeline parses the weekly ClinVar XML release and turns it into a Hail Table, which the rest of the pipeline acts on. The clusters we create have historically had a large number of secondary workers that sit around idle for the first ~50 minutes while the single main worker chews through the XML. We could instead split this into three steps: one to chew through the XML, then one each to VEP-annotate the resulting Hail Table for GRCh37 and GRCh38. Alternately / additionally, we could work on making this step leverage multiple workers so the XML parsing completes more quickly.

    This step is not fault tolerant -- if a new classification or category is introduced, or a new bit of malformed XML appears in the input ClinVar weekly XML release (neither of which is uncommon), our pipeline crashes and loses the work it has done so far. I see arguments for both sides here: on one hand, it's nice to know that something unexpected has been introduced and to explicitly handle the new case; on the other hand, crashing 45 minutes into the run is maybe not ideal (see the error-handling sketch after this list).

  2. Because our dependencies are not fully pinned, it is difficult to get the exact same pipeline to run twice. This will largely, if not completely, be resolved by moving to a model of spinning up a Dataproc cluster from a particular image. For now, pinning hail in the requirements ends up pinning all of Hail's dependencies transitively -- since we call hailctl dataproc ..., the gcloud command that hailctl invokes passes Hail's list of pinned dependencies to the Dataproc cluster.
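On the fault-tolerance point in item 1, one option is a per-record error boundary: keep going when a single record is unexpected, and surface the failures at the end instead of crashing partway through. A sketch, with illustrative names:

```python
# Sketch of a per-record error boundary for the XML parsing step. Rather than
# letting one unexpected classification or malformed record kill the whole run,
# collect the failures and report them at the end. Names are illustrative.
import logging

logger = logging.getLogger(__name__)


def parse_records_fault_tolerantly(records, parse_one):
    """Apply parse_one to each raw record, collecting failures instead of raising.

    records: iterable of raw XML record elements (e.g. from the sketch above)
    parse_one: function mapping a raw record to the row dict we want
    """
    parsed, failures = [], []
    for index, raw in enumerate(records):
        try:
            parsed.append(parse_one(raw))
        except Exception as exc:  # deliberately broad: keep the run alive
            failures.append((index, repr(exc)))
    if failures:
        logger.warning("skipped %d of %d records", len(failures), index + 1)
    return parsed, failures
```

Whether the pipeline should then hard-fail if any records were skipped, or just log and continue, is exactly the trade-off described above.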

rileyhgrant · Jun 14 '24 14:06