
Use GA4GH VRS identifier as Elasticsearch document ID for variants?

Open nawatts opened this issue 3 years ago • 1 comments

To avoid the possibility of indexing multiple copies of a document in Elasticsearch, each document must have a unique ID.

Currently, the document ID used for variants is a compressed form of the chrom-pos-ref-alt variant ID.

https://github.com/broadinstitute/gnomad-browser/blob/95f5d24f540f8132dc9cf546226ae9581bb095bb/data-pipeline/src/data_pipeline/pipelines/export_to_elasticsearch.py#L44-L45

https://github.com/broadinstitute/gnomad-browser/blob/95f5d24f540f8132dc9cf546226ae9581bb095bb/data-pipeline/src/data_pipeline/data_types/variant/variant_id.py#L68-L94

The variant ID itself cannot be used because Elasticsearch (ES) document IDs are limited to 512 bytes, and a variant ID grows with the length of its ref and alt alleles.

https://www.elastic.co/guide/en/elasticsearch/reference/7.x/mapping-id-field.html
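A minimal sketch of the size problem, assuming the ID is checked as UTF-8 bytes. The helper name `fits_as_document_id` and both example variant IDs are hypothetical and are not taken from the pipeline:

```python
# Elasticsearch limits the _id field to 512 bytes.
ES_MAX_ID_BYTES = 512


def fits_as_document_id(variant_id: str) -> bool:
    """Return True if the ID is short enough to be used as an ES document ID."""
    return len(variant_id.encode("utf-8")) <= ES_MAX_ID_BYTES


# Hypothetical chrom-pos-ref-alt IDs, for illustration only.
snv = "1-55516888-G-A"
large_deletion = "1-55516888-" + "ACGT" * 150 + "-A"  # 600-base ref allele

print(fits_as_document_id(snv))             # short ID fits
print(fits_as_document_id(large_deletion))  # long ref allele exceeds the limit
```

Any scheme that only shortens the ID proportionally to its input still has some allele length at which it crosses the limit.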

However, the compression currently used is not guaranteed to bring longer variant IDs, such as those from larger indels, under that limit.

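GA4GH VRS computed identifiers don't have this problem: they are built from a fixed-length digest, so their size does not depend on allele length.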

https://vrs.ga4gh.org/en/latest/impl-guide/computed_identifiers.html
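For illustration, here is a sketch of the truncated digest (`sha512t24u`) that VRS computed identifiers are built on: SHA-512, truncated to 24 bytes, base64url-encoded. This is not a full VRS implementation. A real identifier is computed over the canonical serialization of a VRS object, which is not reproduced here; `placeholder_document_id` and its input blob are hypothetical stand-ins that only show the fixed output length.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, base64url-encoded (32 chars)."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def placeholder_document_id(serialized_variant: bytes) -> str:
    """Hypothetical helper: prefix the digest in the style of a VRS Allele ID.

    Assumption: `serialized_variant` stands in for the canonical VRS
    serialization, which this sketch does not implement.
    """
    return "ga4gh:VA." + sha512t24u(serialized_variant)


small = placeholder_document_id(b"1-55516888-G-A")
large = placeholder_document_id(b"1-55516888-" + b"ACGT" * 10_000 + b"-A")

# Both IDs have the same length regardless of input size.
print(len(small), len(large))
```

Because the output length is constant, the resulting ID stays well under the 512-byte limit no matter how large the indel is.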

nawatts avatar Jan 03 '22 22:01 nawatts

Related to #658.

nawatts avatar Jan 03 '22 22:01 nawatts