gnomad-browser
Allow Gene pipeline to produce smaller test datasets
Related: #1042
Currently, loading data into Elasticsearch requires running a set of pipelines, and running them loads the full gnomAD dataset into ES.
This complicates creating temporary, isolated dev environments. Ideally, each of these pipelines could load only a subset of its resulting dataset. A developer could then spin up a dev Kubernetes cluster, load a smaller but representative set of data (taking less time and using less compute than the full dataset), deploy and test changes, and create demo instances, all without touching the production gnomAD deployment.
To make it easier to create dev environments where changes can be tested safely, we should work towards giving every pipeline an option to output a smaller but replicable dataset.
The genes pipeline is a good place to start, because other pipelines need its output in order to run. We should add functionality to create a smaller output dataset that includes some pre-specified genes as well as a seeded random subset of the full set of genes.
Doing the genes pipeline first will establish the changes a pipeline needs and provide a reference for modifying the remaining pipelines.
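As a rough illustration of the selection step proposed above (pre-specified genes plus a seeded random subset), here is a minimal sketch in plain Python. The function name `select_test_genes`, its parameters, and the idea of working on a flat collection of gene ID strings are assumptions made for this example, not existing pipeline code; the real pipeline would apply the resulting ID list as a filter on whatever table it builds.

```python
import random
from typing import Iterable, List


def select_test_genes(
    all_gene_ids: Iterable[str],
    required_gene_ids: Iterable[str],
    sample_size: int,
    seed: int = 0,
) -> List[str]:
    """Return required genes plus a seeded random sample of the rest.

    Hypothetical helper: names and signature are illustrative only.
    """
    universe = set(all_gene_ids)
    required = set(required_gene_ids)

    # Fail early if a pre-specified gene is not in the full gene set.
    missing = required - universe
    if missing:
        raise ValueError(f"Required genes not found: {sorted(missing)}")

    # Sort candidates so the sample does not depend on input order.
    candidates = sorted(universe - required)

    # A dedicated Random instance keeps the seed local to this call.
    rng = random.Random(seed)
    sampled = rng.sample(candidates, min(sample_size, len(candidates)))

    return sorted(required | set(sampled))
```

Sorting the candidates before sampling means the same seed yields the same subset regardless of the order in which gene IDs arrive, which is what makes the test dataset replicable across runs.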