
CI/CD Pipeline: Ensuring Builds Use Most Current Data

Open callahantiff opened this issue 3 years ago • 3 comments

TASK

Currently, the build downloads its data from the URLs listed in builds/data_to_download.txt. While this works for 90% of the existing data, a few data providers include explicit versions in their URLs. As of now, this means that unless we update this text file, we are not guaranteed to get the most current data. Additionally, some of the downloads rely on running a query against a data provider's API. This should always return the most up-to-date data, but we should verify this as well.

The following resources include explicit versions in the URLs and will need updates to resolve the aforementioned problem:

  • ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.gtf.gz ➞ Ensembl
  • ftp://ftp.ensembl.org/pub/release-102/tsv/homo_sapiens/Homo_sapiens.GRCh38.102.uniprot.tsv.gz ➞ Ensembl
  • ftp://ftp.ensembl.org/pub/release-102/tsv/homo_sapiens/Homo_sapiens.GRCh38.102.entrez.tsv.gz ➞ Ensembl
  • ftp://nlmpubs.nlm.nih.gov/online/mesh/rdf/2021/mesh2021.nt ➞ MeSH
  • GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct ➞ GTEx
  • 9606.protein.links.v11.0.txt.gz ➞ STRING
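
One possible way to remove the pinned release numbers is to store URL templates and fill in the release at build time. The following is only a minimal sketch: the `{version}` placeholder convention, the `ENSEMBL_TEMPLATES` list, and the `fill_version` helper are hypothetical and are not part of the current builds/data_to_download.txt format.

```python
# Hypothetical templates: the pinned "102" in the Ensembl URLs listed above
# is replaced with a {version} placeholder (an assumed convention).
ENSEMBL_TEMPLATES = [
    "ftp://ftp.ensembl.org/pub/release-{version}/gtf/homo_sapiens/Homo_sapiens.GRCh38.{version}.gtf.gz",
    "ftp://ftp.ensembl.org/pub/release-{version}/tsv/homo_sapiens/Homo_sapiens.GRCh38.{version}.uniprot.tsv.gz",
    "ftp://ftp.ensembl.org/pub/release-{version}/tsv/homo_sapiens/Homo_sapiens.GRCh38.{version}.entrez.tsv.gz",
]


def fill_version(template: str, version: str) -> str:
    """Return the template with every {version} placeholder replaced."""
    return template.replace("{version}", version)


# Filling in release "102" reproduces the URLs currently in the list.
urls = [fill_version(template, "102") for template in ENSEMBL_TEMPLATES]
```

With this layout, only the version string has to change between builds; where that string comes from is left open here.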

The following resources are generated from querying an API:



TODO

  • [ ] Modify the download code for explicitly versioned URLs to ensure that we are always getting the most updated data
  • [ ] Verify that resources downloaded via API queries will also return the most updated results

callahantiff, Feb 05 '21 16:02

Check out the bioversions project; I'm working on similar stuff for solving this problem. Unfortunately, the state of versioned biomedical data is just as lacking as most other things 🤡

cthoyt, Feb 15 '21 22:02

@cthoyt - brilliant, yes! Will definitely work on this for upcoming releases. Thanks for pointing this out!

callahantiff, Feb 16 '21 22:02

@callahantiff please let me know if there are any resources you're using that aren't supported by bioversions already and I will add them. The syntax to get the current version for one is:

import bioversions

# Look up the current version string for a named resource
version_string = bioversions.get_version('resource name')

cthoyt, Feb 23 '21 14:02
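
As a rough illustration of how a version lookup like the one above could feed a CI check, the sketch below compares the versions pinned in the download list against the versions reported by a lookup function and returns the resources that are out of date. Everything here is an assumption for illustration: the `find_outdated` helper, the resource names used as keys, and the stub version numbers are hypothetical, and the format of the version strings returned by a real lookup may differ from the pinned ones.

```python
from typing import Callable, Dict, Tuple


def find_outdated(
    pinned: Dict[str, str],
    get_version: Callable[[str], str],
) -> Dict[str, Tuple[str, str]]:
    """Map each stale resource to its (pinned, current) version pair.

    `get_version` is any callable that returns the current version string
    for a resource name, for example bioversions.get_version.
    """
    outdated = {}
    for resource, pinned_version in pinned.items():
        current_version = get_version(resource)
        if current_version != pinned_version:
            outdated[resource] = (pinned_version, current_version)
    return outdated


# Versions currently hard-coded in the URLs from the issue.
pinned = {"ensembl": "102", "mesh": "2021"}

# Stub lookup so the example runs offline; the "103" is invented for the
# demonstration and is not a claim about any real release.
stub_current = {"ensembl": "103", "mesh": "2021"}
outdated = find_outdated(pinned, stub_current.__getitem__)
```

In a real build, the stub could be swapped for `bioversions.get_version`, which needs network access; a non-empty result would then signal that the download list should be updated.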