multitax
Changed GTDB metadata naming and format
I was attempting to map from NCBI to GTDB taxonomy. When building the translation, multitax was unable to download the GTDB metadata:
from multitax import GtdbTx, NcbiTx
ncbi = NcbiTx()
gtdb = GtdbTx()
ncbi.build_translation(gtdb)
Exception: One or more files could not be downloaded: https://data.gtdb.ecogenomic.org/releases/latest/ar53_metadata.tar.gz, https://data.gtdb.ecogenomic.org/releases/latest/bac120_metadata.tar.gz
For r214.1, the metadata is no longer a tarball; it appears to be a gzipped TSV: bac120_metadata.tsv.gz, ar53_metadata.tsv.gz. It looks like this would need different handling in build_translation, as well as in the code that extracts the tar members.
I'd be happy to put together a pull request to fix, if you're interested.
Thanks for reporting. Indeed, they changed it a while ago. A PR would be great! You would have to update the URLs and the parsing procedure, and the download_files function should be generalized to handle gzip-only files. A few days ago I fixed this exact bug in another tool; you can use it as an example.
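For reference, a rough sketch of what reading either format could look like, using only the Python standard library. This is not the multitax implementation: iter_metadata_lines is a hypothetical helper name, and it assumes the file has already been downloaded to a local path.

```python
import gzip
import io
import tarfile


def iter_metadata_lines(path):
    """Yield text lines from a .tar.gz archive or a plain gzipped file.

    Hypothetical helper for illustration; not part of multitax.
    """
    if tarfile.is_tarfile(path):
        # Older layout: a tarball holding one or more TSV members.
        with tarfile.open(path, mode="r:*") as tar:
            for member in tar.getmembers():
                if not member.isfile():
                    continue
                with tar.extractfile(member) as fh:
                    for line in io.TextIOWrapper(fh, encoding="utf-8"):
                        yield line.rstrip("\n")
    else:
        # Newer layout (assumed): a single gzipped TSV, no tar wrapper.
        with gzip.open(path, mode="rt", encoding="utf-8") as fh:
            for line in fh:
                yield line.rstrip("\n")
```

Detecting the format from the file content rather than from the file name would let the same code path serve both older and newer releases, at the cost of opening the file twice.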
Okay great, will take a look!