Changed GTDB metadata naming and format

Open apduncan opened this issue 1 year ago • 2 comments

I was attempting to map from NCBI to GTDB taxonomy, but when building the translation, multitax was unable to download the GTDB metadata:

from multitax import GtdbTx, NcbiTx

ncbi = NcbiTx()
gtdb = GtdbTx()

ncbi.build_translation(gtdb)

Exception: One or more files could not be downloaded: https://data.gtdb.ecogenomic.org/releases/latest/ar53_metadata.tar.gz, https://data.gtdb.ecogenomic.org/releases/latest/bac120_metadata.tar.gz

For r214.1, the metadata is no longer a tarball; it appears to be a gzipped TSV: bac120_metadata.tsv.gz, ar53_metadata.tsv.gz. It looks like this would need different handling in build_translation, as well as in the code that extracts tar members.

I'd be happy to put together a pull request to fix, if you're interested.

apduncan, Feb 27 '24 09:02

Thanks for reporting. Indeed, they changed it a while ago. A PR would be great! You have to update the URLs and the parsing procedure, and the download_files function should be generalized to handle gzip-only files. A few days ago I fixed this exact bug in another tool; you can use it as an example.
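
As a rough illustration of what "generalized" could mean here, the sketch below reads lines from either a gzipped tarball or a gzip-only TSV. It is not multitax's actual code: `iter_metadata_lines` is a hypothetical helper, and choosing the reader from the filename suffix is an assumption made for this example. It takes a local path and does no downloading.

```python
import gzip
import io
import tarfile


def iter_metadata_lines(path):
    """Yield text lines from a metadata file (hypothetical helper).

    Assumption: the filename suffix tells us whether the file is a
    gzipped tarball (.tar.gz / .tgz) or a gzip-only file (e.g. .tsv.gz).
    """
    if path.endswith((".tar.gz", ".tgz")):
        # Old layout: a tarball wrapping one or more TSV members.
        with tarfile.open(path, mode="r:gz") as tar:
            for member in tar:
                if not member.isfile():
                    continue  # skip directories and other non-file entries
                extracted = tar.extractfile(member)
                with io.TextIOWrapper(extracted, encoding="utf-8") as text:
                    yield from text
    else:
        # New layout: the TSV is gzip-compressed directly, no tar wrapper.
        with gzip.open(path, mode="rt", encoding="utf-8") as text:
            yield from text
```

With something like this, the parsing step could stay the same for both layouts, since it only sees lines of text.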

pirovc, Feb 27 '24 16:02

Okay great, will take a look!

apduncan, Feb 27 '24 16:02