sourmash
how to handle suppressed records in databases?
GCA_905332505.2 is part of gtdb-rs207 (https://gtdb.ecogenomic.org/genome?gid=GCA_905332505.2), but has been suppressed (see https://www.ncbi.nlm.nih.gov/assembly/GCA_905332505.2).
Genome/proteome download from NCBI fails (due to suppression).
Since wort sketches files as they become available, I believe we had genomic signatures available to include in our database. We do not have the same luxury for our protein database.
If we use the same taxonomy file between genome and proteome databases, there will be a "missing" identifier in the protein database. I think this might affect taxonomy functions?
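A minimal sketch of how one might check for such "missing" identifiers before running taxonomy functions. It assumes the taxonomy file is a CSV whose identifier column is named `ident`, and that the set of identifiers actually present in the protein database has already been collected by some other means. The function name `missing_idents` and the accession `GCF_000000001.1` are hypothetical placeholders, not part of sourmash.

```python
import csv
import io


def missing_idents(taxonomy_csv, database_idents, ident_column="ident"):
    """Return taxonomy identifiers that have no entry in the database.

    taxonomy_csv: an open file-like object for the taxonomy CSV.
    database_idents: iterable of identifiers present in the database.
    ident_column: name of the identifier column (assumed to be "ident").
    """
    reader = csv.DictReader(taxonomy_csv)
    tax_idents = {row[ident_column] for row in reader}
    return sorted(tax_idents - set(database_idents))


# Tiny made-up taxonomy table; the second accession is a placeholder.
example_taxonomy = io.StringIO(
    "ident,superkingdom\n"
    "GCA_905332505.2,d__Bacteria\n"
    "GCF_000000001.1,d__Bacteria\n"
)

# Pretend the protein database only contains the placeholder genome.
protein_db_idents = {"GCF_000000001.1"}

missing = missing_idents(example_taxonomy, protein_db_idents)
print(missing)
```

Running a check like this against both the genomic and protein databases would show which identifiers the shared taxonomy file lists that only one of the databases contains.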
I'm sure this won't be the only time this happens -- would be nice to handle this sort of case safely.
https://github.com/kblin/ncbi-genome-download/issues/138
It seems others have had this issue as well. I can't find the assembly_summary_historical.txt file suggested to have download information.
As a species representative, this genome will be downloadable from the GTDB ftp
So I guess we download the whole thing...and just take the one little genome we want?
https://twitter.com/apcamargo_/status/1529881238164492289?s=20&t=aAe7UmO9hp3tVZgbyeebAw
> kblin/ncbi-genome-download#138
> It seems others have had this issue as well. I can't find the assembly_summary_historical.txt file suggested to have download information

I think the name changed to assembly_summary_genbank_historical.txt
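If that file is available locally, a sketch like the following could list the suppressed accessions in it. This assumes the usual NCBI assembly summary layout: tab-separated columns, `#`-prefixed comment lines, a header line naming `assembly_accession` and `version_status`, and `suppressed` as one of the status values. The function name `suppressed_accessions` and the sample rows are made up for illustration.

```python
def suppressed_accessions(lines):
    """Collect accessions whose version_status is 'suppressed'.

    lines: iterable of text lines from an assembly summary file.
    The column names are read from the '#'-prefixed header line
    (assumed layout), so column order does not matter.
    """
    header = None
    suppressed = set()
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            fields = line.lstrip("#").strip().split("\t")
            if "assembly_accession" in fields:
                header = fields
            continue
        if header is None or not line:
            continue
        record = dict(zip(header, line.split("\t")))
        if record.get("version_status") == "suppressed":
            suppressed.add(record["assembly_accession"])
    return suppressed


# Made-up two-column excerpt; real files have many more columns.
example_lines = [
    "#   See README\n",
    "# assembly_accession\tversion_status\n",
    "GCA_905332505.2\tsuppressed\n",
    "GCA_000000002.1\treplaced\n",
]

print(suppressed_accessions(example_lines))
```

For a real file, the same function can be called on an open file handle, for example `with open(path) as fh: suppressed_accessions(fh)`.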
this genome on farm: /home/tereiter/gtdb_genomes_reps_rs207/gtdb_genomes_reps_r207/GCA/905/332/505/GCA_905332505.2_genomic.fna.gz
Running into something similar here. I have 131 samples, with ~950 species identified across them using sourmash gather, 188 of which have been suppressed from NCBI for various reasons. That's a big chunk! I used gtdb-rs214-reps.
Is it because these were suppressed after the database was prepared? Since I need to fetch the genomes of the species I find in my samples, would the best way to avoid this be to create the reference database myself?
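One way to see up front which matches can no longer be fetched from NCBI is to split the gather results by a known set of suppressed accessions. The sketch below assumes the gather CSV has a `name` column whose first space-separated token is the accession, and that the suppressed set was built separately. `split_matches`, the sample rows, and the placeholder accession `GCF_000000001.1` are hypothetical.

```python
import csv
import io


def split_matches(gather_csv, suppressed):
    """Partition gather matches into downloadable and suppressed lists.

    gather_csv: open file-like object for a gather results CSV.
    suppressed: set of accessions known to be suppressed.
    """
    available, unavailable = [], []
    for row in csv.DictReader(gather_csv):
        # Assumption: the accession is the first token of the match name.
        ident = row["name"].split(" ")[0]
        if ident in suppressed:
            unavailable.append(ident)
        else:
            available.append(ident)
    return available, unavailable


# Made-up gather output with only the column used here.
example_gather = io.StringIO(
    "name\n"
    "GCA_905332505.2 example genome A\n"
    "GCF_000000001.1 example genome B\n"
)

available, unavailable = split_matches(example_gather, {"GCA_905332505.2"})
print(available, unavailable)
```

The suppressed list could then be routed to an alternative source, such as the GTDB downloads mentioned earlier in the thread, instead of NCBI.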