how to handle suppressed records in databases?

Open bluegenes opened this issue 2 years ago • 5 comments

GCA_905332505.2 is part of gtdb-rs207 (https://gtdb.ecogenomic.org/genome?gid=GCA_905332505.2), but has been suppressed (see https://www.ncbi.nlm.nih.gov/assembly/GCA_905332505.2).

Genome/proteome download from NCBI fails (due to suppression).

Since wort sketches files as they become available, I believe we had genomic signatures available to include in our database. We do not have the same luxury for our protein database.

If we use the same taxonomy file between genome and proteome databases, there will be a "missing" identifier in the protein database. I think this might affect taxonomy functions?

I'm sure this won't be the only time this happens -- would be nice to handle this sort of case safely.
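One way to at least detect the mismatch up front is to compare the identifiers in the taxonomy file against the identifiers actually present in each database. The sketch below is plain Python, not a sourmash API: the function name `missing_idents`, the `ident` column name, and the toy rows are all assumptions for illustration.

```python
import csv
import io


def missing_idents(taxonomy_csv, database_idents, ident_column="ident"):
    """Return taxonomy identifiers that have no record in the database.

    taxonomy_csv: an open text file (or file-like object) with a header row.
    database_idents: iterable of identifiers present in the database.
    ident_column: name of the identifier column (assumed to be "ident").
    """
    reader = csv.DictReader(taxonomy_csv)
    tax_idents = {row[ident_column] for row in reader}
    return sorted(tax_idents - set(database_idents))


# Toy data: the lineage values are placeholders, not real lineages.
taxonomy = io.StringIO(
    "ident,lineage\n"
    "GCA_905332505.2,placeholder\n"
    "GCF_000000000.1,placeholder\n"
)
protein_db_idents = ["GCF_000000000.1"]

print(missing_idents(taxonomy, protein_db_idents))
```

Running a check like this for the protein database before building would list the suppressed accessions, so they could be dropped from the shared taxonomy file or handled explicitly.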

bluegenes avatar May 05 '22 17:05 bluegenes

https://github.com/kblin/ncbi-genome-download/issues/138

It seems others have had this issue as well. I can't find the assembly_summary_historical.txt file suggested to have download information

taylorreiter avatar May 26 '22 16:05 taylorreiter

As a species representative, this genome will be downloadable from the GTDB ftp

So I guess we download the whole thing... and just take the one little genome we want?

https://twitter.com/apcamargo_/status/1529881238164492289?s=20&t=aAe7UmO9hp3tVZgbyeebAw

data.ace.uq.edu.au/public/gtdb/data/releases/release207

taylorreiter avatar May 26 '22 21:05 taylorreiter

kblin/ncbi-genome-download#138

It seems others have had this issue as well. I can't find the assembly_summary_historical.txt file suggested to have download information

I think the name changed to assembly_summary_genbank_historical.txt
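If that file is in hand, looking up a suppressed accession could go roughly like this. This is a sketch under assumptions: that the file is tab-separated, that the last `#` line before the data holds the column names, and that there is an `assembly_accession` column. The function name `find_assembly` and the toy rows are made up for illustration.

```python
def find_assembly(lines, accession):
    """Return the row for `accession` as a dict, or None if it is absent.

    lines: iterable of text lines, e.g. an open file handle.
    Assumes tab-separated fields and '#'-prefixed header lines, with the
    last '#' line before the data naming the columns.
    """
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Keep overwriting, so the last comment line wins as the header.
            header = line.lstrip("#").strip().split("\t")
            continue
        if header is None:
            continue
        row = dict(zip(header, line.split("\t")))
        if row.get("assembly_accession") == accession:
            return row
    return None


# Toy input with placeholder values, not real NCBI records.
toy = [
    "#   See the README for a description of the columns\n",
    "# assembly_accession\tversion_status\tftp_path\n",
    "GCA_000000000.1\tsuppressed\tplaceholder\n",
]
print(find_assembly(toy, "GCA_000000000.1"))
```

With the real file, `lines` would be the open file handle, and the returned dict could then be inspected for whatever download information the file carries.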

luizirber avatar May 27 '22 19:05 luizirber

this genome on farm: /home/tereiter/gtdb_genomes_reps_rs207/gtdb_genomes_reps_r207/GCA/905/332/505/GCA_905332505.2_genomic.fna.gz
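For pulling other genomes out of the same download, the directory layout can be inferred from the path above: the accession prefix, then the nine digits split into groups of three, then the file itself. This is a sketch based only on that one path, so the layout and the `_genomic.fna.gz` suffix are assumptions for any other genome, and `accession_to_relpath` is a made-up helper name.

```python
import re


def accession_to_relpath(accession, suffix="_genomic.fna.gz"):
    """Build the relative path for an accession such as GCA_905332505.2.

    Layout inferred from a single example path:
    <prefix>/<3 digits>/<3 digits>/<3 digits>/<accession><suffix>
    """
    match = re.fullmatch(r"(GC[AF])_(\d{9})\.(\d+)", accession)
    if match is None:
        raise ValueError(f"unexpected accession format: {accession!r}")
    prefix, digits, _version = match.groups()
    parts = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join([prefix, *parts, accession + suffix])


print(accession_to_relpath("GCA_905332505.2"))
# GCA/905/332/505/GCA_905332505.2_genomic.fna.gz
```

Joining the result onto the top-level directory of the extracted archive should give the full path, provided the other genomes follow the same layout.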

bluegenes avatar Jun 01 '22 03:06 bluegenes

Running into something similar here: I have 131 samples, with ~950 species identified across them using sourmash gather, 188 of which have been suppressed from NCBI for various reasons. That's a big chunk! I used gtdb-rs214-reps.

Is it because these were suppressed after the db was prepared? At this point, if I need to avoid this because I need to fetch the genomes of species I find in my sample, would the best solution be to create this reference database myself?

jorondo1 avatar Jan 30 '24 15:01 jorondo1