ncbi-genome-download

problems downloading representative genome

Open jotech opened this issue 5 years ago • 4 comments

I'm trying to download the RefSeq representative genome for Vibrio lentus, as listed here https://www.ncbi.nlm.nih.gov/genome/?term=vibrio+lentus%5Borgn%5D, and it also has the corresponding RefSeq category http://tiny.cc/hncrdz

But when I try to download it, I get an error:

ncbi-genome-download --dry-run -R representative --taxid 136468 bacteria
ncbi-genome-download --dry-run -R representative --genus "Vibrio lentus" bacteria
ERROR: No downloads matched your filter. Please check your options.

Besides this, ncbi-genome-download --dry-run --taxid 136468 bacteria shows me all 87 available genomes, but I'm looking for the representative one only. What am I missing?

jotech, Oct 01 '19 18:10

This appears to be a problem with the use of -R representative, but more specifically with the actual data for that entry...

If you query the assembly_summary file for that genome (wget ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt):

$ grep "GCF_001691195.1" assembly_summary_refseq.txt
GCF_001691195.1 PRJNA224116     SAMN04867935    MAKA00000000.1  na      136468  136468  Vibrio lentus   strain=5F79             latest  Scaffold        Major   Full    2016/07/21      ASM169119v1     Massachusetts Institute of Technology     GCA_001691195.1 identical       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/691/195/GCF_001691195.1_ASM169119v1

If we compare this with an entry that is tagged "representative genome" (shown next), the Vibrio lentus entry has na in the fifth column, where "representative genome" would be expected. I was able to download the genome by providing the accession directly with -A GCA_001691195.1.

$ grep -i "representative" assembly_summary_refseq.txt | head -1
GCF_000001765.3 PRJNA18793      SAMN00779672    AADE00000000.1  representative genome   46245   7237    Drosophila pseudoobscura pseudoobscura  strain=MV2-25           latest  Chromosome      Major   Full    2013/04/11      Dpse_3.0  Baylor College of Medicine      GCA_000001765.2 identical       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/765/GCF_000001765.3_Dpse_3.0

Perhaps Kai knows differently, but this appears to be an issue with the actual NCBI records?

Closer inspection of the summary file shows that all 87 genomes have na in that column. In that case, I don't think this is something this tool will be able to help you with.
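For anyone who wants to repeat that check, here is a minimal Python sketch. It assumes the tab-separated layout of assembly_summary_refseq.txt seen in the grep output above, with the RefSeq category in the fifth column and the species taxid in the seventh. The function name refseq_categories is made up for illustration and is not part of ncbi-genome-download.

```python
from collections import Counter


def refseq_categories(lines, species_taxid):
    """Tally the RefSeq category column for rows with the given species taxid.

    Assumes tab-separated rows where field 5 is the RefSeq category and
    field 7 is the species taxid; header lines start with '#'.
    """
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip rows too short to hold the columns we need
        if fields[6] == species_taxid:
            counts[fields[4]] += 1
    return counts


# Two rows trimmed from the grep output above (first nine columns only).
sample = [
    "GCF_001691195.1\tPRJNA224116\tSAMN04867935\tMAKA00000000.1\tna"
    "\t136468\t136468\tVibrio lentus\tstrain=5F79",
    "GCF_000001765.3\tPRJNA18793\tSAMN00779672\tAADE00000000.1"
    "\trepresentative genome\t46245\t7237"
    "\tDrosophila pseudoobscura pseudoobscura\tstrain=MV2-25",
]

print(dict(refseq_categories(sample, "136468")))
```

Against the full file you would pass the open file handle instead of sample, e.g. refseq_categories(open("assembly_summary_refseq.txt"), "136468"). If every matching row lands under na, filtering on the representative category cannot match anything.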

jrjhealey, Oct 01 '19 22:10

thanks for your answer!

The missing representative tag is really strange because it is actually there in the source assembly report:

curl -s ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/691/195/GCF_001691195.1_ASM169119v1/GCF_001691195.1_ASM169119v1_assembly_report.txt | grep "RefSeq category"
# RefSeq category: Representative Genome

It seems there are inconsistent assembly reports?
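The same comparison can be sketched in Python, assuming the assembly report keeps its metadata in "# Key: value" header lines like the one the curl command printed. The helper name report_refseq_category and the sample text are illustrative only; the sample is an abbreviated stand-in, not the full report.

```python
def report_refseq_category(report_text):
    """Return the 'RefSeq category' value from an assembly report header.

    Assumes header lines look like '# Key: value'. Returns None when no
    such line is present.
    """
    for line in report_text.splitlines():
        if not line.startswith("#"):
            continue  # only header comment lines carry the metadata
        key, sep, value = line.lstrip("# ").partition(":")
        if sep and key.strip().lower() == "refseq category":
            return value.strip()
    return None


# Abbreviated stand-in for the report header; only the second line is
# copied from the curl output above.
sample_report = (
    "# Assembly name: ASM169119v1\n"
    "# RefSeq category: Representative Genome\n"
)

summary_value = "na"  # the fifth column of the assembly summary row
report_value = report_refseq_category(sample_report)

# Compare case-insensitively, since the two files capitalise differently.
consistent = (report_value or "na").lower() == summary_value.lower()
print(report_value, consistent)
```

Here consistent comes out False, which is the mismatch described above: the per-assembly report carries the tag, while the summary file does not.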

jotech, Oct 02 '19 12:10

That’s certainly how I would interpret it. There may be a good reason for the na values in the assembly summary, but if there is, I don't know what it is!

I think it would be worth contacting NCBI over this though in case it is a mistake.

jrjhealey, Oct 03 '19 06:10

I concur with @jrjhealey, I think this is just an issue on the NCBI side of things.

kblin, Oct 03 '19 07:10