ncbi-genome-download
Problems downloading representative genome
I'm trying to download the refseq representative genome for Vibrio lentus
as it is listed here
https://www.ncbi.nlm.nih.gov/genome/?term=vibrio+lentus%5Borgn%5D
and it also has the corresponding refseq category
http://tiny.cc/hncrdz
But when I try to download the genome, both of the following commands fail:
ncbi-genome-download --dry-run -R representative --taxid 136468 bacteria
ncbi-genome-download --dry-run -R representative --genus "Vibrio lentus" bacteria
ERROR: No downloads matched your filter. Please check your options.
Besides this, ncbi-genome-download --dry-run --taxid 136468 bacteria
shows me all 87 available genomes but I'm looking for the representative only.
What am I missing?
This appears to be a problem with the use of -R representative, but more specifically with the actual data for that entry...
If you query the assembly_summary file for that genome (wget ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt):
$ grep "GCF_001691195.1" assembly_summary_refseq.txt
GCF_001691195.1 PRJNA224116 SAMN04867935 MAKA00000000.1 na 136468 136468 Vibrio lentus strain=5F79 latest Scaffold Major Full 2016/07/21 ASM169119v1 Massachusetts Institute of Technology GCA_001691195.1 identical ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/691/195/GCF_001691195.1_ASM169119v1
If we compare this with an entry which has representative, there's an na where representative would be expected. I was able to download the genome successfully by providing the accession directly with -A GCA_001691195.1.
$ grep -i "representative" assembly_summary_refseq.txt | head -1
GCF_000001765.3 PRJNA18793 SAMN00779672 AADE00000000.1 representative genome 46245 7237 Drosophila pseudoobscura pseudoobscura strain=MV2-25 latest Chromosome Major Full 2013/04/11 Dpse_3.0 Baylor College of Medicine GCA_000001765.2 identical ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/765/GCF_000001765.3_Dpse_3.0
Perhaps Kai knows differently, but this appears to be an issue with the actual NCBI records?
Closer inspection of the summary file shows that all 87 genomes have na for that column. In that case, I don't think this is something this tool will be able to help you with.
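For anyone wanting to repeat that check, here is a rough sketch that tallies the category column for one species in a local copy of assembly_summary_refseq.txt. It assumes the file is tab-separated, with the RefSeq category in column 5 and the species taxid in column 7, which matches the grep output above. The function name count_refseq_categories is made up for illustration.

```shell
# Tally the RefSeq category values for every assembly of one species.
# Assumed layout (tab-separated): column 5 = RefSeq category,
# column 7 = species taxid. The function name is hypothetical.
count_refseq_categories() {
    # $1 = path to a local assembly_summary_refseq.txt, $2 = species taxid
    awk -F '\t' -v taxid="$2" '
        /^#/ { next }                   # skip header/comment lines
        $7 == taxid { counts[$5]++ }    # count each category value seen
        END { for (c in counts) printf "%s\t%d\n", c, counts[c] }
    ' "$1" | sort
}

# Example call, assuming the summary file was fetched with wget as above:
#   count_refseq_categories assembly_summary_refseq.txt 136468
```

If every assembly for the taxid comes back under na, a -R representative filter has nothing to match for that species.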
Thanks for your answer!
The missing representative tag is really strange because it is actually there in the source assembly report:
curl -s ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/691/195/GCF_001691195.1_ASM169119v1/GCF_001691195.1_ASM169119v1_assembly_report.txt | grep "RefSeq category"
# RefSeq category: Representative Genome
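To make the comparison repeatable, here is a minimal sketch that puts the two values side by side for one accession. It assumes both files have already been saved locally (the summary via wget, the report via curl as above), that the summary is tab-separated with the category in column 5, and that the report contains a "# RefSeq category:" header line like the one just shown. The function names are hypothetical.

```shell
# Category recorded in the summary file for one accession (column 5, assumed).
summary_category() {
    # $1 = summary file, $2 = assembly accession
    awk -F '\t' -v acc="$2" '$1 == acc { print $5 }' "$1"
}

# Category recorded in the per-assembly report header.
report_category() {
    # $1 = assembly report file; drop any carriage returns from the value
    sed -n 's/^# RefSeq category:[[:space:]]*//p' "$1" | tr -d '\r'
}

# Compare both values case-insensitively and print the result.
compare_categories() {
    # $1 = summary file, $2 = accession, $3 = assembly report file
    cmp_summary=$(summary_category "$1" "$2" | tr '[:upper:]' '[:lower:]')
    cmp_report=$(report_category "$3" | tr '[:upper:]' '[:lower:]')
    if [ "$cmp_summary" = "$cmp_report" ]; then
        printf "consistent: %s\n" "$cmp_summary"
    else
        printf "mismatch: summary='%s' report='%s'\n" "$cmp_summary" "$cmp_report"
    fi
}

# Example call, assuming the report was saved as report.txt:
#   compare_categories assembly_summary_refseq.txt GCF_001691195.1 report.txt
```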
It seems there are inconsistent assembly reports?
That’s certainly how I would interpret that. There may be a good reason for the na values in the assembly summary, but if there is, I don't know what it is!
I think it would be worth contacting NCBI over this though in case it is a mistake.
I concur with @jrjhealey, I think this is just an issue on the NCBI side of things.