--include none and --chromosomes all
Hello,
I would like to use the --chromosomes all option when I download a genome, so that I only get the chromosomes. I noticed that using this option also automatically downloads the complete genome fasta file (I think because --include genome appears to be the default). For example, when I run this command:
datasets download genome accession GCA_940337035.1 --chromosomes all --filename TEST.zip
these are the resulting files:
Archive: TEST.zip
inflating: README.md
inflating: ncbi_dataset/data/assembly_data_report.jsonl
inflating: ncbi_dataset/data/GCA_940337035.1/GCA_940337035.1_PGI_AGRIOTES_LIN_V1_genomic.fna
inflating: ncbi_dataset/data/GCA_940337035.1/chr1.fna
inflating: ncbi_dataset/data/GCA_940337035.1/chr2.fna
inflating: ncbi_dataset/data/GCA_940337035.1/chr3.fna
inflating: ncbi_dataset/data/GCA_940337035.1/chr4.fna
inflating: ncbi_dataset/data/GCA_940337035.1/chr5.fna
inflating: ncbi_dataset/data/GCA_940337035.1/chr6.fna
inflating: ncbi_dataset/data/GCA_940337035.1/chr7.fna
inflating: ncbi_dataset/data/GCA_940337035.1/chr8.fna
inflating: ncbi_dataset/data/GCA_940337035.1/chr9.fna
inflating: ncbi_dataset/data/GCA_940337035.1/chr10.fna
inflating: ncbi_dataset/data/GCA_940337035.1/unplaced.scaf.fna
I do not want to download GCA_940337035.1_PGI_AGRIOTES_LIN_V1_genomic.fna.
I thought that trying --chromosomes all --include none would allow me to download the fasta files of just the scaffolds designated as chromosomes, but it doesn't download any sequence.
Do you have any suggestions on how to download just the chromosome scaffolds without having to filter based on the info in the sequence report? I am using datasets v15.29.0
Thank you! Darrin
Hi @conchoecia,
Thanks for opening this issue.
"I noticed that using this option also automatically downloads the complete genome fasta file"
This is a bug. We will try to fix this soon. In the meantime, I suggest that you try the following to only download the chromosome sequences:
- Download a dehydrated package
datasets download genome accession GCA_940337035.1 --chromosomes all --filename TEST.zip --dehydrated
- Unzip the downloaded package
unzip TEST.zip -d TEST
- Rehydrate the extracted package, using --match to selectively download filenames that include "chr"
datasets rehydrate --directory TEST --match chr
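The three steps above can be sketched as a single shell function. This is only a convenience wrapper under stated assumptions: `fetch_chromosomes`, `run_cmd`, and the `DRY_RUN` variable are hypothetical names introduced here, not part of datasets. The datasets and unzip invocations are exactly the ones listed above.

```shell
#!/bin/bash
# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run_cmd() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the dehydrate / unzip / rehydrate workaround.
# $1 = assembly accession, $2 = output name (defaults to TEST).
fetch_chromosomes() {
    local accession="$1"
    local outdir="${2:-TEST}"
    # Step 1: download a dehydrated package (metadata and fetch list only).
    run_cmd datasets download genome accession "$accession" \
        --chromosomes all --filename "${outdir}.zip" --dehydrated &&
    # Step 2: unzip the package.
    run_cmd unzip "${outdir}.zip" -d "$outdir" &&
    # Step 3: rehydrate only the files whose names include "chr".
    run_cmd datasets rehydrate --directory "$outdir" --match chr
}
```

Running `DRY_RUN=1 fetch_chromosomes GCA_940337035.1 TEST` prints the three commands without contacting NCBI, which is a cheap way to check the arguments before a real download.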
Thanks again for opening this issue. I'll comment on this thread when we have a bug fix ready.
Best, Eric
Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets Sequence Enhancements, Tools and Delivery (SeqPlus) NIH/NLM/NCBI
Hi @ericcox1,
This solution works well - thanks! I will adjust my scripts to do this instead of parsing the sequence report .json file.
-Darrin
Update: I found that doing this process pulls scaffolds that are known to be localized to specific chromosomes, but are not actually placed.
For example, there is a bird genome, GCA_027574665.1, that has named chromosomes with the properties {"assignedMoleculeLocationType":"Chromosome", "role":"assembled-molecule"}. It also has scaffolds that are known to belong to a specific chromosome but have not been placed within it. These scaffolds are all less than 1 Mbp, and have the properties {"assignedMoleculeLocationType":"Chromosome", "role":"unlocalized-scaffold"}. I'm not sure yet if I want to exclude the second type for my analysis, but this would be a good reason to parse the seq-report from datasets download genome accession GCA_027574665.1 --include seq-report.
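The role-based split described above could be sketched roughly as follows. This is a minimal sketch, assuming the report is compact JSON Lines with the snake_case keys role and genbank_accession, as in the datasets summary genome ... --report sequence --as-json-lines output quoted later in this thread. The camelCase keys mentioned above would need the patterns adjusted, and a real JSON parser would be more robust than grep and sed. The function name `assembled_molecule_accessions` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read a JSON Lines sequence report on stdin and print
# the GenBank accession of every record whose role is "assembled-molecule".
# Assumes compact JSON (no spaces after colons) and snake_case key names.
assembled_molecule_accessions() {
    grep '"role":"assembled-molecule"' |
        sed -n 's/.*"genbank_accession":"\([^"]*\)".*/\1/p'
}
```

Records with the role unlocalized-scaffold are dropped by the grep step, so the output lists only the named chromosomes.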
Hi @ericcox1,
I identified a place where this breaks - for some assemblies, rehydrating still downloads the entire genome assembly fasta file, in addition to the chromosome-scale scaffolds as individual files as I requested.
Here is a minimal example that uses the latest release of datasets:
#!/bin/bash
# For the genome assembly GCA_933207985.1, downloading the chromosome-scale scaffolds appears to result in two errors:
# - The first error is that all of the chromosome-scale scaffolds downloaded twice.
# - The second error is that all of the non-chromosome-scale scaffolds downloaded more than once.
ASSEMBLY=GCA_933207985.1
# Set up datasets
curl -o datasets 'https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets'
chmod u+x ./datasets
# Using the example from this GitHub issue: https://github.com/ncbi/datasets/issues/298
# It works by downloading a dehydrated dataset, then rehydrating only the chromosome-scale scaffolds
./datasets download genome accession "${ASSEMBLY}" --chromosomes all --filename TEST.zip --dehydrated
unzip TEST.zip -d TEST
./datasets rehydrate --directory TEST --match chr
The resulting files are:
./TEST/ncbi_dataset/data/GCA_933207985.1/chr05.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr02.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr13.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr14.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr03.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr04.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr12.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr11.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr09.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr07.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr10.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr01.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr06.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/GCA_933207985.1_aPelCul1.1_chrom_genomic.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr08.fna
However, the file ./TEST/ncbi_dataset/data/GCA_933207985.1/GCA_933207985.1_aPelCul1.1_chrom_genomic.fna should not be present, based on how I've seen the example work with other assembly accessions. I am not sure whether this happens for more than one accession. Thanks!
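One way to spot this case after rehydration is to list any FASTA file whose name does not start with "chr". Note that the unwanted filename above itself contains the substring "chr" (in _chrom_genomic), so if --match selects filenames that include "chr", as described earlier in this thread, it would also select this file. The sketch below is a post-hoc check, not a fix. The function name `unexpected_fasta` is hypothetical, and the chr*.fna naming is assumed from the file listings in this thread.

```shell
#!/bin/bash
# Hypothetical helper: list .fna files under a rehydrated package directory
# whose basename does not start with "chr".
# $1 = directory to search (for example TEST).
unexpected_fasta() {
    local dir="$1"
    find "$dir" -type f -name '*.fna' ! -name 'chr*.fna' | sort
}
```

For the listing above, `unexpected_fasta TEST` would print only the path of the _chrom_genomic.fna file, which could then be deleted or skipped by downstream scripts.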
I found another place where this breaks. For some assemblies that do have chromosome-scale scaffolds, running this produces the error 'Found no files for rehydration'. The assembly that I found that causes this error is GCF_905220415.1.
Here is the minimal example:
#!/bin/bash
# For the record GCF_905220415.1, there is some problem where the final fasta file is empty when using this method.
# Closer inspection reveals that the database correctly identifies certain scaffolds as being chromosome-scale, but
# they are not downloaded correctly
ASSEMBLY=GCF_905220415.1
# Set up datasets
curl -o datasets 'https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets'
chmod u+x ./datasets
# Check if there are chromosome-scale scaffolds
./datasets summary genome accession "${ASSEMBLY}" --report sequence --as-json-lines | grep 'Chromosome' | head -5
# Remove old files from previous runs
rm -rf TEST/ TEST.zip
# Using the example from this GitHub issue: https://github.com/ncbi/datasets/issues/298
# It works by downloading a dehydrated dataset, then rehydrating only the chromosome-scale scaffolds
./datasets download genome accession "${ASSEMBLY}" --chromosomes all --filename TEST.zip --dehydrated
unzip TEST.zip -d TEST
./datasets rehydrate --directory TEST --match chr
Here are the results of running the above script, showing that there are chromosome-scale scaffolds, but the rehydration did not work.
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 17.5M 100 17.5M 0 0 5424k 0 0:00:03 0:00:03 --:--:-- 5424k
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"1","gc_count":"5143305","gc_percent":34,"genbank_accession":"HG991959.1","length":15086434,"refseq_accession":"NC_059537.1","role":"assembled-molecule"}
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"2","gc_count":"4472954","gc_percent":34,"genbank_accession":"HG991960.1","length":13248411,"refseq_accession":"NC_059538.1","role":"assembled-molecule"}
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"3","gc_count":"4471753","gc_percent":34,"genbank_accession":"HG991961.1","length":13170806,"refseq_accession":"NC_059539.1","role":"assembled-molecule"}
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"4","gc_count":"4360384","gc_percent":34,"genbank_accession":"HG991962.1","length":12846590,"refseq_accession":"NC_059540.1","role":"assembled-molecule"}
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"5","gc_count":"4238058","gc_percent":33.5,"genbank_accession":"HG991963.1","length":12694599,"refseq_accession":"NC_059541.1","role":"assembled-molecule"}
Collecting 1 genome record [================================================] 100% 1/1
Downloading: TEST.zip 3.98kB valid zip structure -- files not checked
Validating package [================================================] 100% 4/4
Archive: TEST.zip
inflating: TEST/README.md
inflating: TEST/ncbi_dataset/data/assembly_data_report.jsonl
inflating: TEST/ncbi_dataset/fetch.txt
inflating: TEST/ncbi_dataset/data/dataset_catalog.json
Found no files for rehydration
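To narrow down where the chromosome files go missing, it may help to inspect the extracted fetch.txt before rehydrating. The sketch below counts the lines in that file that contain the match string. It assumes fetch.txt lists one downloadable file per line with the target path somewhere on the line; the thread does not show the file's contents, so this is an assumption. The function name `count_fetch_matches` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count lines in a fetch list that contain a pattern.
# $1 = path to fetch.txt, $2 = pattern (defaults to "chr").
count_fetch_matches() {
    local fetch_file="$1"
    local pattern="${2:-chr}"
    # grep -c prints the number of matching lines; it exits non-zero when
    # that number is 0, so "|| true" keeps the function usable under set -e.
    grep -c -- "$pattern" "$fetch_file" || true
}
```

A result of 0 from `count_fetch_matches TEST/ncbi_dataset/fetch.txt` would suggest the dehydrated package never listed per-chromosome files, rather than the rehydrate step failing to match them.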