datasets --include none and --chromosomes all


Open conchoecia opened this issue 6 months ago • 4 comments

Hello,

I would like to use the --chromosomes all option when I download a genome so that I get only the chromosomes. I noticed that using this option also automatically downloads the complete genome FASTA file (I think because --include genome appears to be the default). For example, when I run this command: datasets download genome accession GCA_940337035.1 --chromosomes all --filename TEST.zip, these are the resulting files:

Archive:  TEST.zip
  inflating: README.md
  inflating: ncbi_dataset/data/assembly_data_report.jsonl
  inflating: ncbi_dataset/data/GCA_940337035.1/GCA_940337035.1_PGI_AGRIOTES_LIN_V1_genomic.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr1.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr2.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr3.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr4.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr5.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr6.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr7.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr8.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr9.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr10.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/unplaced.scaf.fna

I do not want to download GCA_940337035.1_PGI_AGRIOTES_LIN_V1_genomic.fna.

I thought that trying --chromosomes all --include none would allow me to download the fasta files of just the scaffolds designated as chromosomes, but it doesn't download any sequence.

Do you have any suggestions on how to download just the chromosome scaffolds without having to filter based on the info in the sequence report? I am using datasets v15.29.0

Thank you! Darrin

Dec 19 '23 15:12 conchoecia

Hi @conchoecia,

Thanks for opening this issue.

> I noticed that using this option also automatically downloads the complete genome fasta file

This is a bug. We will try to fix this soon. In the meantime, I suggest that you try the following to only download the chromosome sequences:

1. Download a dehydrated package: datasets download genome accession GCA_940337035.1 --chromosomes all --filename TEST.zip --dehydrated
2. Unzip the downloaded package: unzip TEST.zip -d TEST
3. Rehydrate the extracted package, using --match to selectively download filenames that include "chr": datasets rehydrate --directory TEST --match chr

Thanks again for opening this issue. I'll comment on this thread when we have a bug fix ready.

Best, Eric

Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets Sequence Enhancements, Tools and Delivery (SeqPlus) NIH/NLM/NCBI

Dec 19 '23 19:12 ericcox1

Hi @ericcox1,

This solution works well - thanks! I will adjust my scripts to do this instead of parsing the sequence report .json file.

-Darrin

Update: I found that this process also pulls scaffolds that are known to belong to specific chromosomes but are not actually placed on them.

For example, there is a bird genome, GCA_027574665.1, that has named chromosomes with the properties {"assignedMoleculeLocationType":"Chromosome", "role":"assembled-molecule"}. It also has pieces that are known to belong to a specific chromosome but have not been placed within it. These scaffolds are all less than 1 Mbp and have the properties {"assignedMoleculeLocationType":"Chromosome", "role":"unlocalized-scaffold"}. I'm not sure yet whether I want to exclude the second type from my analysis, but this would be a good reason to parse the sequence report from datasets download genome accession GCA_027574665.1 --include seq-report
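The role-based distinction described above can be sketched as a small filter over the sequence report in JSON Lines form. This is only an illustration: the function name filter_assembled_chromosomes is hypothetical, and it assumes one JSON object per line with no whitespace around the colons, as in the key/value pairs quoted above. It accepts both the camelCase and the snake_case spelling of the location-type key.

```shell
#!/bin/bash
# Sketch: keep only sequence-report records that are assembled chromosomes,
# dropping unlocalized scaffolds that are merely assigned to a chromosome.
# Assumption: one JSON object per line, no spaces around the colons.
# The function name is hypothetical, not part of the datasets CLI.
filter_assembled_chromosomes() {
    grep -F '"role":"assembled-molecule"' \
        | grep -E '"assigned(_molecule_location_type|MoleculeLocationType)":"Chromosome"'
}
```

It could be fed from a report command, for example ./datasets summary genome accession GCA_027574665.1 --report sequence --as-json-lines | filter_assembled_chromosomes. Matching on text rather than parsing the JSON keeps the sketch dependency-free, at the cost of relying on the exact formatting of the report.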

Dec 20 '23 12:12 conchoecia

Hi @ericcox1,

I identified a place where this breaks - for some assemblies, rehydrating still downloads the entire genome assembly fasta file, in addition to the chromosome-scale scaffolds as individual files as I requested.

Here is a minimal example that uses the latest release of datasets:

#!/bin/bash

# For the genome assembly GCA_933207985.1, it appears that downloading the chromosome-scale scaffolds resulted in two errors
#  - The first error is that all of the chromosome-scale scaffolds downloaded twice.
#  - The second error is that all of the non-chromosome-scale scaffolds downloaded more than once.

ASSEMBLY=GCA_933207985.1

# set up datasets
curl -o datasets 'ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets'
chmod u+x ./datasets

# Using the example from this github issue: https://github.com/ncbi/datasets/issues/298
# The way it works is by downloading a dehydrated dataset, then rehydrating it to fetch only the chromosome-scale scaffolds
./datasets download genome accession ${ASSEMBLY} --chromosomes all --filename TEST.zip --dehydrated
unzip TEST.zip -d TEST
./datasets rehydrate --directory TEST --match chr

The resulting files are:

./TEST/ncbi_dataset/data/GCA_933207985.1/chr05.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr02.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr13.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr14.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr03.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr04.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr12.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr11.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr09.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr07.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr10.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr01.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr06.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/GCA_933207985.1_aPelCul1.1_chrom_genomic.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr08.fna

However, the file ./TEST/ncbi_dataset/data/GCA_933207985.1/GCA_933207985.1_aPelCul1.1_chrom_genomic.fna should not be present, based on how I've seen the example work with other assembly accessions. I am not sure whether this happens for other accessions as well. Thanks!
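A possible explanation, which is an assumption and not confirmed in this thread, is that --match chr selects any filename that includes "chr", and the name of the unwanted file contains "chr" inside "chrom_genomic". As a stopgap, the per-chromosome files can be picked out after rehydration by requiring the filename to start with "chr". The sketch below uses a hypothetical function name and relies only on the directory layout shown in the listing above.

```shell
#!/bin/bash
# Sketch: list only the per-chromosome FASTA files (chr*.fna) in a rehydrated
# package, skipping files that merely contain "chr" elsewhere in their name.
# The function name is hypothetical; the layout follows the listing above.
list_chromosome_fastas() {
    local package_dir="$1"
    find "$package_dir/ncbi_dataset/data" -type f -name 'chr*.fna' | sort
}
```

For the package above, list_chromosome_fastas TEST would print the chr01.fna through chr14.fna paths and omit the _chrom_genomic.fna file. This does not avoid the extra download; it only keeps the unwanted file out of downstream steps.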

Dec 30 '23 17:12 conchoecia

I found another place where this breaks. Some assemblies, despite having chromosome-scale scaffolds, produce the error 'Found no files for rehydration' after running this. One assembly that triggers this error is GCF_905220415.1.

Here is the minimal example:

#!/bin/bash

# For the record GCF_905220415.1, there is some problem where the final fasta file is empty when using this method.
# Closer inspection reveals that the database correctly identifies certain scaffolds as being chromosome-scale, but
#  they are not downloaded correctly

ASSEMBLY=GCF_905220415.1

# set up datasets
curl -o datasets 'ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets'
chmod u+x ./datasets

# Check if there are chromosome-scale scaffolds
./datasets summary genome accession ${ASSEMBLY} --report sequence --as-json-lines | grep 'Chromosome' | head -5

# remove old files from previous runs
rm -rf TEST/ TEST.zip
# Using the example from this github issue: https://github.com/ncbi/datasets/issues/298
# The way it works is by downloading a dehydrated dataset, then rehydrating it to fetch only the chromosome-scale scaffolds
./datasets download genome accession ${ASSEMBLY} --chromosomes all --filename TEST.zip --dehydrated
unzip TEST.zip -d TEST
./datasets rehydrate --directory TEST --match chr

Here are the results of running the above script, showing that there are chromosome-scale scaffolds, but the rehydration did not work.

 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17.5M  100 17.5M    0     0  5424k      0  0:00:03  0:00:03 --:--:-- 5424k
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"1","gc_count":"5143305","gc_percent":34,"genbank_accession":"HG991959.1","length":15086434,"refseq_accession":"NC_059537.1","role":"assembled-molecule"}
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"2","gc_count":"4472954","gc_percent":34,"genbank_accession":"HG991960.1","length":13248411,"refseq_accession":"NC_059538.1","role":"assembled-molecule"}
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"3","gc_count":"4471753","gc_percent":34,"genbank_accession":"HG991961.1","length":13170806,"refseq_accession":"NC_059539.1","role":"assembled-molecule"}
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"4","gc_count":"4360384","gc_percent":34,"genbank_accession":"HG991962.1","length":12846590,"refseq_accession":"NC_059540.1","role":"assembled-molecule"}
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"5","gc_count":"4238058","gc_percent":33.5,"genbank_accession":"HG991963.1","length":12694599,"refseq_accession":"NC_059541.1","role":"assembled-molecule"}
Collecting 1 genome record [================================================] 100% 1/1
Downloading: TEST.zip    3.98kB valid zip structure -- files not checked
Validating package [================================================] 100% 4/4
Archive:  TEST.zip
  inflating: TEST/README.md
  inflating: TEST/ncbi_dataset/data/assembly_data_report.jsonl
  inflating: TEST/ncbi_dataset/fetch.txt
  inflating: TEST/ncbi_dataset/data/dataset_catalog.json
Found no files for rehydration
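
One way to narrow this down might be to look at the fetch.txt file that the dehydrated package unpacks and count how many of its lines contain the string passed to --match. The sketch below is a diagnostic only. The function name is hypothetical, and it assumes that fetch.txt lists one downloadable file per line, which is not confirmed in this thread.

```shell
#!/bin/bash
# Sketch: count the lines of a dehydrated package's fetch.txt that contain a
# fixed string. Assumption: fetch.txt has one downloadable file per line.
# The function name is hypothetical, not part of the datasets CLI.
count_fetch_matches() {
    local fetch_file="$1" pattern="$2"
    # grep -c prints 0 and exits non-zero when nothing matches; ignore the status.
    grep -c -F -- "$pattern" "$fetch_file" || true
}
```

For example, count_fetch_matches TEST/ncbi_dataset/fetch.txt chr prints the number of matching lines. A count of 0 would be consistent with the 'Found no files for rehydration' message, although how --match is applied internally is not stated here.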

Dec 31 '23 09:12 conchoecia
