datasets Include corresponding gca/gcf accession in the metadata of each sample

Is your feature request related to a problem? Please describe.

Thanks again for this amazing tool!

I download all influenza A sequences using

datasets download virus genome taxon 11320  --filename data.zip

I would like to group all influenza A segments that come from the same assembly/isolate. I have been using the isolate metadata field for this purpose, but that is a workaround and the field is not always available. I noticed that I could download all assemblies to see which samples are in each assembly (sadly, this information is not included in the assembly summary field), but I'm having issues with a download of this size. Having the gca accession in the sample metadata (data_report.jsonl) would really simplify this process and give me greater certainty that I have grouped segments together correctly.

Describe the solution you'd like

Each assembly contains at most 8 influenza segments that were sequenced together, and each assembly has a unique ID. For each nucleotide sequence that belongs to an assembly, I would like to see the gca/gcf accession as a metadata field in data_report.jsonl so that I can use it to group samples.

Thank you

Thanks for your feedback--your feature requests help improve NCBI Datasets.

Dec 05 '24 18:12 anna-parker

Hi anna-parker,

Thank you for the great suggestion! I’ve created a ticket for this feature request and will work on it soon.

All the best,

Nuala

Dec 05 '24 20:12 olearyna

Hey! Just checking whether this is still planned to be worked on soon. We still need it!

Nov 27 '25 10:11 fhennig

Perhaps an alternative to downloading the full assemblies is to download only the summary.

Get a summary of all assembly accessions for the taxon:

datasets summary genome taxon 11320 --as-json-lines --report ids_only \
  | dataformat tsv genome --fields accession --elide-header > data/11320-genome-accessions.tsv

Use the assembly accessions to get a summary of all linked GenBank accessions:

datasets summary genome accession --as-json-lines --inputfile data/11320-genome-accessions.tsv --report sequence \
  | dataformat tsv genome-seq --fields accession,genbank-seq-acc > data/11320-genome-genbank-accessions.tsv

I have not tested the speed, but the files are definitely smaller.

Dec 12 '25 18:12 joverlee521

Oh, I missed that the --report sequence option is available for taxon too. So this can just be a single step:

datasets summary genome taxon 11320 --as-json-lines --report sequence \
  | dataformat tsv genome-seq --fields accession,genbank-seq-acc > data/11320-genome-genbank-accessions.tsv
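
To get from that mapping to the grouping asked for in the original request, the TSV can be collapsed to one line per assembly. The following is only a rough sketch: `group_by_assembly` is a hypothetical helper name, and it assumes the file is tab-separated with a single header row, the assembly accession in the first column, and the GenBank sequence accession in the second.

```shell
# Hypothetical helper: collapse (assembly accession, GenBank accession) rows
# into one line per assembly: accession, segment count, comma-separated list.
# Assumes tab-separated input with exactly one header row.
group_by_assembly() {
  awk -F'\t' '
    NR == 1  { next }                      # skip the header row
    $2 == "" { next }                      # skip rows without a GenBank accession
    {
      if ($1 in segs) {
        segs[$1] = segs[$1] "," $2         # append to an assembly already seen
      } else {
        order[++n] = $1                    # remember first-seen order
        segs[$1] = $2
      }
      count[$1]++
    }
    END {
      for (i = 1; i <= n; i++)
        print order[i] "\t" count[order[i]] "\t" segs[order[i]]
    }
  ' "$1"
}
```

For example, `group_by_assembly data/11320-genome-genbank-accessions.tsv` prints one row per assembly. The count column makes it easy to spot assemblies with more than 8 sequences, and the accession list could then be matched against the accessions in data_report.jsonl.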

Dec 12 '25 20:12 joverlee521

Oh thanks for that! I gave it a try, it's definitely nice to not download all the sequence data when we don't need it. But it looks like datasets is still doing one request per assembly, so we're not gaining that much speed actually. But still worth considering the switch!

Dec 15 '25 10:12 fhennig