Include corresponding GCA/GCF accession in the metadata of each sample
Is your feature request related to a problem? Please describe.
Thanks again for this amazing tool!
I download all influenza A sequences using
datasets download virus genome taxon 11320 --filename data.zip
I would like to group all influenza A segments that come from the same assembly/isolate. I have been using the isolate metadata field for this purpose, but that is a workaround and the field is not always available. I noticed that I could download all assemblies to see which samples are in each assembly (sadly, this information is not included in the assembly summary field), but I'm having issues with a download of this size. Having the GCA accession in the sample metadata (data_report.jsonl) would greatly simplify this process and give me greater certainty that I have grouped the segments correctly.
Describe the solution you'd like
Each assembly contains at most 8 influenza segments that were sequenced together, and each assembly has a unique ID. For each nucleotide sequence that belongs to an assembly, I would like to see the GCA/GCF accession as a metadata field in data_report.jsonl so that I can use it to group samples.
Thank you
Thanks for your feedback--your feature requests help improve NCBI Datasets.
Hi anna-parker,
Thank you for the great suggestion! I’ve created a ticket for this feature request and will work on it soon.
All the best,
Nuala
Hey! Just checking in on whether this is still planned to be worked on soon. We still need it!
Perhaps an alternative to downloading the full assemblies is to download only the summary:
- Get a summary of all assembly accessions for taxon
datasets summary genome taxon 11320 --as-json-lines --report ids_only \
| dataformat tsv genome --fields accession --elide-header > data/11320-genome-accessions.tsv
- Use assembly accessions to get a summary of all linked GenBank accessions
datasets summary genome accession --as-json-lines --inputfile data/11320-genome-accessions.tsv --report sequence \
| dataformat tsv genome-seq --fields accession,genbank-seq-acc > data/11320-genome-genbank-accessions.tsv
I have not tested the speed, but the files are definitely smaller.
Oh, I missed that the --report sequence option is available for taxon too, so this can be a single step:
datasets summary genome taxon 11320 --as-json-lines --report sequence \
| dataformat tsv genome-seq --fields accession,genbank-seq-acc > data/11320-genome-genbank-accessions.tsv
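For the grouping itself, here is a minimal Python sketch of how the two-column TSV from the command above could be turned into a lookup. It assumes column 1 is the assembly accession and column 2 is the GenBank sequence accession, in the order given to --fields, and that the first row is a header (the command above does not pass --elide-header). The function names group_by_assembly and invert are hypothetical helpers, not part of datasets or dataformat.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_assembly(rows: Iterable[str], has_header: bool = True) -> Dict[str, List[str]]:
    """Map each assembly accession to the GenBank sequence accessions listed for it.

    Assumes tab-separated rows with the assembly accession in column 1 and the
    GenBank sequence accession in column 2 (the --fields order used above).
    """
    reader = csv.reader(rows, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row without relying on its exact text
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in reader:
        # Skip rows where either accession is missing
        if len(row) < 2 or not row[0] or not row[1]:
            continue
        groups[row[0]].append(row[1])
    return dict(groups)


def invert(groups: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each GenBank sequence accession back to its assembly accession."""
    return {seq: assembly for assembly, seqs in groups.items() for seq in seqs}
```

To use it, open data/11320-genome-genbank-accessions.tsv and pass the file handle to group_by_assembly. The inverted mapping can then be looked up with the sequence accession of each record in data_report.jsonl, assuming those records identify sequences by the same versioned GenBank accession.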
Oh thanks for that! I gave it a try, it's definitely nice to not download all the sequence data when we don't need it. But it looks like datasets is still doing one request per assembly, so we're not gaining that much speed actually. But still worth considering the switch!