
"Total" sequence length doesn't include organelles

Open muffato opened this issue 1 year ago • 1 comment

Before opening an issue, please:

  • [x] Make sure you are using the latest version (check with datasets --version)
  • [x] Review our documentation

Describe the bug

Hello NCBI!

The assembly GCA_964199945.1 is reported as having a "Total Sequence Length" of 1,327,610,284 bp, but the FASTA file actually contains 1,328,070,353 bp. The difference (460,069 bp) is exactly the combined length of the mitochondrion (MT) and the plastid.

In https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_genome/:

assmstats-total-sequence-len Assembly Stats Total Sequence Length

To Reproduce

$ datasets summary genome accession GCA_964199945.1 --as-json-lines | dataformat tsv genome --fields assmstats-total-sequence-len --elide-header
1327610284
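
For anyone who wants to check the discrepancy independently, here is a minimal Python sketch of how the FASTA total can be recomputed and compared with the reported value. The helper names (`fasta_lengths`, `unreported_bases`) and the tiny demo records are hypothetical and only illustrate the approach; the two totals are the ones quoted in this issue. It assumes a plain-text (uncompressed) FASTA stream.

```python
import io

# Totals quoted in this issue for GCA_964199945.1.
REPORTED_TOTAL = 1_327_610_284  # assmstats-total-sequence-len
FASTA_TOTAL = 1_328_070_353     # bases counted in the FASTA file


def fasta_lengths(handle):
    """Return {sequence_id: length} for a plain-text FASTA stream."""
    lengths = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence ID is the first whitespace-delimited token.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def unreported_bases(reported_total, lengths):
    """Bases present in the FASTA but missing from the reported total."""
    return sum(lengths.values()) - reported_total


# Hypothetical toy input: one nuclear record and one organelle record.
demo = io.StringIO(">chr1 nuclear\nACGTACGT\nACGT\n>MT mitochondrion\nACGTA\n")
demo_lengths = fasta_lengths(demo)
print(demo_lengths)
print(unreported_bases(12, demo_lengths))

# Gap between the two totals quoted above.
print(FASTA_TOTAL - REPORTED_TOTAL)
```

On a real assembly, the same function could be pointed at the opened FASTA file instead of the toy `StringIO`, and the per-sequence lengths would show which records account for the gap.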

Expected behavior

I would expect the "total" sequence length to include everything. Otherwise, I would call it the length of the "nuclear" genome only.

Best regards, Matthieu

muffato Sep 17 '24 20:09

Hi muffato

Thank you for highlighting this issue. I agree that it could be clearer, and we’ll work on improving it.

Nuala

olearyna Sep 18 '24 20:09