method suggestions to access data from large list of accession ids between v13 and v14

Open csgenomics-admin opened this issue 1 year ago • 2 comments

Hi all,

Thank you for your continued development of a terrific set of tools!

We have been using v13 (v13.43.2) of NCBI Datasets via the CLI to access gene summary data from an input list of about 20,000 transcript identifiers; the program handles the request without issue. However, when I tested the v14 CLI this morning with the same list, I got the following error:

Error: [gateway] Internal server error - request timed out (For more help, see the NCBI Datasets Documentation at https://www.ncbi.nlm.nih.gov/datasets/docs/) (2238ABA88E40A8A21A5DD2E8.3.1)

In the example below, I've created a list of integers and am using those as NCBI Gene IDs, even though in my case I supply Transcript IDs. The error is identical whether I supply Gene or Transcript IDs:

# creating two lists: one with 20 Gene IDs, one with 20,000 Gene IDs
seq 20 > list_small.txt
seq 20000 > list_large.txt

# accessing the data
datasets summary gene gene-id --inputfile list_small.txt --as-json-lines > small.jsonl
datasets summary gene gene-id --inputfile list_large.txt --as-json-lines > large.jsonl

If I run the above code using v13, I get both the small.jsonl and large.jsonl outputs. However, with v14, only small.jsonl is produced.

I was wondering how to best approach the problem of sending in long lists of identifiers. Even if I pass in the --api-key flag, I still get the timeout error with v14. Perhaps with v14 I'll need to split up my list into smaller chunks - if so, can you recommend a max length of that list?
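The chunking idea raised above could be sketched roughly as follows. This is only an illustration, not a recommended or official workflow: the function name summarize_in_chunks and the default chunk size of 1000 are assumptions made for the example (no maximum list length is stated in this thread), and only the datasets command and flags already shown above are used.

```shell
#!/bin/sh
# Sketch: query a long ID list in fixed-size chunks and concatenate the results.
# summarize_in_chunks is a hypothetical helper; the default chunk size of 1000
# is an arbitrary assumption, not an NCBI-recommended limit.

summarize_in_chunks() {
    input=$1                 # file with one identifier per line
    output=$2                # combined JSON Lines output
    chunk_size=${3:-1000}    # identifiers per request (assumed value)

    tmpdir=$(mktemp -d) || return 1

    # split writes chunk_aa, chunk_ab, ... with at most chunk_size lines each
    split -l "$chunk_size" "$input" "$tmpdir/chunk_"

    : > "$output"            # start with an empty output file
    for chunk in "$tmpdir"/chunk_*; do
        [ -e "$chunk" ] || continue   # nothing to do for an empty input file
        if ! datasets summary gene gene-id --inputfile "$chunk" --as-json-lines >> "$output"; then
            echo "request failed for $chunk" >&2
            rm -rf "$tmpdir"
            return 1
        fi
    done

    rm -rf "$tmpdir"
}
```

It would be called as, for example, summarize_in_chunks list_large.txt large.jsonl 1000. Because each chunk is a separate request, a failure stops the loop early and reports the chunk that failed, rather than silently producing a partial large.jsonl.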

Relatedly, if we instead opted to use the REST API (I think this one specifically?) to access these data, might you be able to suggest how to similarly deal with a long list of identifiers (or alternatively, what the maximum size of that list should be)?

Appreciate any insights you can offer

csgenomics-admin, Oct 12 '22 16:10

Hi csgenomics-admin,

Thanks for the bug report. We have confirmed that there is an issue in CLI v14 with handling large lists of GeneIDs and are working on a fix. In the meantime, we recommend using v13. I will post an update on this thread when the bug is fixed.

Best,
Eric

Eric Cox, PhD [Contractor] (he/him/his)
NCBI Datasets
Sequence Enhancements, Tools and Delivery (SeqPlus)
NIH/NLM/NCBI

ericcox1, Oct 13 '22 15:10

Appreciate the quick response, Eric. For the moment, we've pivoted from using the CLI to obtain a single .jsonl file for all transcripts in our list to using the v13 API instead. That seems to have done the trick :)

csgenomics-admin, Oct 14 '22 02:10

Hi csgenomics-admin,

The bug in CLI v14 with handling large lists of GeneIDs has been fixed.

datasets summary gene taxon human --as-json-lines | dataformat tsv gene --elide-header --fields gene-id > human-gene-ids.list
head -20000 human-gene-ids.list > list_large.txt
datasets summary gene gene-id --inputfile list_large.txt --as-json-lines > large.jsonl

I'm closing this issue for now.

Best, Eric

ericcox1, Oct 20 '22 20:10