Add an endpoint to retrieve dataset IDs

Open arteymix opened this issue 4 years ago • 5 comments

To migrate Nathaniel's Jython scripts to the RESTful API, we need an efficient way to retrieve a lot of datasets at once. This cannot be done with the current /datasets endpoint because it would dump too much data.

By retrieving only the IDs, one can simply synchronize the relevant entities to save precious time.

Option A) a brand new endpoint for getting all IDs

GET /datasets/allIds HTTP/1.1
  • [ ] cache the query that retrieves the IDs
  • [x] add a new loadAllIdsPreFilter to FilteringVoEnabledDao (even though it belongs in a FilteringDao, which does not exist yet)
  • [ ] support filtering & sorting (but no offset/limit since this is meant for bulk retrieval)

Downside: a dataset with the short name allIds would create a conflict.

Option B) an extract query parameter used to extract specific columns:

GET /datasets?extract=id

Upside: very versatile (?extract=shortName, ?extract=databaseEntry.accession, ?extract=arrayDesign.id, etc.)

Downside: maybe a bit trickier to make generic.
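To illustrate what a generic extract could involve, here is a minimal Python sketch that resolves a dotted property path against dataset-like records. The `extract` helper and the sample payloads are hypothetical and only mirror the property names mentioned above; this is not how Gemma implements or would implement the parameter.

```python
from collections.abc import Mapping
from typing import Any


def extract(records, path: str) -> list:
    """Return the value found at a dotted path for each record (None if missing)."""
    keys = path.split(".")
    values = []
    for record in records:
        current: Any = record
        for key in keys:
            # Stop descending as soon as a level is missing or is not a mapping.
            if not isinstance(current, Mapping) or key not in current:
                current = None
                break
            current = current[key]
        values.append(current)
    return values


# Hypothetical dataset payloads, for illustration only.
datasets = [
    {"id": 1, "shortName": "GSE1", "databaseEntry": {"accession": "GSE1"}},
    {"id": 2, "shortName": "GSE2", "databaseEntry": None},
]

print(extract(datasets, "id"))
print(extract(datasets, "databaseEntry.accession"))
```

The tricky part hinted at in the downside is visible even here: nested properties can be null, and each extracted path yields a different value type, so the response schema depends on the query.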

We also need to think about what this would look like for result sets, platforms, genes, etc.

Involved collaborators:

  • @nzllim
  • @Tu1026 will likely do the implementation on the client side (rewriting Nathaniel's scripts, etc.)

arteymix avatar Oct 19 '21 19:10 arteymix

The best would be an endpoint. All the support needed for getting a set of IDs given a filter is already there, and it's even possible to sort them.

arteymix avatar Aug 16 '23 05:08 arteymix

The ?extract approach is not very OpenAPI-friendly because it will produce two endpoints with different payload signatures.

arteymix avatar Jan 25 '24 18:01 arteymix

Don't know how necessary this is. Nathaniel's scripts ran weekly. Getting all datasets by pagination using the /datasets endpoint takes a little more than a minute. Would recommend closing this one.

oganm avatar Feb 16 '24 01:02 oganm

Me neither to be honest. If you need to dump IDs and filter them in some fashion, there's already the filter parameter for that purpose.

The only downside with pages for dumping everything is that it is not consistent: datasets can be modified, created or even deleted while browsing is happening.

arteymix avatar Feb 16 '24 16:02 arteymix
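For reference, a client-side dump by pagination could be sketched as below in Python. The `fetch_page(offset, limit)` callable and the `data` / `totalElements` field names are assumptions made for this example rather than a confirmed description of the /datasets payload, and the in-memory backend stands in for the real API. The total-count check only catches creations or deletions that change the count; it does not catch modifications, or a creation and a deletion that cancel out.

```python
def collect_ids(fetch_page, limit=100):
    """Page through a collection and return the IDs in the order received."""
    ids = []
    seen = set()
    offset = 0
    expected_total = None
    while True:
        page = fetch_page(offset, limit)
        total = page["totalElements"]  # assumed field name
        if expected_total is None:
            expected_total = total
        elif total != expected_total:
            # The collection grew or shrank while we were paging.
            raise RuntimeError("collection changed while paging; restart the dump")
        rows = page["data"]  # assumed field name
        if not rows:
            break
        for row in rows:
            # Skip duplicates in case rows shifted between pages.
            if row["id"] not in seen:
                seen.add(row["id"])
                ids.append(row["id"])
        offset += len(rows)
        if offset >= total:
            break
    return ids


def make_fake_backend(all_ids):
    """In-memory stand-in for the REST API, for illustration only."""
    def fetch_page(offset, limit):
        window = all_ids[offset:offset + limit]
        return {"data": [{"id": i} for i in window], "totalElements": len(all_ids)}
    return fetch_page


ids = collect_ids(make_fake_backend(list(range(1, 251))), limit=100)
print(len(ids))
```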

Could we add a little hash that represents the whole result as part of the output? That'd allow checking for integrity

oganm avatar Feb 16 '24 18:02 oganm
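One possible shape for such a hash, sketched in Python: a SHA-256 digest over the sorted, de-duplicated IDs, so that the server and the client can compare a single string after a paginated dump. The `ids_digest` helper and the canonical comma-separated form are assumptions for this example; nothing in the thread settles what the hash would actually cover.

```python
import hashlib


def ids_digest(ids):
    """Return a hex SHA-256 digest that is independent of the order of the IDs."""
    # Canonical form: unique IDs, sorted ascending, comma-separated.
    canonical = ",".join(str(i) for i in sorted(set(ids)))
    return hashlib.sha256(canonical.encode("ascii")).hexdigest()


# The same set of IDs gives the same digest regardless of page order.
print(ids_digest([3, 1, 2]) == ids_digest([1, 2, 3]))
# A different set of IDs gives a different digest.
print(ids_digest([1, 2, 3]) == ids_digest([1, 2, 4]))
```

A digest over IDs alone would detect created or deleted datasets but not modified ones; covering modifications would require hashing something like a per-dataset version or last-updated value as well.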