Add an endpoint to retrieve dataset IDs
To migrate Nathaniel's Jython scripts to use the RESTful API, we need an efficient way to retrieve a large number of datasets at once. This cannot be done with the current `/datasets` endpoint because it would dump too much data.
By retrieving only the IDs, a client can synchronize just the relevant entities and save precious time.
Option A) a brand new endpoint for getting all IDs
`GET /datasets/allIds HTTP/1.1`
- [ ] cache the query that retrieves the IDs
- [x] add a new `loadAllIdsPreFilter` to `FilteringVoEnabledDao` (even if that belongs instead in a `FilteringDao`, but that does not exist yet)
- [ ] support filtering & sorting (but no offset/limit since this is meant for bulk retrieval)
Downside: a dataset with the short name `allIds` would create a conflict.
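For a concrete feel of Option A, here is a minimal client-side sketch. The `/datasets/allIds` route does not exist yet; the base URL, filter/sort syntax, and response envelope are assumptions for illustration only:

```python
import requests

# Hypothetical client-side view of Option A; the /datasets/allIds route does
# not exist yet, and the filter syntax, sort syntax, and response envelope
# are assumptions for illustration only.
BASE_URL = "https://gemma.msl.ubc.ca/rest/v2"

response = requests.get(
    f"{BASE_URL}/datasets/allIds",
    params={
        "filter": "curationDetails.troubled = false",  # assumed filter expression
        "sort": "+id",                                  # assumed sort expression
    },
)
response.raise_for_status()
dataset_ids = response.json()["data"]  # assumed {"data": [...]} envelope
print(f"Retrieved {len(dataset_ids)} dataset IDs")
```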
Option B) an `extract` query parameter used to extract specific columns:
`GET /datasets?extract=id`
Upside: very versatile: `?extract=shortName`, `?extract=databaseEntry.accession`, `?extract=arrayDesign.id`, etc.
Downside: maybe a bit trickier to make it generic.
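On the client side, Option B would look much the same; a minimal sketch, assuming a hypothetical `extract` parameter that returns a flat list of values for the requested column:

```python
import requests

BASE_URL = "https://gemma.msl.ubc.ca/rest/v2"  # assumed base URL

# Hypothetical Option B: the extract query parameter does not exist yet.
# Each call would return the values of one column for all matching datasets.
for column in ("id", "shortName", "databaseEntry.accession"):
    response = requests.get(f"{BASE_URL}/datasets", params={"extract": column})
    response.raise_for_status()
    values = response.json()["data"]  # assumed {"data": [...]} envelope
    print(column, values[:5])
```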
We also need to think about what this would look like for result sets, platforms, genes, etc.
Involved collaborators:
- @nzllim
- @Tu1026 will likely do the implementation on the client side (rewriting Nathaniel scripts, etc.)
The best would be an endpoint. There's already all the support needed for getting a set of IDs given a filter, and it's even possible to sort them.
The `?extract` approach is not very OpenAPI-friendly because it would effectively produce two endpoints with different payload signatures.
Don't know how necessary this is. Nathaniel's scripts ran weekly, and getting all datasets by pagination using the `/datasets` endpoint takes a little more than a minute. Would recommend closing this one.
Me neither, to be honest. If you need to dump IDs and filter them in some fashion, there's already the `filter` parameter for that purpose.
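For reference, the paginated dump mentioned above is roughly the following on the client side; a sketch only, where the `offset`/`limit`/`filter` parameter names and the response envelope are assumptions about the current API:

```python
import requests

BASE_URL = "https://gemma.msl.ubc.ca/rest/v2"  # assumed base URL
PAGE_SIZE = 100

def all_dataset_ids(filter_expr=None):
    """Collect dataset IDs by paging through the existing /datasets endpoint."""
    ids, offset = [], 0
    while True:
        params = {"offset": offset, "limit": PAGE_SIZE}
        if filter_expr:
            params["filter"] = filter_expr  # e.g. a curation-status filter
        response = requests.get(f"{BASE_URL}/datasets", params=params)
        response.raise_for_status()
        page = response.json()["data"]  # assumed {"data": [...]} envelope
        if not page:
            break
        ids.extend(dataset["id"] for dataset in page)
        if len(page) < PAGE_SIZE:
            break
        offset += PAGE_SIZE
    return ids
```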
The only downside with paging for dumping everything is that it is not consistent: datasets can be modified, created, or even deleted while the browsing is happening.
Could we add a little hash that represents the whole result as part of the output? That would allow checking for integrity.
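One way that integrity check could work, purely a sketch of the idea and not an existing feature: hash the sorted ID list, return it alongside the payload, and let the client recompute it after paging through everything:

```python
import hashlib

def result_hash(dataset_ids):
    """Hash a sorted list of IDs so two dumps of the same result can be compared."""
    digest = hashlib.sha256()
    for dataset_id in sorted(dataset_ids):
        digest.update(str(dataset_id).encode("ascii"))
        digest.update(b"\n")
    return digest.hexdigest()

# If the server reported such a hash with the result, the client could recompute
# it over the IDs it collected and detect whether the underlying result changed
# while the pages were being fetched.
```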