nextstrain.org
Improve / get rid of the manifest JSON
Currently the nextstrain.org server (for "core" builds and "staging" builds) fetches a "manifest" JSON from the respective S3 buckets. Historically, this data was hardcoded in auspice. There was, for a time, a concept of "users" (which is why it's labelled "manifest_guest.json"); however, this is no longer used or desired.
This manifest data is used for two purposes:
- To generate the appropriate response to an auspice-derived "getAvailable" request (see https://nextstrain.github.io/auspice/server/api#charon-getavailable, https://github.com/nextstrain/nextstrain.org/blob/master/auspice/server/setAvailableDatasets.js). Auspice uses this to display the different datasets in the sidebar dropdowns. Note that the format of the response to the "getAvailable" request differs from the JSON on the S3 bucket.
- To potentially redirect requests - e.g. /mumps goes to /mumps/na. This is related to https://github.com/nextstrain/nextstrain.org/issues/40, where community builds do not handle redirects such as these. See https://github.com/nextstrain/nextstrain.org/blob/master/auspice/server/getDatasetHelpers.js#L98.
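As a rough illustration of the first purpose, here is a minimal sketch of building a "getAvailable"-style payload from a flat list of dataset paths. The function name `buildAvailableResponse` is hypothetical, and the `{datasets: [{request}], narratives: [{request}]}` shape is an assumption that should be checked against the Auspice API docs linked above; this is not the server's current code.

```javascript
// Hypothetical helper, not part of the nextstrain.org codebase.
// Assumes the response shape {datasets: [{request}], narratives: [{request}]}.
function buildAvailableResponse(datasetPaths, narrativePaths = []) {
  const toEntries = (paths) => {
    const normalized = paths
      .map((p) => p.replace(/^\/+|\/+$/g, "")) // strip leading/trailing slashes
      .filter((p) => p.length > 0);            // ignore empty paths
    // De-duplicate after normalizing, then sort for a stable dropdown order.
    return [...new Set(normalized)].sort().map((request) => ({ request }));
  };
  return {
    datasets: toEntries(datasetPaths),
    narratives: toEntries(narrativePaths),
  };
}

// Example usage with paths taken from this issue:
const available = buildAvailableResponse([
  "/mumps/na",
  "mumps/na",
  "flu/seasonal/h3n2/ha/2y",
]);
console.log(JSON.stringify(available));
```

The point is only that the response can be derived from a list of paths, wherever that list comes from (manifest or otherwise).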
The manifest JSON has been problematic from time to time, but in general it has been adequate for our needs. It is designed so that "new" JSONs are first pushed to the staging server, whose datasets should generally mirror those on the core bucket. However, neither of those has been enforced.
It may now be time to replace the manifest JSON with a more robust and scalable solution.
This came up again today on Slack. Two considerations if we remove manifest_guest.json and start crawling S3 for core/staging:
- [ ] There's a bunch of stuff in the buckets that isn't listed in the manifests which would start showing up, so we'd want to remove or clean those up.
- [ ] We'd want to look at the perf. impact of having to crawl for each available datasets request. I think it'd be unwise to not add caching here, even though it'd add a bit of work.
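For the caching point, a minimal sketch of a time-based cache around the crawl might look like the following. Everything here is hypothetical: `cachedFor` is an illustrative name, `fetchFn` stands in for whatever function ends up listing the bucket, and the TTL value is arbitrary.

```javascript
// Wrap an async function so its result is reused for `ttlMs` milliseconds.
// `fetchFn` is a placeholder for the (not yet written) S3 crawl.
// `now` is injectable so the expiry logic can be tested without waiting.
function cachedFor(ttlMs, fetchFn, now = Date.now) {
  let value;
  let expiresAt = -Infinity;
  let inFlight = null; // shared promise so concurrent callers trigger one crawl

  return function getCached() {
    if (now() < expiresAt) return Promise.resolve(value);
    if (!inFlight) {
      inFlight = Promise.resolve()
        .then(() => fetchFn())
        .then((result) => {
          value = result;
          expiresAt = now() + ttlMs;
          return result;
        })
        .finally(() => {
          inFlight = null; // on failure, the next call retries the crawl
        });
    }
    return inFlight;
  };
}

// Example usage with a stand-in crawl function:
const getAvailableDatasets = cachedFor(60 * 1000, async () => ["mumps/na"]);
getAvailableDatasets().then((datasets) => console.log(datasets));
```

Sharing the in-flight promise means a burst of "available datasets" requests after expiry results in one crawl rather than one per request.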
The resolving of partial dataset paths to defaults could be maintained without the manifest file by implementing them elsewhere in the server instead. This would be very straightforward given my recentish routing work. It's a relatively short list of redirects, so could just be hardcoded and applied for core/staging. As generated from the current manifest:
/dengue → /dengue/denv1
/enterovirus → /enterovirus/d68/genome
/enterovirus/d68 → /enterovirus/d68/genome
/flu/avian → /flu/avian/h5n1/ha
/flu/avian/h5n1 → /flu/avian/h5n1/ha
/flu/avian/h5nx → /flu/avian/h5nx/ha
/flu/avian/h7n9 → /flu/avian/h7n9/ha
/flu/avian/h9n2 → /flu/avian/h9n2/ha
/flu → /flu/seasonal/h3n2/ha/2y
/flu/seasonal → /flu/seasonal/h3n2/ha/2y
/flu/seasonal/h3n2 → /flu/seasonal/h3n2/ha/2y
/flu/seasonal/h3n2/ha → /flu/seasonal/h3n2/ha/2y
/flu/seasonal/h3n2/na → /flu/seasonal/h3n2/na/2y
/flu/seasonal/h1n1pdm → /flu/seasonal/h1n1pdm/ha/2y
/flu/seasonal/h1n1pdm/ha → /flu/seasonal/h1n1pdm/ha/2y
/flu/seasonal/h1n1pdm/na → /flu/seasonal/h1n1pdm/na/2y
/flu/seasonal/vic → /flu/seasonal/vic/ha/2y
/flu/seasonal/vic/ha → /flu/seasonal/vic/ha/2y
/flu/seasonal/vic/na → /flu/seasonal/vic/na/2y
/flu/seasonal/yam → /flu/seasonal/yam/ha/2y
/flu/seasonal/yam/ha → /flu/seasonal/yam/ha/2y
/flu/seasonal/yam/na → /flu/seasonal/yam/na/2y
/mumps → /mumps/na
/ncov → /ncov/gisaid/global
/ncov/gisaid → /ncov/gisaid/global
/ncov/open → /ncov/open/global
/tb → /tb/global
/WNV → /WNV/NA
Though not all apply, as some are currently shadowed by other routing changes.
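A minimal sketch of the hardcoded approach, assuming the list above is kept as plain "from → to" text and parsed into a lookup table. The names `parseRedirects`, `resolveDefault`, and `REDIRECTS` are hypothetical, only an excerpt of the list is included, and how this would hook into the existing routing is left open.

```javascript
// Parse lines of the form "/from → /to" into a Map (blank lines are skipped).
function parseRedirects(text) {
  const table = new Map();
  for (const line of text.split("\n")) {
    const [from, to] = line.split("→").map((s) => s.trim());
    if (from && to) table.set(from, to);
  }
  return table;
}

// Return the default dataset path for a partial path, or null if none applies.
// Matching is exact and case-sensitive (note /WNV), ignoring trailing slashes.
function resolveDefault(table, pathname) {
  const normalized = pathname.replace(/\/+$/, "");
  return table.get(normalized) ?? null;
}

// Excerpt of the list above; the real table would carry every entry.
const REDIRECTS = parseRedirects(`
/flu → /flu/seasonal/h3n2/ha/2y
/flu/seasonal → /flu/seasonal/h3n2/ha/2y
/mumps → /mumps/na
/WNV → /WNV/NA
`);

console.log(resolveDefault(REDIRECTS, "/flu/"));
console.log(resolveDefault(REDIRECTS, "/mumps/na"));
```

Each entry points straight at its final target, so a single lookup is enough and there is no need to chain redirects.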