nextstrain.org Improve / get rid of the manifest JSON

Open jameshadfield opened this issue 5 years ago • 1 comments

Currently the nextstrain.org server (for "core" builds and "staging" builds) fetches a "manifest" JSON from the respective S3 buckets. Historically, this data was hardcoded in auspice. There was, for a time, a concept of "users" (which is why the file is named "manifest_guest.json"), but this is no longer used or desired.

This manifest data is used for two purposes:

To generate the appropriate response to an auspice-derived "getAvailable" request (see https://nextstrain.github.io/auspice/server/api#charon-getavailable, https://github.com/nextstrain/nextstrain.org/blob/master/auspice/server/setAvailableDatasets.js). Auspice uses this to display the different datasets in the sidebar dropdowns. Note that the format of the "getAvailable" response differs from the JSON on the S3 bucket.
To potentially redirect requests - e.g. /mumps goes to /mumps/na. This is related to https://github.com/nextstrain/nextstrain.org/issues/40, where community builds do not handle redirects such as these. See https://github.com/nextstrain/nextstrain.org/blob/master/auspice/server/getDatasetHelpers.js#L98.

The manifest JSON has been problematic from time to time, but in general it has been adequate for our needs. It is designed so that "new" JSONs are first pushed to the staging server, whose datasets should generally mirror those on the core bucket; however, neither of these has been enforced.

It may now be time to replace the manifest JSON with a more robust and scalable solution.

Dec 09 '19 06:12 jameshadfield

This came up again today on Slack. Two considerations if we remove manifest_guest.json and start crawling S3 for core/staging:

[ ] There's a bunch of stuff in the buckets that isn't listed in the manifests and would start showing up, so we'd want to remove/clean up those.
[ ] We'd want to look at the perf. impact of having to crawl for each available-datasets request. I think it'd be unwise not to add caching here, even though it'd add a bit of work.
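
To make the caching point concrete, here is a minimal sketch of a time-limited cache around an async loader. Everything in it is an assumption for illustration: `cacheWithTtl` and `getCoreDatasets` are hypothetical names, the loader is a stand-in for whatever would actually list the bucket, and the one-minute lifetime is arbitrary.

```javascript
// Hypothetical helper (not from the nextstrain.org codebase): wraps an async
// loader so its result is reused until ttlMs has elapsed.
function cacheWithTtl(loader, ttlMs, now = Date.now) {
  let cached = null;   // the in-flight or settled promise from the last load
  let expiresAt = 0;   // timestamp after which we load again

  return function get() {
    if (cached === null || now() >= expiresAt) {
      expiresAt = now() + ttlMs;
      // Call the loader immediately; a synchronous throw becomes a rejection.
      const fresh = new Promise((resolve) => resolve(loader())).catch((err) => {
        // Don't keep serving a failed crawl; let the next request retry.
        if (cached === fresh) cached = null;
        throw err;
      });
      cached = fresh;
    }
    return cached;
  };
}

// Usage with a stand-in loader; a real one would crawl the S3 bucket.
const getCoreDatasets = cacheWithTtl(async () => ["mumps/na", "tb/global"], 60 * 1000);
```

Caching the promise rather than the resolved value means concurrent requests arriving during a crawl share that one crawl instead of each starting their own.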

The resolving of partial dataset paths to defaults could be maintained without the manifest file by implementing them elsewhere in the server instead. This would be very straightforward given my recentish routing work. It's a relatively short list of redirects, so could just be hardcoded and applied for core/staging. As generated from the current manifest:

/dengue → /dengue/denv1
/enterovirus → /enterovirus/d68/genome
/enterovirus/d68 → /enterovirus/d68/genome
/flu/avian → /flu/avian/h5n1/ha
/flu/avian/h5n1 → /flu/avian/h5n1/ha
/flu/avian/h5nx → /flu/avian/h5nx/ha
/flu/avian/h7n9 → /flu/avian/h7n9/ha
/flu/avian/h9n2 → /flu/avian/h9n2/ha
/flu → /flu/seasonal/h3n2/ha/2y
/flu/seasonal → /flu/seasonal/h3n2/ha/2y
/flu/seasonal/h3n2 → /flu/seasonal/h3n2/ha/2y
/flu/seasonal/h3n2/ha → /flu/seasonal/h3n2/ha/2y
/flu/seasonal/h3n2/na → /flu/seasonal/h3n2/na/2y
/flu/seasonal/h1n1pdm → /flu/seasonal/h1n1pdm/ha/2y
/flu/seasonal/h1n1pdm/ha → /flu/seasonal/h1n1pdm/ha/2y
/flu/seasonal/h1n1pdm/na → /flu/seasonal/h1n1pdm/na/2y
/flu/seasonal/vic → /flu/seasonal/vic/ha/2y
/flu/seasonal/vic/ha → /flu/seasonal/vic/ha/2y
/flu/seasonal/vic/na → /flu/seasonal/vic/na/2y
/flu/seasonal/yam → /flu/seasonal/yam/ha/2y
/flu/seasonal/yam/ha → /flu/seasonal/yam/ha/2y
/flu/seasonal/yam/na → /flu/seasonal/yam/na/2y
/mumps → /mumps/na
/ncov → /ncov/gisaid/global
/ncov/gisaid → /ncov/gisaid/global
/ncov/open → /ncov/open/global
/tb → /tb/global
/WNV → /WNV/NA

Though not all apply, as some are currently shadowed by other routing changes.
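
As a rough sketch of what hardcoding could look like (the names `CORE_REDIRECTS` and `resolveDefaultPath` are made up for illustration, and only a handful of the entries above are included), a plain lookup table is enough because every entry in the list already points at its final target:

```javascript
// Hypothetical table: a subset of the redirects listed above.
const CORE_REDIRECTS = new Map([
  ["/dengue", "/dengue/denv1"],
  ["/flu", "/flu/seasonal/h3n2/ha/2y"],
  ["/flu/seasonal/h3n2", "/flu/seasonal/h3n2/ha/2y"],
  ["/mumps", "/mumps/na"],
  ["/ncov", "/ncov/gisaid/global"],
  ["/tb", "/tb/global"],
  ["/WNV", "/WNV/NA"],
]);

// Returns the default dataset path for a partial path, or the (normalized)
// path itself when no redirect is defined. Matching is exact and
// case-sensitive, which is an assumption of this sketch.
function resolveDefaultPath(requestPath) {
  // Drop trailing slashes so "/mumps/" matches "/mumps"; keep "/" as-is.
  const normalized = requestPath.replace(/\/+$/, "") || "/";
  return CORE_REDIRECTS.get(normalized) || normalized;
}
```

A single lookup suffices here; chained resolution would only be needed if some targets were themselves partial paths.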

Mar 08 '22 23:03 tsibley
