`/nextclade/sars-cov-2` is outdated
The data on https://nextstrain.org/nextclade/sars-cov-2 is not up to date with the latest SARS-CoV-2 Nextclade dataset releases at https://github.com/nextstrain/nextclade_data.
This is not desirable since users may go to the page thinking they have the latest data, when in reality it is now 6 months out of date. This matters especially because the page is highlighted on the home page under Featured Analyses and shown on the Pathogens page.
Possible solutions
Extracted from meeting notes.
- [@victorlin ✅] (extra-short-term) Add a warning to the page with instructions on how to view the latest tree.
- 🟢 (short-term) Mirroring of Nextclade datasets under data.nextstrain.org/nextclade/
  - This matches the current strategy for data.nextstrain.org/nextclade_sars-cov-2.json (but this hasn't been updated since Oct; the automated dataset provisioning would need to be updated to push here as well)
  - This has the nice feature of allowing simple calls with dates to pull up the correct nomenclature, i.e. `nextstrain remote download nextstrain.org/nextclade/sars-cov-2@2023-01-01`
  - This would also surface datasets in a nice fashion in the cards UI that displays update frequency, etc.
  - This mirroring would cause multiple sources of truth
  - The data being stuck at the Oct 2024 update is a bug: the new update scripts should be pushing to data.nextstrain.org as well as making Nextclade releases. This bug should be fixed.
- ⛔️ Some other solution that gives a persistent URL
  - Or relying on GitHub? github.com/nextstrain/nextclade_data/blob/master/data/nextstrain/sars-cov-2/wuhan-hu-1/orfs/tree.json
  - In this case, I'd love an "official" strategy, i.e. work directly from github.com/nextstrain/nextclade_data/tree/master/data/nextstrain, etc.
  - Or is the official strategy to use `nextclade dataset get` and not use a URL? This seems to be referenced in the docs under Online Dataset Repository
  - `nextclade dataset get` is what Trevor should be using now for programmatically working with the SARS-CoV-2 Pango dataset; curling is not supported.
- 🟢 (long-term) Or what about taking data.clades.nextstrain.org and exposing Auspice JSONs in this bucket in an analogous fashion to how we expose Auspice JSONs in data.nextstrain.org
  - This would allow a simple cards UI at nextstrain.org/nextclade-datasets/ (or some other URL)
  - This is the preferred strategy, moving forward.
nextstrain.org can read the Nextclade datasets index, https://data.clades.nextstrain.org/v3/index.json, instead of crawling the bucket. These are the datasets (by full name) that have a tree.json file:
jq> .collections | map(.datasets[] | select(.files.treeJson) | .path)
[
"nextstrain/sars-cov-2/wuhan-hu-1/orfs",
"nextstrain/sars-cov-2/wuhan-hu-1/proteins",
"nextstrain/sars-cov-2/BA.2.86",
"nextstrain/flu/h1n1pdm/ha/CY121680",
"nextstrain/flu/h1n1pdm/ha/MW626062",
"nextstrain/flu/h1n1pdm/na/MW626056",
"nextstrain/flu/h3n2/ha/CY163680",
"nextstrain/flu/h3n2/ha/EPI1857216",
"nextstrain/flu/h3n2/na/EPI1857215",
"nextstrain/flu/vic/ha/KX058884",
"nextstrain/flu/vic/na/CY073894",
"nextstrain/flu/yam/ha/JN993010",
"nextstrain/rsv/a/EPI_ISL_412866",
"nextstrain/rsv/b/EPI_ISL_1653999",
"nextstrain/mpox/all-clades",
"nextstrain/mpox/clade-i",
"nextstrain/mpox/clade-iib",
"nextstrain/mpox/lineage-b.1",
"nextstrain/orthoebolavirus/ebov",
"nextstrain/measles/genome/WHO-2012",
"nextstrain/measles/N450/WHO-2012",
"nextstrain/dengue/all",
"nextstrain/yellow-fever/prM-E",
"nextstrain/hmpv/all-clades/NC_039199",
"nextstrain/rubella/E1",
"nextstrain/mumps/sh",
"nextstrain/mumps/genome",
"nextstrain/rubella/genome",
"nextstrain/sars-cov-2/BA.2",
"nextstrain/sars-cov-2/XBB",
"community/isuvdl/mazeller/prrsv1/orf5/yimim2025",
"community/isuvdl/mazeller/prrsv2/orf5/yimim2023",
"community/neherlab/hiv-1/hxb2",
"community/moncla-lab/iav-h5/ha/2.3.4.4",
"community/moncla-lab/iav-h5/ha/all-clades",
"community/moncla-lab/iav-h5/ha/2.3.2.1",
"community/v-gen-lab/dengue/denv1",
"community/v-gen-lab/dengue/denv2",
"community/v-gen-lab/dengue/denv3",
"community/v-gen-lab/dengue/denv4",
"community/genspectrum/marburg/HK1980/all-lineages",
"community/pathoplexus/cchfv/L",
"community/pathoplexus/cchfv/S",
"community/pathoplexus/cchfv/M",
"community/v-gen-lab/chikV/genotypes"
]
We could serve the trees for these datasets under a URL like https://nextstrain.org/nextclade/….
The index also contains version information for each dataset, so we can make those available via nextstrain.org's @YYYY-MM-DD syntax.
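The jq filter above can be mirrored in a few lines of Python. This is a minimal sketch, not the nextstrain.org implementation. The `collections[].datasets[]`, `files.treeJson`, and `path` fields come from the jq expression. The `versions` list with `tag` and `updatedAt` fields, the sample data, and the reading of `@YYYY-MM-DD` as "newest version on or before that date" are assumptions for illustration.

```python
from datetime import date, datetime

# Hypothetical, heavily trimmed stand-in for the Nextclade index; the real
# file is https://data.clades.nextstrain.org/v3/index.json.
SAMPLE_INDEX = {
    "collections": [
        {
            "datasets": [
                {
                    "path": "nextstrain/mpox/clade-iib",
                    "files": {"treeJson": "tree.json"},
                    # Assumed shape of the version list; the tags are made up.
                    "versions": [
                        {"tag": "2024-04-19--07-50-39Z", "updatedAt": "2024-04-19T07:50:39Z"},
                        {"tag": "2024-01-16--20-31-02Z", "updatedAt": "2024-01-16T20:31:02Z"},
                    ],
                },
                # A made-up dataset without a tree, which should be skipped.
                {"path": "example/no-tree", "files": {}, "versions": []},
            ]
        }
    ]
}


def tree_dataset_paths(index):
    """Return the paths of datasets that ship a tree, like the jq filter."""
    return [
        dataset["path"]
        for collection in index["collections"]
        for dataset in collection["datasets"]
        if dataset.get("files", {}).get("treeJson")
    ]


def version_as_of(dataset, day):
    """Return the newest version updated on or before `day`, or None."""
    candidates = [
        version
        for version in dataset["versions"]
        if datetime.fromisoformat(version["updatedAt"].replace("Z", "+00:00")).date() <= day
    ]
    # ISO 8601 timestamps in one format sort correctly as plain strings.
    return max(candidates, key=lambda version: version["updatedAt"], default=None)
```

With this shape, a request for a dataset at `@2024-02-01` would resolve to the January version, and a date before the first release would resolve to nothing.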
For me, the outstanding questions (all relatively minor) are of naming:
1. I assume we want to include the `community/` datasets to not special-privilege ourselves here? I don't think it'll be confusing with the existing nextstrain.org/community/… concept.
2. Do we want to include the leading `nextstrain/` part of datasets or assume that's implied? Including it is better for predictability, but it does feel repetitive and unnecessary.
3. Do we want to support aliases/short names ("shortcuts" in the index's terms), e.g. `sars-cov-2` meaning `nextstrain/sars-cov-2/wuhan-hu-1/orfs`? If so, do we expand it (i.e. via redirect) or accept it as-is as simply a secondary name?
4. Do we want to prefer the short names in some/most cases?
Some proposed answers to my questions, from my perspective:
1. Include `community/` datasets.
2. Drop the leading `nextstrain/` from dataset names, but accept it as an alias by redirecting (e.g. https://nextstrain.org/nextclade/nextstrain/mpox/clade-iib → https://nextstrain.org/nextclade/mpox/clade-iib).
3. Accept the index's shortcut names, but expand them by redirection to the canonical name (e.g. https://nextstrain.org/nextclade/hMPXV → https://nextstrain.org/nextclade/mpox/clade-iib). Some shortcuts have `_` in their name (e.g. `flu_h1n1pdm_na`); accept those both as-is and with `s{_}{/}g` applied (e.g. `flu/h1n1pdm/na`).
4. Prefer full names (minus leading `nextstrain/`) as the canonical name.
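The naming rules proposed here can be sketched as a small resolver. This is an illustration under assumptions, not the code that was implemented: `SHORTCUTS`, `build_aliases`, and `canonical_name` are hypothetical names. The first two shortcut targets are the examples from this thread, and the third is assumed for illustration.

```python
# Hypothetical shortcut table; in practice it would be built from the
# "shortcuts" listed in the Nextclade index.
SHORTCUTS = {
    "sars-cov-2": "nextstrain/sars-cov-2/wuhan-hu-1/orfs",
    "hMPXV": "nextstrain/mpox/clade-iib",
    # Assumed target, for illustration only.
    "flu_h1n1pdm_na": "nextstrain/flu/h1n1pdm/na/MW626056",
}

PREFIX = "nextstrain/"


def strip_prefix(name):
    """Drop the leading nextstrain/ from a dataset name, if present."""
    return name[len(PREFIX):] if name.startswith(PREFIX) else name


def build_aliases(shortcuts):
    """Map every accepted alias to its full dataset name."""
    aliases = {}
    for shortcut, full_name in shortcuts.items():
        aliases[shortcut] = full_name
        # Also accept the shortcut with s{_}{/}g applied.
        aliases[shortcut.replace("_", "/")] = full_name
    return aliases


def canonical_name(requested, aliases):
    """Return the canonical name that `requested` should resolve to."""
    full_name = aliases.get(requested, requested)
    return strip_prefix(full_name)
```

A server using this would redirect whenever `canonical_name(requested, aliases)` differs from `requested`, so `hMPXV`, `nextstrain/mpox/clade-iib`, and `mpox/clade-iib` all end up at `mpox/clade-iib`. The sketch does not handle a shortcut that collides with a real dataset path.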
That all sounds pretty good to me
Sounds good. Is 4 not a part of 3 (i.e. expansion is a preference for full names)? Either way, +1 for preferring full names since there can be multiple shortcuts associated with a dataset.
@victorlin Given the proposed answers, yes, 4 and 3 are two sides of the same coin.
I have this all working as proposed, with just previous versions left to expose.
I'm (still) wishing the existing versions handling was fully part of the Source/Resource/Subresource models rather than bolted on the side of them, as this would fit the Nextclade use case better and be more clearly extended.
In any case, I have a choice between:
1. Extending the resource index generator (`resourceIndex/main.js`) to include the Nextclade trees (based on the Nextclade index).
   Upside: It works within the same framework as our core/staging versions.
   Downside: It'll be artificially/unnecessarily laggy for new versions, by about 25h at worst.
2. Extending the resource index loader (`src/resourceIndex.js`) to include the Nextclade trees (based on the Nextclade index).
   Upside: The artificial/unnecessary lag is only ~1h at worst (but there's still unnecessary lag).
   Downside: It subverts the resource index framework a bit and maybe makes it less inspectable (e.g. you can't just look at the indexer file to see what's up).
I'm leaning towards (1) because it feels more like "playing nice" with the current design (even if I think the current design is lacking) but I feel bad about the lag.
Do folks have their own thoughts/preferences here? I'd appreciate weighing in.
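One way to read the two lag figures is as the sum of the periodic stages a new version has to wait through. The schedules below are assumptions and are not stated in this thread: a generator that runs once a day and a loader that refreshes once an hour. `worst_case_lag` is a hypothetical helper.

```python
from datetime import timedelta

# Assumed schedules, for illustration only.
GENERATOR_INTERVAL = timedelta(hours=24)  # how often the index is regenerated
LOADER_INTERVAL = timedelta(hours=1)      # how often the server reloads it


def worst_case_lag(*stage_intervals):
    """Worst-case delay when a new version just misses every periodic stage."""
    return sum(stage_intervals, timedelta())


# Option 1: a new version waits for the generator, then for the loader.
option_1 = worst_case_lag(GENERATOR_INTERVAL, LOADER_INTERVAL)

# Option 2: the loader reads the Nextclade index itself, so only its own
# refresh interval applies.
option_2 = worst_case_lag(LOADER_INTERVAL)
```

Under these assumed schedules the totals match the "about 25h" and "~1h" worst cases quoted for the two options.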
There's a third option I briefly considered but discarded because it feels too much like not "playing nice":
3. Creating a parallel resource version framework more in line with my design thinking re: Source/Resource/Subresource models and then making little adapters so it works with the existing resource version listing stuff for now.
   Upside: Shows a path towards what I think is a better design for the future.
   Downside: Introduces another way of doing things without concrete plans (since I'm leaving) to continue down that path.
Option 1 seems fine. Is there lag with option 3?
In the event that we consider option 3 in the future, do you think option 1 is a big leap in the wrong direction, or just a small leap given current usage of the models?
👍
No lag with option 3.
I don't think option 1 itself solidifies the current direction much, beyond providing a second example of working with it and not providing an example of a different (more suitable, IMO) approach.
> Creating a parallel resource version framework more in line with my design thinking re: Source/Resource/Subresource models
I described my thinking here a bit more in https://github.com/nextstrain/nextstrain.org/issues/1251.