`/nextclade/sars-cov-2` is outdated

Open victorlin opened this issue 11 months ago • 9 comments

The data on https://nextstrain.org/nextclade/sars-cov-2 is not up to date with the latest SARS-CoV-2 Nextclade dataset releases at https://github.com/nextstrain/nextclade_data.

This is undesirable because users may visit the page thinking they have the latest data when in reality it is now 6 months out of date, especially since the page is highlighted on the home page under Featured Analyses and shown on the Pathogens page.

Possible solutions

Extracted from meeting notes.

  1. [@victorlin ✅] (extra-short-term) Add a warning to the page with instructions on how to view the latest tree.

    Image

  2. 🟢 (short-term) Mirroring of Nextclade datasets under data.nextstrain.org/nextclade/

    • This matches current strategy for data.nextstrain.org/nextclade_sars-cov-2.json (but this hasn't been updated since Oct, the automated dataset provisioning would need to be updated to push here as well)

    • This has a nice feature of allowing simple calls with dates to pull up correct nomenclature, i.e. `nextstrain remote download nextstrain.org/nextclade/sars-cov-2@2023-01-01`

    • This would also surface datasets in a nice fashion in the cards UI that displays update frequency, etc.

    • This mirroring would cause multiple sources of truth

    • The stale Oct 2024 data is a bug: the new update scripts should push to data.nextstrain.org as well as make Nextclade releases. This bug should be fixed.

  3. ⛔️ Some other solution that gives a persistent URL

  4. 🟢 (long-term) Expose the Auspice JSONs in the data.clades.nextstrain.org bucket, analogous to how we expose Auspice JSONs in data.nextstrain.org.

victorlin, Apr 25 '25 21:04

nextstrain.org can read the Nextclade datasets index, https://data.clades.nextstrain.org/v3/index.json, instead of crawling the bucket. These are the datasets (by full name) that have a tree.json file:

jq> .collections | map(.datasets[] | select(.files.treeJson) | .path)
[
  "nextstrain/sars-cov-2/wuhan-hu-1/orfs",
  "nextstrain/sars-cov-2/wuhan-hu-1/proteins",
  "nextstrain/sars-cov-2/BA.2.86",
  "nextstrain/flu/h1n1pdm/ha/CY121680",
  "nextstrain/flu/h1n1pdm/ha/MW626062",
  "nextstrain/flu/h1n1pdm/na/MW626056",
  "nextstrain/flu/h3n2/ha/CY163680",
  "nextstrain/flu/h3n2/ha/EPI1857216",
  "nextstrain/flu/h3n2/na/EPI1857215",
  "nextstrain/flu/vic/ha/KX058884",
  "nextstrain/flu/vic/na/CY073894",
  "nextstrain/flu/yam/ha/JN993010",
  "nextstrain/rsv/a/EPI_ISL_412866",
  "nextstrain/rsv/b/EPI_ISL_1653999",
  "nextstrain/mpox/all-clades",
  "nextstrain/mpox/clade-i",
  "nextstrain/mpox/clade-iib",
  "nextstrain/mpox/lineage-b.1",
  "nextstrain/orthoebolavirus/ebov",
  "nextstrain/measles/genome/WHO-2012",
  "nextstrain/measles/N450/WHO-2012",
  "nextstrain/dengue/all",
  "nextstrain/yellow-fever/prM-E",
  "nextstrain/hmpv/all-clades/NC_039199",
  "nextstrain/rubella/E1",
  "nextstrain/mumps/sh",
  "nextstrain/mumps/genome",
  "nextstrain/rubella/genome",
  "nextstrain/sars-cov-2/BA.2",
  "nextstrain/sars-cov-2/XBB",
  "community/isuvdl/mazeller/prrsv1/orf5/yimim2025",
  "community/isuvdl/mazeller/prrsv2/orf5/yimim2023",
  "community/neherlab/hiv-1/hxb2",
  "community/moncla-lab/iav-h5/ha/2.3.4.4",
  "community/moncla-lab/iav-h5/ha/all-clades",
  "community/moncla-lab/iav-h5/ha/2.3.2.1",
  "community/v-gen-lab/dengue/denv1",
  "community/v-gen-lab/dengue/denv2",
  "community/v-gen-lab/dengue/denv3",
  "community/v-gen-lab/dengue/denv4",
  "community/genspectrum/marburg/HK1980/all-lineages",
  "community/pathoplexus/cchfv/L",
  "community/pathoplexus/cchfv/S",
  "community/pathoplexus/cchfv/M",
  "community/v-gen-lab/chikV/genotypes"
]
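
For anyone who prefers to see the jq filter above spelled out in a general-purpose language, here is a minimal Python sketch of the same selection. It assumes the index has already been downloaded and parsed into a dict. The `example_index` below is a hand-made stand-in, not real index contents, and `nextstrain/example/no-tree` is a made-up dataset name used only to show the filtering.

```python
def datasets_with_trees(index):
    """Return the paths of all datasets that list a tree file in the index."""
    return [
        dataset["path"]
        for collection in index.get("collections", [])
        for dataset in collection.get("datasets", [])
        # Mirrors jq's select(.files.treeJson): keep only truthy values.
        if dataset.get("files", {}).get("treeJson")
    ]


# Hand-made stand-in for the parsed index.json (illustrative only).
example_index = {
    "collections": [
        {"datasets": [
            {"path": "nextstrain/mpox/clade-iib", "files": {"treeJson": "tree.json"}},
            {"path": "nextstrain/example/no-tree", "files": {}},
        ]},
        {"datasets": [
            {"path": "community/neherlab/hiv-1/hxb2", "files": {"treeJson": "tree.json"}},
        ]},
    ]
}

print(datasets_with_trees(example_index))
# ['nextstrain/mpox/clade-iib', 'community/neherlab/hiv-1/hxb2']
```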

We could serve the trees for these datasets under a URL like https://nextstrain.org/nextclade/….

The index also contains version information for each dataset, so we can make those available via nextstrain.org's @YYYY-MM-DD syntax.
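
One way the `@YYYY-MM-DD` lookup could work is sketched below in Python. It rests on two assumptions that are not confirmed above: that each version tag starts with an ISO date, so tags sort chronologically as strings, and that a dated request should resolve to the newest version on or before that date. The tags in the example are made up.

```python
from datetime import date


def resolve_version(version_tags, requested):
    """Return the newest tag dated on or before `requested`, or None.

    Assumes every tag begins with YYYY-MM-DD (an assumption, see above).
    """
    cutoff = date.fromisoformat(requested)
    candidates = [
        tag for tag in version_tags
        if date.fromisoformat(tag[:10]) <= cutoff
    ]
    # Tags sharing one timestamp format compare chronologically as strings.
    return max(candidates, default=None)


# Hypothetical tags, deliberately out of order.
tags = [
    "2023-01-09--13-00-00Z",
    "2023-02-01--12-00-00Z",
    "2022-11-15--12-00-00Z",
]

print(resolve_version(tags, "2023-01-01"))  # 2022-11-15--12-00-00Z
print(resolve_version(tags, "2022-01-01"))  # None
```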

For me, the outstanding questions (all relatively minor) are of naming:

  1. I assume we want to include the community/ datasets to not special-privilege ourselves here? I don't think it'll be confusing with the existing nextstrain.org/community/… concept.
  2. Do we want to include the leading nextstrain/ part of datasets or assume that's implied? Including it is better for predictability, but it does feel repetitive and unnecessary.
  3. Do we want to support aliases/short names ("shortcuts" in the index's terms), e.g. sars-cov-2 meaning nextstrain/sars-cov-2/wuhan-hu-1/orfs? If so, do we expand it (i.e. via redirect) or accept it as-is as simply a secondary name?
  4. Do we want to prefer the short names in some/most cases?

tsibley, Oct 15 '25 17:10

Some proposed answers to my questions, from my perspective:

  1. Include community/ datasets.
  2. Drop the leading nextstrain/ from dataset names, but accept it as an alias by redirecting (e.g. https://nextstrain.org/nextclade/nextstrain/mpox/clade-iib → https://nextstrain.org/nextclade/mpox/clade-iib).
  3. Accept the index's shortcut names, but expand them by redirection to the canonical name (e.g. https://nextstrain.org/nextclade/hMPXV → https://nextstrain.org/nextclade/mpox/clade-iib). Some shortcuts have _ in their name (e.g. flu_h1n1pdm_na); accept those both as-is and with s{_}{/}g applied (e.g. flu/h1n1pdm/na).
  4. Prefer full names (minus leading nextstrain/) as the canonical name
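
The naming rules in the list above can be sketched in Python as follows. This is only an illustration, not the nextstrain.org implementation: `build_aliases` and `canonical_name` are hypothetical helpers, and the target chosen for the `flu_h1n1pdm_na` shortcut is an assumption for the example (the `hMPXV` mapping is taken from the proposal above).

```python
PREFIX = "nextstrain/"


def build_aliases(shortcuts):
    """Accept each shortcut as-is and with "_" replaced by "/" (s{_}{/}g)."""
    aliases = {}
    for shortcut, full_name in shortcuts.items():
        aliases[shortcut] = full_name
        aliases[shortcut.replace("_", "/")] = full_name
    return aliases


def canonical_name(requested, aliases, datasets):
    """Return the canonical name for a requested name, or None if unknown."""
    # Expand a shortcut to its full dataset name, if it is one.
    full = aliases.get(requested, requested)
    # Full names may be given with or without the leading "nextstrain/".
    if full not in datasets and PREFIX + full in datasets:
        full = PREFIX + full
    if full not in datasets:
        return None
    # Canonical form drops the leading "nextstrain/"; community/ is kept.
    return full[len(PREFIX):] if full.startswith(PREFIX) else full


datasets = {
    "nextstrain/mpox/clade-iib",
    "nextstrain/flu/h1n1pdm/na/MW626056",
    "community/neherlab/hiv-1/hxb2",
}
aliases = build_aliases({
    "hMPXV": "nextstrain/mpox/clade-iib",
    # Assumed target, for illustration only.
    "flu_h1n1pdm_na": "nextstrain/flu/h1n1pdm/na/MW626056",
})

for name in ["hMPXV", "nextstrain/mpox/clade-iib", "flu/h1n1pdm/na"]:
    print(name, "->", canonical_name(name, aliases, datasets))
```

A request handler would redirect whenever the canonical name differs from the requested one and serve the dataset directly otherwise.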

tsibley, Oct 15 '25 18:10

That all sounds pretty good to me

jameshadfield, Oct 15 '25 21:10

Sounds good. Is 4 not a part of 3 (i.e. expansion is a preference for full names)? Either way, +1 for preferring full names since there can be multiple shortcuts associated with a dataset.

victorlin, Oct 15 '25 22:10

@victorlin Given the proposed answers, yes, 4 and 3 are two sides of the same coin.

tsibley, Oct 21 '25 17:10

I have this all working as proposed, with just previous versions left to expose.

I'm (still) wishing the existing versions handling was fully part of the Source/Resource/Subresource models rather than bolted onto the side of them, as this would fit the Nextclade use case better and be easier to extend cleanly.

In any case, I have a choice between:

  1. Extending the resource index generator (resourceIndex/main.js) to include the Nextclade trees (based on the Nextclade index).

    Upside: It works within the same framework as our core/staging versions.

    Downside: It'll be artificially/unnecessarily laggy for new versions, by about 25h at worst.

  2. Extending the resource index loader (src/resourceIndex.js) to include the Nextclade trees (based on the Nextclade index).

    Upside: The artificial/unnecessary lag is only ~1h at worst (but there's still unnecessary lag).

    Downside: It subverts the resource index framework a bit and maybe makes it less inspectable (e.g. you can't just look at the indexer file to see what's up).

I'm leaning towards (1) because it feels more like "playing nice" with the current design (even if I think the current design is lacking) but I feel bad about the lag.
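
As a back-of-the-envelope check on the two lag figures, here is a small Python sketch. The intervals are assumptions inferred from the ~25h and ~1h numbers above, not values stated in this thread: the index generator is taken to run about once a day, and the server is taken to reload the index about once an hour.

```python
INDEX_REBUILD_HOURS = 24  # assumed: the generator runs about once a day
INDEX_RELOAD_HOURS = 1    # assumed: the server reloads the index about hourly


def worst_case_lag_hours(option):
    """Worst-case delay before a new Nextclade version becomes visible."""
    if option == 1:
        # The version lands just after a rebuild, and the rebuilt index
        # then lands just after a reload, so both waits add up.
        return INDEX_REBUILD_HOURS + INDEX_RELOAD_HOURS
    if option == 2:
        # The loader reads the Nextclade index itself, so only the
        # reload interval applies.
        return INDEX_RELOAD_HOURS
    raise ValueError(f"unknown option: {option}")


print(worst_case_lag_hours(1))  # 25
print(worst_case_lag_hours(2))  # 1
```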

Do folks have their own thoughts/preferences here? I'd appreciate others weighing in.

There's a third option I briefly considered but discarded because it feels too much like not "playing nice":

  3. Creating a parallel resource version framework more in line with my design thinking re: Source/Resource/Subresource models and then making little adapters so it works with the existing resource version listing stuff for now.

    Upside: Shows a path towards what I think is a better design for the future.

    Downside: Introduces another way of doing things without concrete plans (since I'm leaving) to continue down that path.

tsibley, Oct 22 '25 18:10

Option 1 seems fine. Is there lag with option 3?

In the event that we consider option 3 in the future, do you think option 1 is a big leap in the wrong direction, or just a small leap given current usage of the models?

victorlin, Oct 22 '25 20:10

👍

No lag with option 3.

I don't think option 1 itself solidifies the current direction much, beyond providing a second example of working with it and not providing an example of a different (more suitable, IMO) approach.

tsibley, Oct 22 '25 20:10

  3. Creating a parallel resource version framework more in line with my design thinking re: Source/Resource/Subresource models

I described my thinking here a bit more in https://github.com/nextstrain/nextstrain.org/issues/1251.

tsibley, Nov 05 '25 22:11