
Routinely collect shared datasets & narratives

jameshadfield opened this issue · 3 comments

Problem summary:

Datasets & narratives can be visualised in Nextstrain through various mechanisms: core, staging, groups (public & private) and community. In order to achieve #305 we need to produce a file listing these datasets & narratives which the client can fetch. For private groups, the client must be authenticated to view these.

Prior art

We currently have various ways of collecting these data:

  • The dataset behind https://nextstrain.org/influenza is regenerated every 5 minutes (!) via a GitHub Action in the nextstrain/nextstrain.org repo which iterates over the objects in our main S3 bucket.
  • There are scripts available which use the GitHub API to search all public repos for datasets which match the required syntax for community sharing functionality.
  • The dataset behind nextstrain.org/sars-cov-2 is stored in the GitHub repo and is modified via PRs.
  • Proof-of-principle script & UI to display all known public datasets: https://github.com/nextstrain/nextstrain.org/pull/303. The JSON is 22 kB (gzipped). Scanning all public S3 buckets takes c. 20 seconds, and this time isn't blocking. (A minimal sketch of this kind of scan follows this list.)
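For illustration, a minimal sketch of this kind of bucket scan, assuming the aws-sdk (v2) package and AWS credentials in the environment; the bucket name and the `.json` filter are illustrative, not what the existing scripts actually do:

```js
// Minimal sketch: list every *.json key in one bucket and record basic metadata.
const AWS = require("aws-sdk");
const s3 = new AWS.S3();

async function listDatasets(bucket) {
  const datasets = [];
  let ContinuationToken;
  do {
    // Paginate through the bucket 1000 keys at a time.
    const page = await s3.listObjectsV2({ Bucket: bucket, ContinuationToken }).promise();
    for (const obj of page.Contents || []) {
      if (obj.Key.endsWith(".json")) {
        datasets.push({ filename: obj.Key, source: bucket, uploaded: obj.LastModified });
      }
    }
    ContinuationToken = page.IsTruncated ? page.NextContinuationToken : undefined;
  } while (ContinuationToken);
  return datasets;
}

listDatasets("nextstrain-data").then((d) => console.log(JSON.stringify(d, null, 2)));
```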

Augmented data

The above examples limit the (meta)data we know about each dataset/narrative to the filename, source, and date the file was uploaded. By reading the file itself we can gather further information such as maintainer/author, number of samples in the dataset, date updated, etc. That work is treated as an extension of the task described by this issue, which does not itself propose reading the file contents.
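As a rough idea of what such augmentation could look like, a sketch that fetches one dataset JSON and pulls extra metadata out of it; the field names (meta.updated, meta.maintainers) follow the v2 dataset schema as I understand it, and the tip-counting helper is illustrative:

```js
// Fetch a dataset JSON and extract augmented metadata. Uses the global fetch
// available in Node 18+ (or a fetch polyfill such as node-fetch).
async function augment(datasetUrl) {
  const dataset = await (await fetch(datasetUrl)).json();
  return {
    updated: dataset.meta?.updated,
    maintainers: (dataset.meta?.maintainers || []).map((m) => m.name),
    sampleCount: dataset.tree ? countTips(dataset.tree) : undefined,
  };
}

// Count terminal nodes (samples) in the tree, if present.
function countTips(node) {
  if (!node.children || node.children.length === 0) return 1;
  return node.children.reduce((n, child) => n + countTips(child), 0);
}
```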

Data freshness

How frequently we update the database (flat file) here is important, but not crucially so. My gut feeling is that regenerating this file every 12 hours is good enough for the medium-term future. (Important to note that a dataset/narrative can be visualised independently of the work in this issue, assuming one knows the URL!)

Dynamically generating the database server-side each time a client requests the data (i.e. the server calls the S3 / GitHub APIs) is not recommended, as this would unnecessarily increase the work required server-side. Subsetting a cached file based on user authz per client request is probably fine.

For sources where we control the S3 bucket (all sources except community), it would be possible to regenerate these files whenever the underlying bucket is updated by using S3 event notifications.
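A rough sketch of what that wiring could look like via the SDK; the bucket name and Lambda ARN are placeholders, and the listing-regeneration Lambda itself is hypothetical:

```js
// Rough sketch: ask S3 to invoke a (hypothetical) listing-regeneration Lambda
// whenever objects are created or removed in the bucket. Assumes aws-sdk v2.
const AWS = require("aws-sdk");
const s3 = new AWS.S3();

s3.putBucketNotificationConfiguration({
  Bucket: "nextstrain-data",
  NotificationConfiguration: {
    LambdaFunctionConfigurations: [{
      LambdaFunctionArn: "arn:aws:lambda:us-east-1:123456789012:function:regenerate-listing",
      Events: ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
    }],
  },
}).promise()
  .then(() => console.log("Bucket notifications configured"))
  .catch(console.error);
```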

Note that if requests for this data are made to our server, and the flat-file is generated elsewhere, we have to ensure the server updates its cache appropriately.

Proposed method(s)

The simplest method would be to modify the currently available scripts (see “Prior art”) to scan all of our (public) S3 buckets and upload the result as a flat file to an S3 bucket (probably nextstrain-data). This script could be run via GitHub Actions once a day. The client could fetch this directly, thus avoiding cache-invalidation concerns (S3 serves requests with max-age=0). The community data could be generated similarly.
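A sketch of the publish step under those assumptions; the key name is illustrative, and the once-a-day trigger would be a cron schedule in a GitHub Actions workflow rather than anything in this snippet:

```js
// Publish the collected listing as a flat file the client can fetch directly.
// Assumes aws-sdk v2 and credentials in the environment.
const AWS = require("aws-sdk");
const s3 = new AWS.S3();

async function publishListing(datasets) {
  await s3.putObject({
    Bucket: "nextstrain-data",
    Key: "all-datasets.json",  // illustrative key
    Body: JSON.stringify({ generated: new Date().toISOString(), datasets }),
    ContentType: "application/json",
    CacheControl: "max-age=0",  // mirror the behaviour noted above
  }).promise();
}
```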

When considering private (groups) datasets, a different solution is needed. If we are able to run the above scripts on the nextstrain.org server periodically (e.g. every 24 hours / on each deployment) then we could take a similar approach (as the server can access all private groups), storing the results in memory on the server, with client requests going to nextstrain.org. Queries would have to take the user's data into account to restrict the returned results appropriately. If we go this way, it would probably be easier to compute the core, staging and public groups data server-side as well (but not community data, as that takes a while). @tsibley what are your recommendations here?
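Roughly, the per-request filtering could look something like this Express-style sketch; the cache shape, route path and the user.visibleGroups property are illustrative, not the server's actual API:

```js
const express = require("express");
const app = express();

// Regenerated periodically (e.g. every 24 hours / on each deployment).
let cachedListing = [];  // e.g. [{filename, source, group, private}, ...]

// Hypothetical route; the real session middleware would populate req.user.
app.get("/all-available", (req, res) => {
  const user = req.user;
  const visible = cachedListing.filter((d) =>
    !d.private || (user && user.visibleGroups.includes(d.group))
  );
  res.json({ datasets: visible });
});

app.listen(3000);
```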

Note that each page (see #305) defines its own address for where to fetch the database from, so it’s not problematic to serve some through nextstrain.org and some from S3.

jameshadfield · Apr 12 '21

Moving a conversation on this topic from slack over to here:

@tsibley said:

If we want to include private groups, then each user will have to make a request to a dynamic listing instead of a single shared listing. The dynamic listing could be used in combination with a shared, public listing.

Authn will flow naturally from the user's session cookie.

Based on that feedback, my suggestion was:

Does this mean I could try to implement something now that excludes private groups, using a public listing (which combines the public groups' S3 bucket listings into one), and then later we could incorporate (in combination with the former) user requests and authentication for private groups using a dynamic listing?

If so, does that seem reasonable, or do you think that approach has some disadvantage compared to trying to include private groups from the start?

@tsibley said this is a reasonable option.

So as we start implementing pages that will depend on this public, static listing (e.g. /groups, /core, /all), we can provision it as @jameshadfield suggests:

scan all of our (public) S3 buckets and upload the result as a flat file to an S3 bucket (probably nextstrain-data). This script could be run via GitHub Actions once a day.

aka something like https://github.com/nextstrain/nextstrain.org/blob/590e63c202838e9e61d9971bbcb66fd00c14a649/scripts/tmp-collect-all.js

Then next steps might include:

  • incorporating authentication / authorized requests for private datasets from a dynamic listing, on top of / in combination with the static listing for public datasets (see the sketch after this list).
  • allowing for smarter provisioning of the static public listing (not doing unnecessary work when no new datasets exist, and directly triggering the addition of new datasets as they become available).
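For illustration, merging the two listings client-side might look roughly like the following; both URLs are illustrative placeholders, not existing endpoints:

```js
// Client-side (browser) sketch of the combined approach: merge the static
// public listing (flat file on S3) with an authenticated, dynamic listing of
// private group datasets served by nextstrain.org.
async function fetchAllDatasets() {
  const publicListing = await (await fetch("https://data.nextstrain.org/all-datasets.json")).json();

  let privateListing = { datasets: [] };
  try {
    // The session cookie provides authn; the server filters by the user's groups.
    const res = await fetch("/all-available", { credentials: "include" });
    if (res.ok) privateListing = await res.json();
  } catch (err) {
    // Not logged in / request failed: fall back to the public listing only.
  }

  return [...publicListing.datasets, ...privateListing.datasets];
}
```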

@jameshadfield @tsibley does this make sense? Am I missing anything?

eharkins · May 05 '21

@eharkins That all makes sense. I think it's a fine initial foray in the direction we want to go!

On the face of it, S3 inventories might be a useful thing to base this on instead of calling ListObjects ourselves, but the downsides to that S3 feature (daily only; can't trigger on demand; scatters parts of the machinery into different places, instead of consolidated into one program) probably make it a net negative choice.

tsibley · May 12 '21

Thanks @tsibley. Worth noting that for /groups, i.e. #316, we ended up forgoing a static listing for now by implementing a metasource to collect group Sources, i.e. #324. This just means that the only pages which might initially depend on a static listing will be those which include community or anything beyond core and groups datasets.

eharkins · May 25 '21