The future of pangeo CMIP6 in the cloud

Open jbusecke opened this issue 2 years ago • 10 comments

I would like to start a high level discussion about the priorities and organization of the pangeo CMIP6 archive in the cloud.

I have officially? taken over this effort, and would first and foremost like to thank @naomi-henderson for her tireless work in getting this effort off the ground! It is amazing what is already possible with all this data.

Overall I would like to discuss:

  • The generation and organization of zarr stores and the user-facing catalog
  • How to provide more efficient/automated communication with the growing user base.

Data generation/organization

I think the central model of organization works well. We store ‘datasets’ (time concatenated) as zarr stores in a cloud bucket, and index them in a csv catalog based on the unique combination of “facets”. What we should discuss here is how to populate the cloud bucket. This is closely related to the question of how users can request additional datasets. Previously this was managed using a request form and manual creation/upload of the datasets. I hope we can eventually fully automate this, but I am aware that there might be a transition phase. Ultimately I would like to be able to build a pangeo-forge recipe from a dataset_id string

"mip_era.activity_id.institution_id.source_id.experiment_id.variant_label.table_id.variable_id.grid_label.version"

(exact facets used pending an answer here), which would build and upload the final zarr store entirely in the cloud (cc @cisaacstern). But since this is likely a more involved undertaking, what do folks think about intermediate solutions?
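To make the dataset_id idea concrete, here is a minimal sketch of splitting such a string into named facets and joining it back together. The facet order is copied from the template above and may still change; the function names `parse_dataset_id` and `build_dataset_id` and the example id are illustrative assumptions, not part of any existing pangeo or pangeo-forge API.

```python
from typing import Dict

# Facet order taken from the dataset_id template in this issue.
# The exact facets are still under discussion, so this tuple is an assumption.
FACETS = (
    "mip_era",
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "variant_label",
    "table_id",
    "variable_id",
    "grid_label",
    "version",
)


def parse_dataset_id(dataset_id: str) -> Dict[str, str]:
    """Split a dot-separated dataset_id into a facet-name -> value mapping."""
    parts = dataset_id.split(".")
    if len(parts) != len(FACETS):
        raise ValueError(
            f"expected {len(FACETS)} facets, got {len(parts)}: {dataset_id!r}"
        )
    return dict(zip(FACETS, parts))


def build_dataset_id(facets: Dict[str, str]) -> str:
    """Join facet values back into a dataset_id, in the FACETS order."""
    return ".".join(facets[name] for name in FACETS)


# Illustrative id only; it is not guaranteed to exist in the cloud bucket.
example_id = "CMIP6.CMIP.NOAA-GFDL.GFDL-ESM4.historical.r1i1p1f1.Omon.tos.gn.v20190726"
example_facets = parse_dataset_id(example_id)
```

A recipe builder could take the resulting dictionary as its only input, which would keep the request interface down to a single string per dataset.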

Related: As suggested here we could consider augmenting the existing global variables with other attributes that might make things easier in the long run.

Filtering retracted datasets

The basic idea (established by @naomi-henderson) is that we maintain a “raw” catalog with every store ever created, and then filter it to produce a user-facing “main” catalog, which only includes valid datasets and only the newest versions. For a more in-depth discussion see https://github.com/pangeo-data/pangeo-cmip6-cloud/issues/30.
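The raw-to-main filtering step could look roughly like the sketch below. It assumes that each raw catalog entry is identified by a dataset_id string whose last facet is a version of the form `vYYYYMMDD`, and that retractions are available as a plain set of ids. Both are assumptions for illustration, as is the choice to drop retracted entries first and then keep the newest remaining version. The function name `filter_main_catalog` and all ids are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def version_number(version: str) -> int:
    """Turn a version label such as 'v20190726' into a comparable integer."""
    return int(version.lstrip("v"))


def filter_main_catalog(raw_ids: Iterable[str], retracted: Set[str]) -> List[str]:
    """Drop retracted ids, then keep only the newest version of each dataset."""
    newest: Dict[str, str] = {}  # id without version -> newest valid version
    for dataset_id in raw_ids:
        if dataset_id in retracted:
            continue  # retracted stores stay in the raw catalog only
        stem, version = dataset_id.rsplit(".", 1)
        if stem not in newest or version_number(version) > version_number(newest[stem]):
            newest[stem] = version
    return [f"{stem}.{version}" for stem, version in newest.items()]


# Hypothetical ids, shortened to two leading facets for readability.
raw_catalog = [
    "modelA.tos.v20180701",
    "modelA.tos.v20190726",
    "modelB.tos.v20190101",
    "modelB.tos.v20200301",
]
retracted_ids = {"modelA.tos.v20190726"}
main_catalog = filter_main_catalog(raw_catalog, retracted_ids)
```

With this ordering, a dataset whose newest version was retracted falls back to its latest valid version. Whether that fallback is desirable, or whether such datasets should disappear from the main catalog entirely, is a policy question for the linked discussion.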

User communication and derived datasets

I think it is very important that we establish better visibility of “what is going on” and some way to inform users of new developments. I really want to minimize the interactions via email! Some ideas I had:

  • Automated twitter bot for the newest datasets
  • An automatically updated “what's new” section in this repo's docs. Ideally this would also include a web interface with search for all the datasets that have already been uploaded/requested and their status (e.g. “has problems”, “in queue”). If we build this, it might actually be possible to index ALL esgf datasets and then request them with the click of a button if they are not yet in the cloud (one can dream, right?).

Any other ideas from ppl here?

Another issue that I (and seemingly other users) care about a lot is the issue of derived datasets. I have been prototyping some ways to generate derived datasets but am not quite sure what the best way is for a broader audience to contribute datasets. You can find more discussion .

cc @rabernat

jbusecke, Jan 28 '22 21:01