cmip6-cmor-tables

Adding a license file to this repo for conda-forge releases?

Open chengzhuzhang opened this issue 2 years ago • 36 comments

@durack1 @mauzey1 Hey Paul and Chris, the E3SM data publication team is working on a solution to package the external data needed for cmorizing E3SM data and release it as an independent data package. This would significantly streamline our data publication workflow and make it more portable. To do so, we will need a license file to be included in the [cmip6-cmor-tables](https://github.com/PCMDI/cmip6-cmor-tables) repo. A detailed discussion can be found here: https://github.com/E3SM-Project/e3sm_to_cmip/issues/124. Would you please consider adding a license? Thanks a lot!

chengzhuzhang avatar Feb 08 '22 18:02 chengzhuzhang

@chengzhuzhang @xylar interesting question. In principle I have no problem adding a license. One question: which license flavour would make it easiest?

One reluctance I have in proceeding: while wrapping this repo into a conda package and automating its download will aid automated use, it further hides from the user the requirement to register/update model/institution info. It's not quite a set-and-forget process.

durack1 avatar Feb 08 '22 19:02 durack1

Hey Paul, thanks for chiming in. Since both the CMIP6_CVs and cmip6-cmor-tables repos are open source on GitHub, perhaps there isn't much concern that more users would skip the WCRP registration requirement because of a conda package? The conda package does add one more entry point to these tables, though.

As a workaround: it seems like https://github.com/WCRP-CMIP/CMIP6_CVs and cmip6-cmor-tables share the same set of JSON files? The former has a license already. Maybe we should use CMIP6_CVs instead? (And I'm curious what the difference between the two is.) Thank you.

chengzhuzhang avatar Feb 08 '22 20:02 chengzhuzhang

Hi Jill, the two repos are separate. WCRP-CMIP/CMIP6_CVs is the controlled vocabulary (and registration) repo for CMIP6, where institutions, models, experiments, MIPs, etc are defined. A subset of the registered information is pulled into the PCMDI/cmip6-cmor-tables/Tables/CMIP6_CV.json file, which allows a modeling group to use CMOR with minimum configuration, as the registered information is available for use by the software, packaged in the cmip6-cmor-tables.

The WCRP-CMIP/CMIP6_CVs license is a default file license template for use in the netCDF global attribute license field, so it isn't the kind of license that is required to create a conda-forge package (@xylar can correct me here if I'm wrong).

durack1 avatar Feb 08 '22 22:02 durack1

> The WCRP-CMIP/CMIP6_CVs license is a default file license template for use in the netCDF global attribute license field, so it isn't the kind of license that is required to create a conda-forge package (@xylar can correct me here if I'm wrong).

I think that's correct. We need a license file that describes limitations (if any) on packaging and redistributing the contents of the repository. The license linked to on https://github.com/WCRP-CMIP/CMIP6_CVs is not that kind of license. It is a license the software adds to contents it produces.

xylar avatar Feb 08 '22 23:02 xylar

> Hi Jill, the two repos are separate. WCRP-CMIP/CMIP6_CVs is the controlled vocabulary (and registration) repo for CMIP6, where institutions, models, experiments, MIPs, etc are defined. A subset of the registered information is pulled into the PCMDI/cmip6-cmor-tables/Tables/CMIP6_CV.json file, which allows a modeling group to use CMOR with minimum configuration, as the registered information is available for use by the software, packaged in the cmip6-cmor-tables.
>
> The WCRP-CMIP/CMIP6_CVs license is a default file license template for use in the netCDF global attribute license field, so it isn't the kind of license that is required to create a conda-forge package (@xylar can correct me here if I'm wrong).

Hey Paul, thank you for the clarification. It seems one goal of this repo is to be used together with CMOR so that modeling groups can create CMIP-compliant files, so I think it makes sense to have a conda release of this repository? The goal on the E3SM side is to be able to port the cmorization process easily to supported machines, and maintaining a conda package of this repo is a better approach than manually cloning and updating it on different platforms. @xylar and I will help maintain the conda package if a license can be added. Thanks for your consideration!

chengzhuzhang avatar Feb 15 '22 20:02 chengzhuzhang

@chengzhuzhang in principle I have no problem with this; however, we'd need to know which license makes using a conda-forge archive easiest.

@mauzey1 do you see any problems with this?

durack1 avatar Feb 15 '22 23:02 durack1

Thank you @durack1! According to general guidelines for open source software, I'm thinking an Apache 2.0, BSD or MIT license would be appropriate (I'm reading this page: https://software.llnl.gov/about/licenses/). @xylar, would you agree that any of these license types should be sufficient?

chengzhuzhang avatar Feb 15 '22 23:02 chengzhuzhang

@durack1 @chengzhuzhang You mean having the CMIP6 CMOR tables' JSON files stored in a conda-forge package? Like a package that would install the directory of tables in <anaconda_path>/envs/my_env/share/cmip6-cmor-tables?
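A minimal sketch of how downstream code might locate such a directory, assuming the package installs into `share/cmip6-cmor-tables` under the environment prefix as described above. The helper name `tables_dir` is hypothetical, and the layout is an assumption about the eventual recipe rather than something the package guarantees.

```python
import sys
from pathlib import Path


def tables_dir(prefix=None):
    """Return the assumed install location of the tables inside a conda env."""
    # sys.prefix points at the active environment,
    # e.g. <anaconda_path>/envs/my_env when my_env is activated.
    root = Path(prefix) if prefix is not None else Path(sys.prefix)
    return root / "share" / "cmip6-cmor-tables"
```

Passing `prefix` explicitly is only there to make the helper easy to check outside an activated environment.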

mauzey1 avatar Feb 16 '22 00:02 mauzey1

@mauzey1, yes, exactly. I have a recipe for doing that already here: https://github.com/xylar/staged-recipes/tree/add_cmip-cmor-tables/recipes/cmip6-cmor-tables

> I'm thinking an Apache 2.0, BSD or MIT license would be appropriate (I'm reading this page: https://software.llnl.gov/about/licenses/). @xylar, would you agree that any of these license types should be sufficient?

@chengzhuzhang, those are definitely common licenses on conda-forge so they would work for me.

xylar avatar Feb 16 '22 00:02 xylar

@xylar Okay, then I agree with creating a conda-forge package for cmip6-cmor-tables.

mauzey1 avatar Feb 16 '22 00:02 mauzey1

I can certainly see the utility of having a conda package for the tables, but when it comes to versioning these tables please have a good think about the labelling of the package and how that updates.

For Met Office work I have a set of versions that we use for CMIP6 production (01.00.29, 01.00.31, 01.00.32), but we update the CMIP6_CV.json file within these directories with the latest one when something important changes (I recall doing this for fixes to experiment parent details and the introduction of a new model). You could choose a composite version number to represent the inputs (data request version + CVs version (+ CMOR version?)). In some ways it would be good to separate the MIP tables from the CVs, but this would likely be a bit disruptive.
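To make the composite idea concrete, here is a small sketch. The function name, the `dreq`/`cv`/`cmor` prefixes and the underscore separator are all hypothetical choices for illustration, and the resulting label has not been checked against conda's version-string rules.

```python
def composite_version(data_request, cvs, cmor=None):
    """Join the input versions into a single label (illustrative format only)."""
    parts = [f"dreq{data_request}", f"cv{cvs}"]
    # The CMOR version is optional, mirroring the "(+ CMOR version?)" above.
    if cmor is not None:
        parts.append(f"cmor{cmor}")
    return "_".join(parts)
```

With placeholder inputs, `composite_version("01.00.32", "A.B.C")` gives `dreq01.00.32_cvA.B.C`; a real scheme would need agreement on which component drives package updates.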

One other thought; how often would a new conda package be published? Updates to the CVs still occur reasonably regularly with new models being added, although a bit of automation should cover this fairly easily.

matthew-mizielinski avatar Feb 16 '22 13:02 matthew-mizielinski

@matthew-mizielinski, conda-forge can easily accommodate versioning like you suggest.

A bot can automatically create a new package each time there is a new version update as long as the recipe doesn't need to change (other than the version and sha256 hash of the files, which the bot updates). Maintenance on that side should be a piece of cake.
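For anyone who wants to check a release archive by hand, the sha256 value in a recipe can be reproduced with the standard library. This is only a sketch of the hashing step; the helper name `sha256_of` is hypothetical and it is not how the bot itself is implemented.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```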

xylar avatar Feb 16 '22 14:02 xylar

@xylar @matthew-mizielinski thanks for getting into the weeds with this. We do capture the CV, DREQ and CMOR versions in the release comment (see here), but importantly, a new version is not released when the CVs change (which happens relatively frequently), so unless we changed that process, the conda package would never update with the latest CVs - requiring a manual step by a user, not ideal

durack1 avatar Feb 16 '22 18:02 durack1

How important would the CVs be for the tools that would typically use the proposed conda-forge package? It isn't necessarily a problem if that file gets updated infrequently, as long as that is clear to users of the conda-forge package.

xylar avatar Feb 16 '22 19:02 xylar

The path that a new modeling group is meant to follow to use CMOR:

  1. register their institution and model in the CVs (institution_id and source_id)
  2. this information is then propagated across to the cmip6-cmor-tables/Tables/CMIP6_CV.json file, and then a user can just select their information from the preconfigured/registered info to write files (in addition to configuring input files for CMOR use)
  3. as the information is registered in the CMIP6_CVs this then triggers downstream support, ESGF publisher, citation, ES-DOCs etc in a consistent way

So to aid a user, having the most up-to-date CMIP6_CV.json file would certainly be a necessity, otherwise hand-spun edits will be required which may break consistency with the registered information that other software expects
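As a rough illustration of why a stale file matters, a workflow could check up front that its IDs are present before calling CMOR. This sketch assumes CMIP6_CV.json has a top-level `CV` key holding one mapping per vocabulary (including `institution_id` and `source_id`); the helper name and the IDs in the usage note are hypothetical.

```python
def is_registered(cv, institution_id, source_id):
    """Check that both IDs appear in an already-loaded CMIP6_CV.json structure."""
    # Assumption: {"CV": {"institution_id": {...}, "source_id": {...}, ...}}
    vocab = cv.get("CV", {})
    return (
        institution_id in vocab.get("institution_id", {})
        and source_id in vocab.get("source_id", {})
    )
```

In use, `cv` would come from `json.load` on the Tables/CMIP6_CV.json file. A `False` result for, say, `is_registered(cv, "MY-INST", "MY-MODEL-1-0")` would point to either a missing registration or an out-of-date copy of the file.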

durack1 avatar Feb 16 '22 19:02 durack1

Okay, that's good to know. Why not do a release every time the CMIP6_CV.json file gets updated?

xylar avatar Feb 16 '22 19:02 xylar

There is no way to do a conda-forge package from anything other than a release and it isn't a good idea to overwrite or edit files in a conda environment, since they might unexpectedly get overwritten by a later update, etc.

xylar avatar Feb 16 '22 19:02 xylar

@xylar there has been no need to (up until now), and our versioning tag doesn't account for the CV version, however, the tag/release comment does

durack1 avatar Feb 16 '22 19:02 durack1

Okay, I'll leave this to the rest of you to discuss. I know for e3sm_to_cmip, the current situation of each user cloning this repo is not working well. I'm happy to help with conda-forge packaging if that ends up being practical. But I don't understand enough of the subtleties or who the end users might be to weigh in on those details.

xylar avatar Feb 16 '22 19:02 xylar

@xylar a practical question from me, do you expect folks to run conda update on their env every time they go to use it?

durack1 avatar Feb 16 '22 19:02 durack1

That would be a question for @chengzhuzhang. It sounds like it might be worth including a conda update as part of the e3sm_to_cmip workflow to make sure the latest version of cmip6-cmor-tables is being used.

xylar avatar Feb 16 '22 19:02 xylar

Hmmm... conda update would only help if the package is released frequently enough (i.e., each time user-facing features are updated?). Now I've learned more about the use of this repo. I understand that for an existing registered modeling center, most changes to CMIP6_CV.json have no real impact unless its own registered information is updated.

chengzhuzhang avatar Feb 16 '22 19:02 chengzhuzhang

@chengzhuzhang yeah exactly, so when we updated E3SM1-0 to include the UCI institution_id, if you had cloned the repo after this was merged (and pulled across into the cmip6-cmor-tables repo) then you'd have no further tweaks to apply, assuming that you're also using the latest ESGF publisher which may validate your entries

durack1 avatar Feb 16 '22 19:02 durack1

Thanks for all the clarifications @durack1. Given this, I think that without changing the current release plan, a conda package may not be the best or most practical approach...

@xylar, I think there is another possibility: managing the data in this repo in a similar fashion to how we manage our analysis datasets. This is a less automatic approach, but we would perhaps have better control over the schedule for updating these data files. We can talk more offline about this possibility.

chengzhuzhang avatar Feb 16 '22 22:02 chengzhuzhang

@chengzhuzhang, let's pause for a second and let me chat with @taylor13, @matthew-mizielinski and @mauzey1 to figure out if a conda package makes sense. To be honest, I like the ease of use this would enable, however, there are questions remaining in how we'd make sure things are kept up to date without causing more problems (and work) than it solves

durack1 avatar Feb 16 '22 22:02 durack1

@chengzhuzhang, another feasible approach would be to have e3sm_to_cmip clone this GitHub repo into some predefined scratch space (specific to a user, rather than shared) as the first step of running. That way, you would always have the latest version without the need for a release.
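A minimal sketch of that first step, assuming `git` is on the PATH and the scratch directory is writable. The function names are hypothetical and nothing like this exists in e3sm_to_cmip as far as this thread shows.

```python
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/PCMDI/cmip6-cmor-tables.git"


def sync_command(scratch_dir):
    """Build the git command that brings the per-user scratch copy up to date."""
    target = Path(scratch_dir) / "cmip6-cmor-tables"
    if (target / ".git").is_dir():
        # Existing clone: fast-forward only, so git stops rather than merging
        # if the local copy has diverged.
        return ["git", "-C", str(target), "pull", "--ff-only"]
    # First run: a shallow clone is enough when only the latest tables matter.
    return ["git", "clone", "--depth", "1", REPO_URL, str(target)]


def sync_tables(scratch_dir):
    """Run the command; raises CalledProcessError if git fails."""
    subprocess.run(sync_command(scratch_dir), check=True)
    return Path(scratch_dir) / "cmip6-cmor-tables"
```

Keeping the command construction separate from the `subprocess.run` call makes the clone-or-pull decision easy to check without network access.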

It sounds like my approach or yours is more feasible than a conda-forge package.

Thank you everyone for the discussion.

xylar avatar Feb 16 '22 23:02 xylar

@xylar either way, adding a license is a trivial step, would CC BY-SA 4.0 cause you any issues in conda-forge land?

durack1 avatar Feb 16 '22 23:02 durack1

@durack1, CC BY-SA 4.0 should work just fine. I was able to find several existing packages with that license.

xylar avatar Feb 16 '22 23:02 xylar

@xylar, just a thought, but I use a python class for something similar; cartopy.io.Downloader

@durack1, it wouldn't be too hard to write an api to retrieve tables and CVs with something like this downloader class. That would be something that could go to conda, and a simple function call like cmor_tables_location(table_version='01.00.33', cv_version='latest', destination='<somewhere>') would be able to provide access to any version of the tables.

matthew-mizielinski avatar Feb 17 '22 10:02 matthew-mizielinski

@matthew-mizielinski, so I've given the downloader some thought. I think what would work well for our software (e3sm_to_cmip) is to have the conda package for most files in cmip6-cmor-tables. This is preferable to a downloader in that all we have to do is make a specific version (or a constrained version) of the package a dependency of our software, and the downloading happens automatically without any extra steps. But a simple download command for cmip6-cmor-tables/Tables/CMIP6_CV.json at the beginning of our process is a really good idea, so we don't rely on the (potentially outdated) version from the package. Since that's a single file, it wouldn't require anything fancy like its own downloader; we could just use requests.
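A sketch of what that single-file download might look like. To keep it dependency-free it uses the standard library's `urllib.request` rather than requests; the raw URL assumes the default branch is named `main`, and the function name and destination handling are hypothetical.

```python
import json
import urllib.request
from pathlib import Path

# Assumption: the default branch is "main"; adjust the URL if it differs.
CV_URL = (
    "https://raw.githubusercontent.com/PCMDI/cmip6-cmor-tables/"
    "main/Tables/CMIP6_CV.json"
)


def refresh_cv(destination, url=CV_URL, timeout=30):
    """Download CMIP6_CV.json to `destination`, writing it only if it parses."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = response.read()
    # Fail before touching the existing file if the response is not valid JSON.
    json.loads(payload)
    destination = Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_bytes(payload)
    return destination
```

Given the earlier point about not editing files inside a conda environment, `destination` would most likely be a working directory outside the environment, with the workflow pointed at that copy.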

xylar avatar Feb 19 '22 02:02 xylar