Various basic OCaml packages don't have a license field
I maintain a project called "Coq Platform", which is essentially a set of opam packages. I would like to pull the license information for the ReadMe from opam. Almost all packages have a license field. Notable exceptions are various OCaml core packages, as the following output shows (packages with a license field stripped):
Coq$ opam list --columns=name,version,license:
# Packages matching: installed
# Name # Version # License
base-bigarray base
base-threads base
base-unix base
depext transition
ocaml 4.07.1
ocaml-base-compiler 4.07.1
ocaml-config 1
ocaml-secondary-compiler 4.08.1-1
ocamlfind 1.8.1
ocamlfind-secondary 1.8.1
seq base
Would it be possible to fix this, also for older compilers (for Windows compatibility reasons I am currently stuck with 4.07.1)?
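For anyone who wants to reproduce the list above automatically, here is a minimal sketch that filters the text output of `opam list --columns=name,version,license:` for rows with an empty license column. The function name, the sample rows, and the assumption that columns are whitespace-separated with the license last are illustrative; they are not taken from opam itself.

```python
def packages_without_license(listing):
    """Return (name, version) pairs whose license column is empty.

    Assumes whitespace-separated columns in the order name, version, license,
    and header lines starting with '#', as in the output quoted above.
    """
    missing = []
    for line in listing.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '# ...' header lines
        # Split at most twice: the license itself may contain spaces.
        fields = line.split(None, 2)
        if len(fields) == 2:
            missing.append((fields[0], fields[1]))
    return missing


# Illustrative input: the first two rows mirror the output above,
# the last row is a hypothetical package that does declare a license.
SAMPLE = """\
# Packages matching: installed
# Name # Version # License
base-unix base
ocaml 4.07.1
some-pkg 1.0 MIT
"""

if __name__ == "__main__":
    for name, version in packages_without_license(SAMPLE):
        print(name, version)
```

In practice the listing could be captured with `subprocess.run(["opam", "list", "--columns=name,version,license:"], capture_output=True, text=True).stdout` rather than a hard-coded sample.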
In a private communication with @xavierleroy he suggested that opam lint should flag missing license fields - provided we agree that opam packages should have a license field.
For complex projects like Coq Platform, a consistent license field would be of great help for the maintainers and users.
In a private communication with @xavierleroy he suggested that opam lint should flag missing license fields - provided we agree that opam packages should have a license field.
I raised this issue separately upstream in https://github.com/ocaml/opam/issues/4598
Thanks for bringing this up @MSoegtropIMC -- I've had this on my todo list for some years now, as establishing the license of a set of opam packages is important for almost any industrial use. The answer isn't quite as simple as just following the license field in the package, so here are my thoughts:
There are two uses for the licensing metadata:
- a "shallow" check (for example to ensure there is no AGPL code in a product intended to be closed source)
- a "deep" check to satisfy the needs of documentation and acknowledgements (e.g. the BSD advertising clauses and similar)
For the shallow check, the licensing fields are not checked closely by the opam-repository team when we merge PRs, and they also don't cover the case where a library is multilicensed because it incorporates some modules under another license (not common, but not zero either). In this case, you need to drop into the opam source to verify each package.
For the deep case, you need to drop into the source to get the LICENSE file(s) out of the distribution and copy those into your product distribution acknowledgements. For example Docker for Desktop does this in the acknowledgements pane. Luckily, most modern opam packages work with the topkg conventions and actually name their license files LICENSE.*.
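As a rough illustration of the "deep" case, the sketch below collects files named `LICENSE` or `LICENSE.*` from unpacked package sources and concatenates them into one acknowledgements text. This is not the Docker code mentioned above; the function names, the section header format, and the idea of passing a name-to-directory mapping are assumptions made for the example.

```python
from pathlib import Path


def find_license_files(source_root):
    """Return sorted paths of files named LICENSE or LICENSE.* under source_root."""
    root = Path(source_root)
    return sorted(
        p
        for p in root.rglob("LICENSE*")
        if p.is_file() and (p.name == "LICENSE" or p.name.startswith("LICENSE."))
    )


def build_acknowledgements(packages):
    """Concatenate license files into one text.

    `packages` maps a package name to the directory holding its unpacked
    sources (a hypothetical layout chosen for this sketch).
    """
    sections = []
    for name in sorted(packages):
        for path in find_license_files(packages[name]):
            text = path.read_text(encoding="utf-8", errors="replace")
            sections.append(f"== {name} ({path.name}) ==\n{text.strip()}\n")
    return "\n".join(sections)
```

Packages that name their license file differently (for example `COPYING`) would be missed, so a real tool would need a longer list of file names.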
All of this can be mechanised via an opam plugin to automate the:
- license fetching: this needs some heuristics to help lint the claimed license against what's in the sources -- for example, by checking that there are no GPL license fragments present in an opam package that claims to be BSD3. @samoht and I did this for Docker a few years ago and could use that as a guide for such an opam licensing plugin. @djs55, is this code from Docker for Desktop (to build the acknowledgements file) something that you'd consider open sourcing in a gist?
- license linting: this needs to operate against the opam package metadata and try to verify it against the source archive. This is something I'd really like to have in the opam-repo-ci. As a general rule of thumb, if an aspect of package metadata isn't checked either in CI or by a manual maintainer check, there will be some set of packages that have it wrong.
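To make the linting idea concrete, here is a minimal sketch of the kind of heuristic described above: it warns when a package that claims a permissive license contains text mentioning a GPL-family license. The marker pattern, the set of "permissive" identifiers, and the function name are hypothetical choices for this example; a real plugin would rely on a curated license database rather than one regular expression.

```python
import re

# Hypothetical marker: any mention of the (Affero/Lesser) GNU GPL by name.
COPYLEFT_MARKERS = {
    "GPL": re.compile(r"GNU (Affero |Lesser )?General Public License", re.IGNORECASE),
}

# Illustrative subset of permissive SPDX identifiers.
PERMISSIVE = {"BSD-2-Clause", "BSD-3-Clause", "MIT", "ISC"}


def lint_claimed_license(claimed, file_texts):
    """Return warnings for GPL-family fragments in a permissively licensed package.

    `claimed` is the license string from the package metadata;
    `file_texts` maps a file path to that file's contents.
    """
    warnings = []
    if claimed not in PERMISSIVE:
        return warnings  # this sketch only checks permissive claims
    for path, text in sorted(file_texts.items()):
        for family, pattern in COPYLEFT_MARKERS.items():
            if pattern.search(text):
                warnings.append(
                    f"{path}: mentions {family}-family text but package claims {claimed}"
                )
    return warnings
```

A match is only a hint for a human reviewer: a file may mention the GPL in documentation without being licensed under it, and a license name wrapped across two lines would not be detected by this pattern.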
Whether or not the Coq Platform distribution needs this level of licensing due diligence is down to how you plan to distribute it. Given that it is (as far as I know) all open source itself, you probably don't need to go as far as a commercial product does. On the other hand, it's very hard to correct this stuff once a package database grows, so this might be the best time to build some automation.
(I'm CCing @jonludlam as the lead odoc maintainer on this thread as well, since the same question has come up for generating the package information page for odoc/opam packages).
An opam plugin which could perform both the "shallow" and "deep" check would be great. For the deep check maybe it should output a structured list of obligations (that it recognises) like the advertising clause and possibly other things (patent grants?) The Docker Desktop code to extract licenses is a bit basic but open-source already: https://github.com/moby/vpnkit/tree/master/repo . It probably needs updating to opam 2 (!) (IIRC the in-tree repo metadata is still v1 and is auto-upgraded to build)
One thought I had is to register at least the important packages at (https://www.openhub.net). Many Coq packages are already registered there - not sure about OCaml packages, though. Then in opam we could have an openhub link and check whether their analysis agrees with what is in the license field. I don't know if it makes that much sense to redo the work they already did. E.g. afaik they also scan the sources for cut and paste from software with incompatible licenses.
This would also have the effect that the language share of OCaml is properly represented there (see the OpenHub language comparison).
That's a great idea, @MSoegtropIMC -- this is not analysis that is unique to OCaml, and it also covers the case of vendored C libraries and other such inclusions. I'll take a look into OpenHub next.
The OpenHub page for OCaml hasn't been updated in more than one year, and lists the license as MIT... The service worked well back in the days it was called Ohloh, but for several years it's been barely functional. I wouldn't expect any timely and accurate data from it.
@xavierleroy : hmm, that's true - scanning substantially less often than once a year won't work. Also, looking at their terms (https://community.synopsys.com/s/article/Black-Duck-Open-Hub-Terms-of-Use), I don't think this usage would be compliant with their terms anyway - one would definitely have to ask them upfront.
Is anyone aware of a similar service which is open source? (Open Hub doesn't seem to be, despite its name.)
This might work: (https://github.com/scanoss/engine)
I asked around and also got recommended this: https://github.com/nexB/scancode-toolkit -- trying a quick installation now to see how it looks on opam source archives.
Yes, this looks also useful. I think scanoss and scancode serve a slightly different purpose. Scanoss looks more at the source code to see if some unspecified open source code sneaked in. Scancode seems to concentrate on the licenses. So I would start with Scancode, but the other also looks useful.
Luckily, most modern opam packages work with the topkg conventions and actually name their license files LICENSE.*.
Even more luckily most modern opam packages work with the odig packaging convention (a notable one that does not is the ocaml package, it's very sad not to be able to odig changes ocaml :-)
This means that if you build your product via an opam switch you can simply odig license --no-pager to get the list of all licenses of the installed packages separated by U+001C.
Other useful commands include odig show license-files (to get the paths to licenses) and odig show -l --show-empty license (to get the license tags, including those empty ones).
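Since the output of `odig license --no-pager` is described above as separated by U+001C, post-processing it can be sketched as a simple split. Only the separator comes from the description above; the function name and the assumption that each record is free-form text are illustrative.

```python
FILE_SEPARATOR = "\u001c"  # U+001C, the separator mentioned above


def split_license_records(output):
    """Split captured odig license output into non-empty, stripped records."""
    return [chunk.strip() for chunk in output.split(FILE_SEPARATOR) if chunk.strip()]
```

The input string would typically be the captured standard output of the command, for example via `subprocess.run(..., capture_output=True, text=True).stdout`.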
@dbuenzli : neat! But as you wrote, some core OCaml packages don't support this, so it brings us back to the original topic of this issue: some of the core OCaml packages have below-standard metadata in opam.
A possible way forward would be:
1.) give scancode a test run and see if it works as advertised
2.) upgrade the core OCaml opam packages with metadata which is consistent with scancode and the authors' understanding of the license (using offline/local runs of scancode)
3.) check / update other opam packages
4.) get scancode into opam CI
5.) See if scanoss agrees with the declared licenses
6.) Possibly have scanoss in CI as well
Alternatively, we can just update the opam packages with metadata that is correct to the best of our knowledge.
How do we distribute the work?
Initially, if you could give scancode a test and see if it works as advertised, that would be very helpful. I got sidetracked while installing it, but once you've validated that it's suitable, it should be easy to get running on the ocaml.org cluster.
Thanks for starting this discussion. I came across another utility reuse (https://github.com/fsfe/reuse-tool) which mandates license declarations (using SPDX) in each source file (since I'm not a lawyer, I don't really know what should be done for source code, earlier I thought a (non-empty) LICENSE.md file in the repository should be sufficient).
Adapting reuse (or scancode/scanoss) for the opam repository (+CI to check new packages) would be appreciated. At the end, it'd be great if odig license in an opam switch would truthfully report all the licenses (and authors).
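To illustrate the per-file declarations that reuse asks for, the sketch below looks for an `SPDX-License-Identifier:` tag near the top of a source file and returns the declared expression. It is not part of reuse-tool; the function name, the 20-line window, and the list of comment closers that get stripped are assumptions of this example.

```python
import re

SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*(.+)")
# Trailing comment closers to drop, e.g. OCaml '*)', C '*/', HTML '-->'.
COMMENT_CLOSER = re.compile(r"\s*(\*\)|\*/|-->)\s*$")


def spdx_expression(source_text, max_lines=20):
    """Return the SPDX expression declared in the first lines, or None.

    `max_lines` is an arbitrary window chosen for this sketch.
    """
    for line in source_text.splitlines()[:max_lines]:
        match = SPDX_TAG.search(line)
        if match:
            return COMMENT_CLOSER.sub("", match.group(1)).strip()
    return None
```

For an OCaml file starting with `(* SPDX-License-Identifier: ISC *)`, the function yields the bare expression, which could then be compared with the license field of the opam package.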
I'm catching up on this discussion. I still think a first step would be to inform packagers of the existence of the "License" field (I didn't know about it until this discussion with @MSoegtropIMC) and to gently prod them into filling this field, e.g. via a warning during linting. I'm skeptical that an automated tool can do better than that, but would be happy to be proved wrong.
@xavierleroy : I would say we should work in parallel on both - add a license field to important packages which don't have one and work on infrastructure which does check for obvious errors in the license information. There are a few packages where the license information in opam looks wrong, notably Coq (which afaik is LGPL2.1+ and not LGPL2.1) and GMP (nowadays an immediate dependency of Coq).
and work on infrastructure which does check for obvious errors in the license information.
The CI has been checking the presence of the license field and its compliance with https://spdx.org/licenses/ already for some weeks.
EDIT: I went to double check, and it looks like the lint finds incorrect fields but does not fail on a missing license. I will open an issue upstream.
Adding licenses to some old packages has started happening when they came up from revdeps or metadata fixes, but it is a lot of work. Sometimes we had to ask the maintainers because, for example, the license was not clear from the sources.
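The difference between a missing field and an incorrect one can be sketched as follows. This is not how the opam lint or the CI is implemented; the function name and the small set of identifiers (all taken from the SPDX license list) are illustrative, and compound SPDX expressions are deliberately not handled.

```python
# Illustrative subset of the SPDX license list; the full list is much longer.
KNOWN_SPDX_IDS = {
    "MIT",
    "ISC",
    "BSD-2-Clause",
    "BSD-3-Clause",
    "Apache-2.0",
    "LGPL-2.1-only",
    "LGPL-2.1-or-later",
    "GPL-3.0-only",
    "GPL-3.0-or-later",
}


def check_license_field(value):
    """Classify a license field as 'missing', 'unknown', or 'ok'.

    Only single identifiers are recognised in this sketch.
    """
    if value is None or not value.strip():
        return "missing"
    if value.strip() not in KNOWN_SPDX_IDS:
        return "unknown"
    return "ok"
```

With this distinction, a value such as `LGPL2.1` is reported as unknown (the SPDX spellings are `LGPL-2.1-only` and `LGPL-2.1-or-later`), while an absent field is reported separately as missing.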
Just wanted to ask what the status is. There seems to be some progress on the CI side, but it doesn't seem to be fully in place as discussed yet.
I checked a few opam packages, including the latest one, and none of them have a license field as yet.
The CI checks for this on new submissions now. So all new packages will have licenses, and when this happens we generally ask to add the field to older packages as well.
In addition to the CI checks that @mseri describes, it would be great to figure out the license scanning story. @MSoegtropIMC I'm very short of bandwidth to experiment right now, but feedback on this comment would help advance our knowledge about what to deploy. https://github.com/ocaml/opam-repository/issues/18343#issuecomment-806955334