Faulty license detection
As described in https://github.com/open-vsx/publish-extensions/issues/689, there is something wrong with how we detect licenses.
If you take GitLens and publish it at its latest version (14.0.1), the MIT License gets displayed in the WebUI for it. If you inspect the file itself, you discover that the license is indeed part MIT, but has a header basically stating that it does not apply to all code in the repository. This means that we tell our users the incorrect license.
I am trying to figure out how we do these license checks, but it looks like we're being a bit to lax.
We should rather have false negatives than false positives.
@filiptronicek We're taking a different approach from VS Marketplace. What was the thinking?
I am unaware of what the initial thinking was, but I think it is a nice way to show license types where we can for users to more easily determine what extensions are FOSS and which are not. I think we should re-visit it and see how much value it adds, and whether it's still worthwhile keeping.
@filiptronicek The server searches for the license by using a list of well known open source licenses (MIT, GPL, BSD, etc.). The server doesn't take into account dual licensed extensions. Even if we were able to, I don't know how to concisely name this license (MIT & PLUS?) It's probably better to just add the license file to the Resources section and not display the license name in the header anymore.
@amvanbaren I agree with not showing the license name. If we want to be a bit clever, though, we could use an approach employed by GitHub, which I like: if a license format exactly matches the license file, use it. If it doesn't, just display License with a link to it.[^1]
[^1]: By the way, GitHub uses https://github.com/licensee/licensee
if a license format exactly matches the license file, use it. If it doesn't, just display License with a link to it.
Sounds good. We can call licensee using JRuby.
The other thing is that the ExtensionProcessor assumes that an extension only has 1 license file. It should look for all license files instead.
@amvanbaren This could represent a pretty substantial visible change to our users/adopters. Is there an easy way to determine how may extensions would have what shows for their license change?
@kineticsquid Is there an easy way to determine how may extensions would have what shows for their license change?
No, there isn't really an easy way. The quickest way is to get all licenses for each extension using licensee and compare that to the extension license in the database.
just noticed that rust-analyzer is dual licensed: https://open-vsx.org/extension/rust-lang/rust-analyzer/0.4.1584
@amvanbaren I think that comes from the extension manifest
@amvanbaren @filiptronicek I've just about finished running licensee on all the extensions. I'm having to do it in batches because it makes enough GH API calls that I run into rate limiting issues. Here's what I'm seeing:
In general, we can divide the extensions into three classes:
- An extension has
licensespecified inpackage.json. - An extension has
licensespecified infilesinpackage.json. - An extension has neither of the above specified.
Running licensee on each extension will result in one of the following for each of the above classes:
- No license detected, or unable to run
licenseebecause repo is private and can't be cloned. - License detected and matches what's in
package.json. - License detected that does not match what's in
package.json.
Given the above 3x3 matrix, what's the thinking on how we might use licensee and how that use could impact what shows up on an extension's page?
@kineticsquid Well, I would compare it to the current situation. I don't think much will change, just that the license detection is more accurate.
Licensee is able to detect multiple licenses. If we stick to the 1 license per extension version then we just take the license with the highest confidence score.
On the other hand it's quite a fundamental change if we want to provide all found licenses. Not just for the webui, but also for the database and API endpoints. API consumers like VSCode or VSCodium only expect 1 license, so we also have to keep that in mind.
API consumers like VSCode or VSCodium only expect 1 license
Worst case we hack around it with licenses.join(" & "), so that we display all info we have in the manner clients expect. I don't think they do any format checking.
@filiptronicek It's a link to the license file, not the license names. I guess we can divide an extension's licenses into a main license and other licenses.
I tried a number of different permutations of license files and license entries in package.json to see how our system behaved. By and large, with one exception, it behaved as I would have expected, with a couple of potential bugs (not high priority IMO) E.g.,
- specify
licenseinpackage.jsonthat is not an official SPDX license term and do not include alicensefile. 404 as the license doesn't exist at the spdx.org URL we generate. - attempt to publish an extension and get an error, e.g. no license. Re-package and attempt to publish again and get an error that the version already exists.
The exception: based my observations, with this extension, https://open-vsx.org/extension/kineticsquid/jk-helloworld-minimal-sample-kineticsquid, I thought that if one specified license in package.json and included a file named license or LICENSE, what would be displayed on the extension's page would be the package.json entry with a link to the license file. I say this because I noticed that in GitLens package.json, the author specifies license and it is not 'MIT'.
Attempting to reproduce this, I took the LICENSE file from GitLens and removed the copyright header. I then included the same package.json entry as GitLens: package.json: "license": "SEE LICENSE IN LICENSE". With this package, 'MIT' gets displayed as the license. That's version 0.0.19: https://open-vsx.org/extension/kineticsquid/jk-helloworld-minimal-sample-kineticsquid/0.0.19.
In 0.0.21, https://open-vsx.org/extension/kineticsquid/jk-helloworld-minimal-sample-kineticsquid/0.0.21, I changed the package.json entry to: "license": "My special license", but did not change the license file. In this case, 'My Special License' is displayed.
it seems strange that the package.json entry would affect the license detection.
The above said, I think at a minimum, if an author specifies license in package.json, that's the text we should display and link to the provided license file. It's up to the author/publisher to ensure this information is correct. If an entry is misleading, a user reports it and we investigate.
I'm also wondering if attempting license detection is worth it. E.g. if there is a license file and no entry in package.json we display 'Provided License' and link to the file.
Thinking about this some more, I think this is what open-vsx should do:
- If no
licenseinpackage.json, error on publishing. This happens today. - If no
licenseinpackage.jsonand license file present, do not attempt to understand the license file and instead displayProvided Licenseand link to the file. - If
licenseis specified inpackage.json, display those words. If a license file is present, link to it.
@waynebeaton @mbarbero FYI
@kineticsquid Do you want to apply the same logic for README and CHANGELOG files?
@amvanbaren No, I think the current behavior is fine. To wit, just a message that indicates either No README available or No changelog available.