openvsx icon indicating copy to clipboard operation
openvsx copied to clipboard

Faulty license detection

Open filiptronicek opened this issue 2 years ago • 18 comments

As described in https://github.com/open-vsx/publish-extensions/issues/689, there is something wrong with how we detect licenses.

If you take GitLens and publish it at its latest version (14.0.1), the MIT License gets displayed in the WebUI for it. If you inspect the file itself, you discover that the license is indeed part MIT, but has a header basically stating that it does not apply to all code in the repository. This means that we tell our users the incorrect license.

I am trying to figure out how we do these license checks, but it looks like we're being a bit to lax.

We should rather have false negatives than false positives.

image image

filiptronicek avatar Jul 12 '23 09:07 filiptronicek

@filiptronicek We're taking a different approach from VS Marketplace. What was the thinking?

kineticsquid avatar Jul 12 '23 16:07 kineticsquid

I am unaware of what the initial thinking was, but I think it is a nice way to show license types where we can for users to more easily determine what extensions are FOSS and which are not. I think we should re-visit it and see how much value it adds, and whether it's still worthwhile keeping.

filiptronicek avatar Jul 12 '23 22:07 filiptronicek

@filiptronicek The server searches for the license by using a list of well known open source licenses (MIT, GPL, BSD, etc.). The server doesn't take into account dual licensed extensions. Even if we were able to, I don't know how to concisely name this license (MIT & PLUS?) It's probably better to just add the license file to the Resources section and not display the license name in the header anymore.

amvanbaren avatar Jul 13 '23 08:07 amvanbaren

@amvanbaren I agree with not showing the license name. If we want to be a bit clever, though, we could use an approach employed by GitHub, which I like: if a license format exactly matches the license file, use it. If it doesn't, just display License with a link to it.[^1]

image image
image

[^1]: By the way, GitHub uses https://github.com/licensee/licensee

filiptronicek avatar Jul 13 '23 08:07 filiptronicek

if a license format exactly matches the license file, use it. If it doesn't, just display License with a link to it.

Sounds good. We can call licensee using JRuby.

The other thing is that the ExtensionProcessor assumes that an extension only has 1 license file. It should look for all license files instead.

amvanbaren avatar Jul 13 '23 09:07 amvanbaren

@amvanbaren This could represent a pretty substantial visible change to our users/adopters. Is there an easy way to determine how may extensions would have what shows for their license change?

kineticsquid avatar Jul 13 '23 13:07 kineticsquid

@kineticsquid Is there an easy way to determine how may extensions would have what shows for their license change?

No, there isn't really an easy way. The quickest way is to get all licenses for each extension using licensee and compare that to the extension license in the database.

amvanbaren avatar Jul 13 '23 13:07 amvanbaren

just noticed that rust-analyzer is dual licensed: https://open-vsx.org/extension/rust-lang/rust-analyzer/0.4.1584

amvanbaren avatar Jul 13 '23 15:07 amvanbaren

@amvanbaren I think that comes from the extension manifest

filiptronicek avatar Jul 13 '23 18:07 filiptronicek

@amvanbaren @filiptronicek I've just about finished running licensee on all the extensions. I'm having to do it in batches because it makes enough GH API calls that I run into rate limiting issues. Here's what I'm seeing:

In general, we can divide the extensions into three classes:

  1. An extension has license specified in package.json.
  2. An extension has license specified in files in package.json.
  3. An extension has neither of the above specified.

Running licensee on each extension will result in one of the following for each of the above classes:

  1. No license detected, or unable to run licensee because repo is private and can't be cloned.
  2. License detected and matches what's in package.json.
  3. License detected that does not match what's in package.json.

Given the above 3x3 matrix, what's the thinking on how we might use licensee and how that use could impact what shows up on an extension's page?

kineticsquid avatar Jul 31 '23 15:07 kineticsquid

@kineticsquid Well, I would compare it to the current situation. I don't think much will change, just that the license detection is more accurate.

Licensee is able to detect multiple licenses. If we stick to the 1 license per extension version then we just take the license with the highest confidence score.

On the other hand it's quite a fundamental change if we want to provide all found licenses. Not just for the webui, but also for the database and API endpoints. API consumers like VSCode or VSCodium only expect 1 license, so we also have to keep that in mind.

amvanbaren avatar Aug 01 '23 15:08 amvanbaren

API consumers like VSCode or VSCodium only expect 1 license

Worst case we hack around it with licenses.join(" & "), so that we display all info we have in the manner clients expect. I don't think they do any format checking.

filiptronicek avatar Aug 01 '23 17:08 filiptronicek

@filiptronicek It's a link to the license file, not the license names. I guess we can divide an extension's licenses into a main license and other licenses.

amvanbaren avatar Aug 01 '23 18:08 amvanbaren

I tried a number of different permutations of license files and license entries in package.json to see how our system behaved. By and large, with one exception, it behaved as I would have expected, with a couple of potential bugs (not high priority IMO) E.g.,

  • specify license in package.json that is not an official SPDX license term and do not include a license file. 404 as the license doesn't exist at the spdx.org URL we generate.
  • attempt to publish an extension and get an error, e.g. no license. Re-package and attempt to publish again and get an error that the version already exists.

The exception: based my observations, with this extension, https://open-vsx.org/extension/kineticsquid/jk-helloworld-minimal-sample-kineticsquid, I thought that if one specified license in package.json and included a file named license or LICENSE, what would be displayed on the extension's page would be the package.json entry with a link to the license file. I say this because I noticed that in GitLens package.json, the author specifies license and it is not 'MIT'.

Attempting to reproduce this, I took the LICENSE file from GitLens and removed the copyright header. I then included the same package.json entry as GitLens: package.json: "license": "SEE LICENSE IN LICENSE". With this package, 'MIT' gets displayed as the license. That's version 0.0.19: https://open-vsx.org/extension/kineticsquid/jk-helloworld-minimal-sample-kineticsquid/0.0.19.

In 0.0.21, https://open-vsx.org/extension/kineticsquid/jk-helloworld-minimal-sample-kineticsquid/0.0.21, I changed the package.json entry to: "license": "My special license", but did not change the license file. In this case, 'My Special License' is displayed.

it seems strange that the package.json entry would affect the license detection.

kineticsquid avatar Aug 03 '23 17:08 kineticsquid

The above said, I think at a minimum, if an author specifies license in package.json, that's the text we should display and link to the provided license file. It's up to the author/publisher to ensure this information is correct. If an entry is misleading, a user reports it and we investigate.

I'm also wondering if attempting license detection is worth it. E.g. if there is a license file and no entry in package.json we display 'Provided License' and link to the file.

kineticsquid avatar Aug 03 '23 17:08 kineticsquid

Thinking about this some more, I think this is what open-vsx should do:

  • If no license in package.json, error on publishing. This happens today.
  • If no license in package.json and license file present, do not attempt to understand the license file and instead display Provided License and link to the file.
  • If license is specified in package.json, display those words. If a license file is present, link to it.

@waynebeaton @mbarbero FYI

kineticsquid avatar Aug 10 '23 14:08 kineticsquid

@kineticsquid Do you want to apply the same logic for README and CHANGELOG files?

amvanbaren avatar Sep 05 '23 11:09 amvanbaren

@amvanbaren No, I think the current behavior is fine. To wit, just a message that indicates either No README available or No changelog available.

kineticsquid avatar Sep 05 '23 18:09 kineticsquid