licensee icon indicating copy to clipboard operation
licensee copied to clipboard

Doesn't choose license from standard LICENSE file if any other file with the word license appears, eg. check_license_exists.sh

Open HariSekhon opened this issue 4 years ago • 11 comments

Describe the bug

When licensee detects two files with the word license in it, it gets confused and doesn't use the more obvious one, there is no priority for the standard LICENSE file so no license assertion is made.

Steps to reproduce the behavior

$ git clone https://github.com/HariSekhon/DevOps-Bash-tools.git

$ cd DevOps-Bash-tools

$ licensee detect 
License:        NOASSERTION
Matched files:  LICENSE, check_license_exists.sh
LICENSE:
  Content hash:  4c2c763d64bbc7ef2e58b0ec6d06d90cee9755c9
  Attribution:   Copyright 2016 Hari Sekhon
  Confidence:    100.00%
  Matcher:       Licensee::Matchers::Exact
  License:       MIT
check_license_exists.sh:
  Content hash:  0c66256c94dbbe7d947c7a9957bb79dd92960e0d
  License:       NOASSERTION
  Closest non-matching licenses:
    Unlicense similarity:     17.54%
    BSD-2-Clause similarity:  16.33%
    BSL-1.0 similarity:       14.29%

Expected behavior

Expected it to determine the MIT license from the standard LICENSE file.

HariSekhon avatar Feb 15 '22 12:02 HariSekhon

Licensee intentionally matches files with additional words because sometimes actual license files have additional words in their names. I imagine we could add .sh to https://github.com/licensee/licensee/blob/06f5719b119513e00af73f4a578ed1ec0bc53d16/lib/licensee/project_files/license_file.rb#L17 to exclude shell scripts with license in their filenames though.

mlinksva avatar Feb 15 '22 20:02 mlinksva

@mlinksva that ext regex list could start getting long as you may then need to also exclude .py, .pl, .rb, .groovy and many other languages which might have code with the word license in the filename... such as check_license_expiry.pl or update_license_key.pl or something...

My suggestion would be to prioritise the LICENSE file if one exists as it's hard to imagine a scenario with an explicit license file of that convention being superceded by another file which happens to have license somewhere in the filename of a script or program.

A precedence / priority of file naming conventions would solve this in a simpler way, you could even stop at the first filename that matches a given ordered list and this could be clearly documented as to the expected behaviour.

HariSekhon avatar Feb 16 '22 12:02 HariSekhon

Ignoring random file with license in the name if a more standard named file exists in the project sounds reasonable. There's already some prioritizaiton of some filenames (see the linked file above) though it's just used as a way to organize the various regexes and doesn't factor into what files are looked at.

Implementing will require a bit of thought, leaving issue open.

mlinksva avatar Feb 18 '22 04:02 mlinksva

I've implemented this sort of file priority before, usually I just have an ordered list of standard filenames in priority order and iterate each one, stopping at the first found... not exactly an advanced algorithm but it benefits from speed of earlier termination and not having to evaluate lower priority or random matching files that would be discarded anyway.

The only thing one has to choose is the precendence order of standard filenames, but I believe it should be fairly obvious that LICENSE is high in the list before it starts to search for random matched files which will include some_random_license_check_program.pl and similar.

HariSekhon avatar Feb 18 '22 11:02 HariSekhon

🤔 I don't see any reason (other than administrative complexity) of ignore-listing the top N 100% always code file extensions, as I don't expect, e.g., a license.php file to contain the project's license intention. The other route would be to create a manually curated list of known license extensions (including no extension), but given the project's history, we've generally err'd on the side of false positives, rather than false negatives.

One other idea would be to introduce a new .licensee-ignore or similar pattern similar to .gitignore, that if present, Licensee would exclude those files from its detection. That would be a clear indication on the part of the project maintainer that the listed files did not manifest licensing intent. It would require a small change on the github.com push logic side, but may reduce the overall maintenance burden vs. maintaining an allow/block list of extensions.

benbalter avatar Feb 25 '22 21:02 benbalter

I have a similar problem related to this, see the following output:

licensee detect node_modules/@elastic/elasticsearch
License:        NOASSERTION
Matched files:  LICENSE, README.md, package.json
LICENSE:
  Content hash:  bec905d850e7f5dc2e2db78a950d4a9db560a0b8
  Confidence:    100.00%
  Matcher:       Licensee::Matchers::Exact
  License:       Apache-2.0
README.md:
  Content hash:  fb751a4806d36d92000a961e119a503a88692fa6
  License:       NOASSERTION
  Closest non-matching licenses:
    WTFPL similarity:  7.34%
    0BSD similarity:   3.83%
    ISC similarity:    3.40%
package.json:
  Confidence:  90.00%
  Matcher:     Licensee::Matchers::NpmBower
  License:     Apache-2.0
licensee detect node_modules/color-convert
License:        NOASSERTION
Matched files:  LICENSE, README.md, package.json
LICENSE:
  Content hash:  4c2c763d64bbc7ef2e58b0ec6d06d90cee9755c9
  Attribution:   Copyright (c) 2011-2016 Heather Arthur <[email protected]>
  Confidence:    100.00%
  Matcher:       Licensee::Matchers::Exact
  License:       MIT
README.md:
  Content hash:  da39a3ee5e6b4b0d3255bfef95601890afd80709
  Confidence:    100.00%
  Matcher:       Licensee::Matchers::Copyright
  License:       NONE
package.json:
  Confidence:  90.00%
  Matcher:     Licensee::Matchers::NpmBower
  License:     MIT

It should detect Apache-2.0 for elasticsearch, but the low confidence matches from the README confuse it. It would be nice to be able to configure an ignore threshold, at least for the README matcher.

I'm not sure what would be the best solution for the color-convert issue. I understand that it's a problem if multiple sources disagree on a license, but in theory more source should also help detection.

villelahdenvuo avatar Feb 28 '22 09:02 villelahdenvuo

elastic/elasticsearch-js

This is the intended behavior when the detect_readme option is enabled. Licensee is properly detecting that there is a License heading within the README, but is unable to parse the human-readable licensing intention in the README. If you submit a pull request to update the license reference to the correct license name, Apache license 2.0, it should detect the license reference as expected.

Qix-/color-convert

This could be considered a bug, and you're welcome to open a separate issue if you'd like to discuss the behavior further. In short, Licensee is matching both the copyright and reference matchers, but https://github.com/licensee/licensee/blob/6b4871a3fc8854486f6af792cad5017cf69ae346/lib/licensee/project_files/project_file.rb#L70 assumes that only one match will ever exist. In theory, we should favor the higher confidence match (in this case, the copyright matcher), but in practice, we should probably only use the copyright matcher as a last resort if another matcher matches.

benbalter avatar Feb 28 '22 16:02 benbalter

@benbalter I understand your point about the README match, but I think it would be very useful to be able to set an ignore_threshold as well as the current confidence_threshold. It could default to 0, so it wouldn't be a breaking change. That way if I set the ignore_threshold to let's say 10, it would ignore the fuzzy README matches, but I can still benefit from README matches if some projects don't offer other license details.

I agree that the defaults should be strict, but it would be nice to be able to tweak the detection accuracy if needed. I can open another issue for a feature proposal if you think it's something worth doing.

I can open a new ticket about the copyright matcher issue.

villelahdenvuo avatar Feb 28 '22 16:02 villelahdenvuo

One other idea would be to introduce a new .licensee-ignore or similar pattern similar to .gitignore, that if present, Licensee would exclude those files from its detection.

To further this discussion, I spiked out that functionality in https://github.com/licensee/licensee/pull/545 and would appreciate feedback on the approach as a potential long-term and self-serve solution for false positives due to license-like filenames.

benbalter avatar Mar 25 '22 22:03 benbalter

#534 #545 #490 #568 #569 #552

ghost avatar Jun 12 '22 10:06 ghost

#490

ghost avatar Jun 12 '22 10:06 ghost