scancode-toolkit icon indicating copy to clipboard operation
scancode-toolkit copied to clipboard

Add support for external licenses in scans

Open KevinJi22 opened this issue 3 years ago • 12 comments

This is for the 2022 Google Summer of Code. This PR contains all the work that needs to be done for the project.

This adds --additional-license-directory as a command line option in license reindexing. This allows users to specify paths to directories of licenses and rules they'd like to use during license detection, but would not like to add to the ScanCode database of licenses. First, the user must run Scancode with the --reindex-licenses option and they can use --additional-license-directory to add directories of licenses or rules to the index.

This involves adding a new option in licensedcode/plugin_license.py, and this option is used as a parameter in scancode/api.py. In this approach, the licenses and rules contained in these additional directories are combined with the existing licenses and rules in the ScanCode database to produce a single index. The code for this is found in licensedcode/cache.py and the helper methods for loading these licenses and rules are found in licensedcode/models.py.

This commit also includes unit tests to verify that license detection succeeds with additional directories and installed licenses/rules. Part of the setup for the unit test and future tests involves creating a new directory in tests/licensedcode/data that contains sample external licenses used in the unit tests.

Fixes #480

Tasks

  • [x] Reviewed contribution guidelines
  • [x] PR is descriptively titled 📑 and links the original issue above 🔗
  • [x] Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR Run tests locally to check for errors.
  • [x] Commits are in uniquely-named feature branch and has no merge conflicts 📁

KevinJi22 avatar Jun 01 '22 02:06 KevinJi22

There is something in your changes that make the tests fail... I do not think you can just use multiple cached directories. You should use a single cache and then validate, then combine multiple input a single cached index.

pombredanne avatar Jun 06 '22 21:06 pombredanne

@pombredanne I'm actually combining all the licenses and rules in the input directories into a single index here. In load_licenses_from_multiple_dirs, all the licenses (including the ones in the ScanCode database) are combined into a single license database, and the same is done with all the rules. The _CACHED_DIRECTORIES variable is used to determine whether a new index needs to be built. For example, if we've created an index with additional directories A and B, but now we run a scan with additional directories B and C, we'd need to recreate the combined index.

I'll update this code with comments so that this is more clear.

https://github.com/nexB/scancode-toolkit/blob/b1cf4aa16386e447b024b2fd3f5386251560628d/src/licensedcode/cache.py#L182-L195

KevinJi22 avatar Jun 08 '22 03:06 KevinJi22

Thanks for the feedback! This bug was actually due to how os.path.join() works, since the path that you passed in wasn't an absolute path, so the paths were just concatenated in the function build_rule_from_license() in models.py. This meant that the external license wasn't being loaded in as a rule. I added some code in get_license_dirs() that converts the input path to an absolute path, which will hopefully prevent this from happening in the future.

This means that running the scan with the same parameters should work this time.

KevinJi22 avatar Jun 11 '22 03:06 KevinJi22

@KevinJi22 Don't feel compelled to amend your commit and force push it every time you make a change. You can create as many commits as you want on this PR. :slightly_smiling_face:

JonoYang avatar Jun 16 '22 22:06 JonoYang

@JonoYang @pombredanne this newest commit adds basic functionality for using installed licenses in scans. It assumes all packages of licenses start with "licenses" (we can change this later) and it assumes a certain directory structure. The only way to test this is manually. I added a directory containing an example package in tests/licensedcode/data/example_external_licenses/licenses_to_install1, and you can test it like so:

  1. Run pip install tests/licensedcode/data/example_external_licenses/licenses_to_install1 to install the licenses
  2. Run ./scancode -l --json-pp ../temp.json /home/kevin/code/scancode-toolkit/tests/licensedcode/data/plugin_license/installed_licenses/scan
  3. In temp.json, you should see the license Example Installed License 1 being detected. You can also run Scancode before installing the license if you want to verify that the code actually works.

KevinJi22 avatar Jun 26 '22 15:06 KevinJi22

@KevinJi22 let me review all this in details.... a couple comments:

  1. the DCO check is failing... there is likely a commit to amend

  2. for the tests, I suggest that this could be a CI-only test. You could add a job in https://github.com/nexB/scancode-toolkit/blob/develop/azure-pipelines.yml or a GH workflow that would exclusively test your feature roughly this way:

2.1. first start by creating a directory in the tests data that would contain the complete and minimal working plugin that contributes some new licenses 2.2. then create a test job that installs scancode normally and then as an extra setup runs pip to install your plugin from that directory (no need to have a wheel or else publish it) 2.3. finally create a test_xxx.py scripts and with a test that assumes that your plugin is installed and only tests the plugin and run that as part of the job

Beyond this there are a few things that would be great to refine and get a well rounded feature: 3. some documentation including pointing to the test sample plugin from 2.

  1. expand the feature to include not only licenses but also 4.1. license detection rules for these new licenses (the rules for the core licenses would be best kept with thee core licenses, location-wise)
    4.2. possibly also license detection tests such that it becomes easy to have tests for the new licenses and possible rules

  2. with the addition of new licenses, there is likely a need to better report issues with the licenses metadata (and also rules metadata with 4.1) ... there are checks done when an index is created and checks done in the tests... See

  • https://github.com/nexB/scancode-toolkit/blob/a61fe17165fef37fb58a99c7b6a0267093ee33ff/tests/licensedcode/test_detection_validate.py#L50
  • https://github.com/nexB/scancode-toolkit/blob/a61fe17165fef37fb58a99c7b6a0267093ee33ff/tests/licensedcode/test_models.py#L108
  • https://github.com/nexB/scancode-toolkit/blob/a61fe17165fef37fb58a99c7b6a0267093ee33ff/src/licensedcode/models.py#L509
  • https://github.com/nexB/scancode-toolkit/blob/a61fe17165fef37fb58a99c7b6a0267093ee33ff/src/licensedcode/models.py#L794

These tests are not aware of plugins of course.... I wonder if some of then may be worthy of integration in the pre or post indexing checks.... as otherwise we may have some lingering issues in the plugins.... And of course if you move of the checks to indexing time, then there needs to be overall a better reporting of any issues to make the message actionable.

pombredanne avatar Jul 08 '22 22:07 pombredanne

@pombredanne just wanted clarify about this point:

These tests are not aware of plugins of course.... I wonder if some of then may be worthy of integration in the pre or post indexing checks.... as otherwise we may have some lingering issues in the plugins....

Are you saying that we might want to do some of the same checks that are done in the tests in the source code? And these checks would happen when the index is created (possibly with external licenses)?

KevinJi22 avatar Jul 11 '22 15:07 KevinJi22

@pombredanne @JonoYang for 5. with the addition of new licenses, there is likely a need to better report issues with the licenses metadata (and also rules metadata with 4.1) ... there are checks done when an index is created and checks done in the tests..., I have currently not implemented it as a CLI option since the checks will only be run when an index is created. If we are loading the index in from the cached file, the checks also will not run, which seems like the behavior we want. However, I can always add a command line option for maximum user flexibility.

KevinJi22 avatar Jul 24 '22 21:07 KevinJi22

@KevinJi22

After talking to @pombredanne, we'd like to modify the CLI option for external licenses. Currently, we have -dir and --additional_directories as CLI options, which we would like to change.

We want to remove -dir as an option because:

  1. We have no other single-dash option that is more than a single letter (e.g. -c, -i, etc.)
  2. Single-dash options are for options that are more commonly used, we don't expect this option to be used on the same frequency as -i

We want to rename --additional_directories to --external_license_directory because:

  • --additional_directories is too vague and something more explicit should be used to immediately convey what the option does / is for

JonoYang avatar Aug 09 '22 22:08 JonoYang

Actually --external-license-directory would be better for consistency as we never use an underscore in a command line option. Note that Click will translate this to an undescore-separated Python variable name alright

pombredanne avatar Aug 10 '22 09:08 pombredanne

@KevinJi22 I would guess this is now ready for review? and no longer a draft/WIP?

pombredanne avatar Sep 12 '22 22:09 pombredanne

@pombredanne yup! Just edited the title to reflect that

KevinJi22 avatar Sep 12 '22 22:09 KevinJi22

@KevinJi22 we (@pombredanne and me) added the following changes:

CLI Options:

  • various reindexing operations are no longer CLI options
  • we have a seperate script scancode-reindex-licenses which does that
  • only one additional directory of licenses/rules can be added, i.e. this is not a multiple option

Tests:

  • Organize tests data under one single folder.
  • Organize test functions under one single file.
  • Keep three tests, one for additional directory, one for additional plugin, and one for combined testing.
  • Edit the CI file accordingly.

Misc:

  • code refactoring at get_rules_from_multiple_dirs, get_rules_from_multiple_dirs etc
  • external license directories/plugin info is returned in scan header

Remaining work:

  • update how license data (links and other) is returned in api.py
  • also report new attribute is_builtin @pombredanne ^ this will create a lot of test file updates, I was thinking I'll do this update in the main LicenseDetection PR as there's lot's of change in license data reporting already.

Otherwise tests are all green and this is ready AFAIK. Please review. :bow:

AyanSinhaMahapatra avatar Oct 12 '22 21:10 AyanSinhaMahapatra

Remaining work:

  • update how license data (links and other) is returned in api.py
  • also report new attribute is_builtin @pombredanne ^ this will create a lot of test file updates, I was thinking I'll do this update in the main LicenseDetection PR as there's lot's of change in license data reporting already.

Modified the code to add is_builtin flags to license match data when there is an additional license directory or additional license plugins installed. So this work is not remaining anymore.

Ready to review @pombredanne!

Was also wondering if there could be a --only-builtin CLI option in scancode-reindex-licenses to reindex the license freshly without having to delete and reconfigure the environment, but not sure how to uninstall plugins from the python script. @pombredanne is this doable?

AyanSinhaMahapatra avatar Oct 20 '22 18:10 AyanSinhaMahapatra

Was also wondering if there could be a --only-builtin CLI option in scancode-reindex-licenses to reindex the license freshly without having to delete and reconfigure the environment, but not sure how to uninstall plugins from the python script. @pombredanne is this doable?

Sure, but remember that each new added option may be a sign of our failure to automatically do the right thing without options :)

pombredanne avatar Oct 21 '22 12:10 pombredanne

Here some extra feedback. I think we should refactor this a bit more such that indexing licenses and rules just takes a list of directories. Everything else should IMHO just use the things already loaded. And we can simplify the detection of builtins based on a list of builtin licenses... This would help keep the code tidier IMHO. @AyanSinhaMahapatra @KevinJi22 let's chat at your convenience

I created https://github.com/nexB/scancode-toolkit/issues/3137 as a follow up!

pombredanne avatar Oct 28 '22 15:10 pombredanne