License classifier: emit a warning instead of raising an error?

Open DimitriPapadopoulos opened this issue 5 months ago • 10 comments

Part of the ecosystem has not yet implemented PEP 639 (pip-licenses) and projects might need to keep the License classifier in addition to the license key in pyproject.toml, for compatibility with such tools.

While setuptools just emits a warning (see https://github.com/pypa/setuptools/issues/4938), flit raises an error. Perhaps that's a bit too much? How about a transition period?

DimitriPapadopoulos avatar Jul 31 '25 20:07 DimitriPapadopoulos

It sounds like pip-licenses is more or less unmaintained, and there's a fork pip-licenses-cli which does support PEP 639 license expressions. I haven't used either, but it sounds like this is about collecting license information from other people's projects. At this point a lot of projects have probably moved to the new format and removed the classifiers already, so downgrading the error to a warning isn't going to make much, if any, difference.

Are you aware of other tools that are broken by this? I don't mind making it a warning if there's a decent reason to do so, but pip-licenses doesn't seem like a compelling case to me.

(In case it's not clear, you can still use the classifiers if you don't specify a new-style license expression, it's only mixing the two that triggers an error)
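To make the rule concrete, here is a minimal Python sketch of the behaviour described above, not flit's actual code. The function name `check_license_metadata`, the `strict` flag, and the dict layout (mirroring the `[project]` table of pyproject.toml) are all assumptions made for illustration. With `strict=True` it mimics raising an error on the mix; with `strict=False` it mimics the warning-only behaviour this issue asks about.

```python
import warnings

# Trove classifiers describing a license all start with this prefix.
LICENSE_CLASSIFIER_PREFIX = "License ::"


def check_license_metadata(project, strict=True):
    """Hypothetical check, not flit's implementation.

    `project` is a dict mirroring the [project] table of pyproject.toml.
    Returns the list of license classifiers that were found.
    """
    license_value = project.get("license")
    # Under PEP 639 a plain string is a license expression;
    # a table (dict) is the older form and is not treated as one here.
    has_expression = isinstance(license_value, str)
    legacy = [
        c for c in project.get("classifiers", [])
        if c.startswith(LICENSE_CLASSIFIER_PREFIX)
    ]
    if has_expression and legacy:
        message = (
            "License classifiers are deprecated in favour of the license "
            f"expression; found both {license_value!r} and {legacy!r}"
        )
        if strict:
            raise ValueError(message)
        warnings.warn(message, DeprecationWarning, stacklevel=2)
    return legacy


if __name__ == "__main__":
    # Classifier only, no license expression: accepted without complaint.
    print(check_license_metadata(
        {"classifiers": ["License :: OSI Approved :: MIT License"]}
    ))
```

Only the combination of a string `license` and a `License ::` classifier is flagged; either one on its own passes through untouched.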

takluyver avatar Aug 01 '25 13:08 takluyver

Here is the (initial) background of this issue:

  • https://github.com/astral-sh/ruff/pull/19599

It points to these issues:

  • https://github.com/python/typing_extensions/issues/576
  • https://github.com/python/typing_extensions/issues/563
  • https://github.com/python/typing_extensions/issues/562
  • https://github.com/python/typing_extensions/issues/559#issuecomment-2755026495
  • https://github.com/python/typing_extensions/pull/584

These issues suggest other tools might lack (or have lacked) PEP 639 support:

Perhaps it's not worth "supporting" a handful of tools that will hopefully be updated soon (or disappear). Or, as you suggest, avoid new-style license expressions if you do need to take these tools into account.

DimitriPapadopoulos avatar Aug 01 '25 14:08 DimitriPapadopoulos

The Google licensecheck tool scans for the full text contents of licenses and does some fuzzy matching to identify familiar licenses - at least that's what I understand from a brief glance. There's a reference in one of the typing_extensions issues to a different LicenseCheck tool, specific to Python, which has already added PEP 639 support, so I think the Google tool of the same name is a red herring.

It's not clear to me how scancode-toolkit works, but I can't see any issues about PEP 639, which suggests it doesn't need classifiers specifically. Even if it does, you'd run into the same issue that you need to persuade all the packages you use to keep using the deprecated classifier, which doesn't seem realistic.

I'll leave this open for now. If it turns out that disallowing the combination of old & new is causing real practical problems beyond one unmaintained tool, we can reevaluate. Otherwise, let's keep moving forwards. 🙂

takluyver avatar Aug 01 '25 14:08 takluyver

I think that going a nontrivial distance out of the way in order to forbid a particular text string in a field designed to take free-form text strings, that's been used for a very long time, on the basis that one isn't cool if one still uses it, is needlessly prescriptive. In particular since it appears to be the ONLY restriction that flit_core has on this free-form text string.

It's a fatal error which states that a thing is deprecated. Fatal errors are the exact opposite of a deprecation.

The sharp edges of this are considerable. But it's less about the specifics and more about the guiding design that would lead this issue to occur in the first place. The python packaging community is notorious for a high rate of churn. Interoperability standards aren't exempt from this but they should be. And by making it a fatal error to both use new technology and keep the old approach working at the same time for people who still have workflows depending on it, flit contributes to an ecosystem where standards represent cliffs. It is impossible to support new use cases without breaking old use cases. All software in the ecosystem is required to move in lockstep because if you update one thing all on its own you break everything else. No consideration for staged rollouts.

It is impossible to thoughtfully consider a PR from the community adding support for a new feature without taking sides against people relying on the old feature.

This isn't like changing the way one interacts with flit itself, where you can say, okay, just pin the version of the build backend you're compatible with in order to produce a wheel correctly. This is flit making it impossible to describe a valid, standards-compliant wheel, because the standard advises against it but doesn't forbid it.

> It's not clear to me how scancode-toolkit works, but I can't see any issues about PEP 639, which suggests it doesn't need classifiers specifically. Even if it does, you'd run into the same issue that you need to persuade all the packages you use to keep using the deprecated classifier, which doesn't seem realistic.

Surely that's precisely the problem -- there is no issue about PEP 639 which means it must be depending on classifiers in order to accomplish its goals.

Hence it is not possible to add support for PEP 639 to a package without being forced against your will to drop support for scancode-toolkit -- irrespective of whether you as a project author wish to or care about scancode-toolkit, you've lost the ability to make a choice.

Maybe as a project you do care about this use case, and you know that you have users that by an amazing coincidence only depend on you and a handful of other projects with equal interest in compatibility -- doesn't matter, you can't do anything other than switch build backends.

eli-schwartz avatar Aug 01 '25 20:08 eli-schwartz

... I should probably clarify. I think the probably-correct choice for projects to make in this scenario is to decline to support PEP 639 at all, but continue to use flit. Flit is, overall, still the best existing build backend out there, and I don't want people to move away from it. :D

eli-schwartz avatar Aug 01 '25 21:08 eli-schwartz

Note that the old project.license.text field never worked as intended. The license text was just silently discarded. From that perspective I believe it makes a lot of sense to now emit a warning / error if it is used incorrectly.

Users who don't want to adopt PEP 639 just yet are free to continue using the license classifiers. I don't see them being removed soon. The warning is just in place to guide the majority towards license expressions which can often better articulate the license intent. Even with the shortcomings which are still being discussed over in the Python discuss forum.

cdce8p avatar Aug 01 '25 21:08 cdce8p

> there is no issue about PEP 639 which means [scancode-toolkit] must be depending on classifiers

It's not specific to Python, and it looks like a pretty big & popular project. It seems much more plausible that it detects licenses without relying on specifics of Python packaging metadata than that it's broken by PEP 639 and no-one has opened an issue about it yet. By all means prove me wrong - Flit 3.12 is published with only a license expression and no classifier, for instance - but until then I don't think we can count it as incompatible.
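For anyone who wants to check what an installed distribution actually declares, a minimal sketch using the standard library's `importlib.metadata` might look like the following. The helper names `summarize_license_metadata` and `summarize_installed` are made up for illustration. The sketch relies on the `License-Expression` core metadata field introduced by PEP 639 and on the repeated `Classifier` field.

```python
from importlib.metadata import PackageNotFoundError, metadata


def summarize_license_metadata(meta):
    """Summarize license info from a metadata object.

    `meta` is anything with .get() and .get_all(), such as the object
    returned by importlib.metadata.metadata() or an email.message.Message.
    """
    expression = meta.get("License-Expression")  # None when the field is absent
    license_classifiers = [
        c for c in (meta.get_all("Classifier") or [])
        if c.startswith("License ::")
    ]
    return {
        "expression": expression,
        "license_classifiers": license_classifiers,
    }


def summarize_installed(name):
    """Hypothetical convenience wrapper: None if `name` is not installed."""
    try:
        return summarize_license_metadata(metadata(name))
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints whatever the locally installed distribution declares, if any.
    print(summarize_installed("flit"))
```

A distribution that has fully moved to the new format would show a non-empty `expression` and an empty `license_classifiers` list, which is the situation a classifier-dependent tool would have to cope with.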

Again, if someone shows a broader practical harm here, I'm happy to turn the error into a warning. For that matter, if someone wants to make a PR doing that and no-one objects for a few days, I'm OK with it anyway. But so far the practical impact seems to be limited to one unmaintained project, which is gradually getting broken as people migrate to the new format anyway, so it seems like a waste of time to me.

takluyver avatar Aug 03 '25 09:08 takluyver

How does ScanCode detect licenses?

For license detection, ScanCode uses a (large) number of license texts and license detection ‘rules’ that are compiled in a search index. When scanning, the text of the target file is extracted and used to query the license search index and find license matches.

For copyright detection, ScanCode uses a grammar that defines the most common and less common forms of copyright statements. When scanning, the target file text is extracted and ‘parsed’ with this grammar to extract copyright statements.

ScanCode-Toolkit performs the scan on a codebase in the following steps:

  1. Collect an inventory of the code files and classify the code using file types
  2. Extract files from any archive using a general purpose extractor
  3. Extract texts from binary files if needed
  4. Use an extensible rules engine to detect open source license text and notices
  5. Use a specialized parser to capture copyright statements
  6. Identify packaged code and collect metadata from packages
  7. Report the results in the formats of your choice (JSON, CSV, etc.) for integration with other tools

Not much information specific to Python I'm afraid. However, due to the extensive number of steps and methods to retrieve licences, I find it hard to believe the software is failing solely due to the introduction of PEP 639.

Not sure how to interpret the output of ScanCode on typing_extensions:

Testing ScanCode on typing_extensions
$ # lines 13 and 14 of pyproject.toml contain licensing information
$ awk 'NR >= 13 && NR <= 14' typing_extensions/pyproject.toml 
license = "PSF-2.0"
license-files = ["LICENSE"]
$ 
$ scancode --license --json-pp OUTPUT.json typing_extensions
Setup plugins...
Collect file inventory...
Scan files for: licenses with 7 process(es)...
[####################] 48                                    
Scanning done.
Summary:        licenses with 7 process(es)
Errors count:   0
Scan Speed:     2.15 files/sec. 
Initial counts: 31 resource(s): 24 file(s) and 7 directorie(s) 
Final counts:   31 resource(s): 24 file(s) and 7 directorie(s) 
Timings:
  scan_start: 2025-08-03T101802.709826
  scan_end:   2025-08-03T101816.684158
  setup_scan:licenses: 2.81s
  setup: 2.81s
  scan: 11.15s
  total: 13.98s
$ 
$ cat OUTPUT.json 
[...]
      "reference_matches": [
        {
          "license_expression": "psf-2.0",
          "license_expression_spdx": "PSF-2.0",
          "from_file": "typing_extensions/pyproject.toml",
          "start_line": 13,
          "end_line": 13,
          "matcher": "2-aho",
          "score": 100.0,
          "matched_length": 4,
          "match_coverage": 100.0,
          "rule_relevance": 100,
          "rule_identifier": "psf-2.0_6.RULE",
          "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/psf-2.0_6.RULE"
        },
        {
          "license_expression": "unknown-license-reference",
          "license_expression_spdx": "LicenseRef-scancode-unknown-license-reference",
          "from_file": "typing_extensions/pyproject.toml",
          "start_line": 14,
          "end_line": 14,
          "matcher": "2-aho",
          "score": 100.0,
          "matched_length": 3,
          "match_coverage": 100.0,
          "rule_relevance": 100,
          "rule_identifier": "unknown-license-reference_386.RULE",
          "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/unknown-license-reference_386.RULE"
        },
[...]
$ 

Does it fail on line 14 but not line 13? Looking at src/licensedcode/data/rules/unknown-license-reference_386.RULE doesn't really help interpret the output.

Also, I cannot find any issue related to typing_extensions: https://github.com/aboutcode-org/scancode-toolkit/issues

DimitriPapadopoulos avatar Aug 03 '25 10:08 DimitriPapadopoulos

That looks like it's noticing a pattern like license.*=, correctly interpreting the SPDX license expression, but not really knowing what to do with the reference to the license file. I would presume that general scenario is pretty common across languages, so users are probably used to interpreting that output to mean it's PSF-2.0. Perhaps there's even another layer of logic somewhere that can simplify it automatically. Adding the classifier as well might give it another PSF match, but it doesn't look like it would get rid of the 'unknown' match.

takluyver avatar Aug 03 '25 11:08 takluyver

Being a user of SCTK (scancode-toolkit) and having scanned multiple hundreds of packages with it, I can confirm that the LicenseRef-scancode-unknown-license-reference SPDX identifiers are quite common for such references and something I regularly discard/improve in corresponding post-processing steps. It does not matter which build backend (or even which programming language) you are working with. The detailed JSON output shows which matcher and rules were responsible for the report. Scientific studies/papers comparing different license scanning tools usually state that the output of SCTK tries to cover as much as possible (with the risk of false positives to sort out later) instead of potentially missing important stuff (which it might still do), while other tools might omit such false positives but also miss some additional valid matches.
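As an illustration of that kind of post-processing, here is a minimal Python sketch that drops the unknown-reference matches and keeps the distinct SPDX expressions. It is not the actual post-processing described above. The function name `known_spdx_expressions` is hypothetical, and the only input shape assumed is a list of match dicts with the `license_expression_spdx` key visible in the JSON excerpt earlier in the thread. How that list is reached inside the full output file is deliberately left out, since the excerpt elides it.

```python
# SPDX identifier ScanCode reports when it sees a license reference
# it cannot resolve (taken from the JSON excerpt in this thread).
UNKNOWN_REFERENCE = "LicenseRef-scancode-unknown-license-reference"


def known_spdx_expressions(matches):
    """Return distinct SPDX expressions, in first-seen order,
    skipping unknown license references and matches without the key."""
    seen = []
    for match in matches:
        spdx = match.get("license_expression_spdx")
        if spdx and spdx != UNKNOWN_REFERENCE and spdx not in seen:
            seen.append(spdx)
    return seen


if __name__ == "__main__":
    # The two matches from the typing_extensions scan, reduced to the
    # fields this sketch reads.
    reference_matches = [
        {"license_expression_spdx": "PSF-2.0", "start_line": 13},
        {"license_expression_spdx": UNKNOWN_REFERENCE, "start_line": 14},
    ]
    print(known_spdx_expressions(reference_matches))
```

Applied to the two matches from the typing_extensions scan, this leaves only the PSF-2.0 match from line 13 of pyproject.toml.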

Regarding pip-licenses: Yes, this package is more or less unmaintained and has had several requests and PRs for PEP 639 support, while still being widely used. pip-licenses-cli (together with its backend pip-licenses-lib) got some traction lately and is an actively maintained fork (including - besides other improvements - support for PEP 639). (Disclaimer: I am the maintainer of the pip-licenses-[cli|lib] packages.)

stefan6419846 avatar Sep 12 '25 10:09 stefan6419846