Spdx-Java-Library icon indicating copy to clipboard operation
Spdx-Java-Library copied to clipboard

Official GPL-2.0 license text not recognized

Open sdheh opened this issue 1 year ago • 9 comments

For the license text https://www.gnu.org/licenses/old-licenses/gpl-2.0.txt I get the following:

System.out.println(Arrays.toString(LicenseCompareHelper.matchingStandardLicenseIds(licenseText)));
System.out.println(LicenseCompareHelper.matchingStandardLicenseIdsWithinText(licenseText));

outputs

[]
[GPL-2.0, GPL-2.0-or-later, GPL-2.0-only]

The two outputs should be the same since the GPL-2.0 license spans the whole file. Tested with version 1.1.11 This problem is similar to 217

sdheh avatar Jul 22 '24 15:07 sdheh

PR #236 includes unit tests that reproduce this problem, albeit with other license texts - the issue is not limited to just GPL-2.0, or indeed even just GPL family licenses.

See also #234.

pmonks avatar Jul 22 '24 16:07 pmonks

I figured out a problem that could explain this case. I think the tokenization does not work properly. Example:

String license1 = "<one";
String template1 = "<<beginOptional>><<<endOptional>>one";
String license2 = "< one";
System.out.println("template1, license1: " + LicenseCompareHelper.isTextMatchingTemplate(template1, license1).getDifferenceMessage());
System.out.println("template1, license2: " + LicenseCompareHelper.isTextMatchingTemplate(template1, license2).getDifferenceMessage());

Returns

template1, license1: Normal text of license does not match at end of text when comparing to template text "one
".  Last optional text was not found due to the optional difference: 
	Normal text of license does not match at end of text when comparing to template text "<"
template1, license2: No difference found

When I debug I see that for the first case in org.spdx.utility.compare.CompareTemplateOutputHandler.compareText the matchTokens parameter is ["<one"]. I think it should instead be ["<", "one"] like in the second case.

Also if I remove all < and > from the https://www.gnu.org/licenses/old-licenses/gpl-2.0.txt text ( gpl-2.0-removed-angle-brackets.txt ) or if I add a space before and after every < and > ( gpl-2.0-spaces-between-angle-brackets-and-text.txt ) I get the following result for the code in the issue description:

[GPL-2.0, GPL-2.0-only]
[GPL-2.0, GPL-2.0-or-later, GPL-2.0-only]

sdheh avatar Jul 23 '24 09:07 sdheh

Thanks @sdheh for the analysis! I agree, the tokenization is the issue. I'm still working on the 3.0 update, so I won't have much time over the next week or so to look for a fix, but if you want to create a pull request I can review / merge.

goneall avatar Jul 23 '24 18:07 goneall

I ran into this issue as well and believe it affects several templates where the optional text would not be tokenized separately. For the above case I did experiment with the following change but am still learning the code base and am unclear on the total impact:

+++ b/src/main/java/org/spdx/utility/compare/CompareTemplateOutputHandler.java
@@ -247,7 +247,13 @@ public class CompareTemplateOutputHandler implements
                                                nextToken = sub.match(matchTokens, nextToken, endToken, originalText, 
                                                                optionalDifference, tokenToLocation);
                                                if (nextToken < 0) {
-                                                       if (!ignoreOptionalDifferences) {
+                                                       // Handle optional text tokenized with license text to handle single char optional text
+                                                       String optionalText = sub.getText();
+                                                       String matchText = LicenseTextHelper.getTokenAt(matchTokens, startToken);
+                                                       if (matchText != null && optionalText != null && matchText.length() > optionalText.length() && matchText.startsWith(optionalText)) {
+                                                               // remove optional text from start of match token
+                                                               matchTokens[startToken] = matchText.substring(optionalText.length());
+                                                       } else if (!ignoreOptionalDifferences) {
                                                                setLastOptionalDifference(optionalDifference);
                                                        }                                       
                                                        return startToken;      // the optional text didn't match, just return the start token

Modifying the parsed tokens feels wrong but could not sort out an easy way to adjust the tokenization of the license text being matched to the template.

douglasclarke avatar Oct 02 '24 14:10 douglasclarke

@douglasclarke this was fixed in PR #249, which is merged but awaiting release.

pmonks avatar Oct 02 '24 14:10 pmonks

@pmonks thanks. I believe I am testing with the latest code on master and still failing to get this case to work:

		String template = "<<beginOptional>><<<endOptional>>one";
		String sample = "<one";
		
		DifferenceDescription diff = LicenseCompareHelper.isTextMatchingTemplate(template, sample);

douglasclarke avatar Oct 02 '24 15:10 douglasclarke

@pmonks thanks. I believe I am testing with the latest code on master and still failing to get this case to work:

		String template = "<<beginOptional>><<<endOptional>>one";
		String sample = "<one";
		
		DifferenceDescription diff = LicenseCompareHelper.isTextMatchingTemplate(template, sample);

Thanks @douglasclarke for reporting this along with the specific example.

This may be a separate issue - let's open a new issue and close this one as resolved since the prior examples seem to be fixed with the latest.

goneall avatar Oct 02 '24 20:10 goneall

Just a head's up that this license list XML issue will also cause matching of (recent) GPL-2.0 texts to fail, but not because of anything wrong in Spdx-Java-Library. I only mention it as sometimes it can be time-consuming to discern whether a matching bug is in the code, the templates, or (as in this case) a breaking change to the official license text by the license publisher.

pmonks avatar Oct 07 '24 18:10 pmonks

This will be resolved once https://github.com/spdx/license-list-XML/pull/2632 is merged. Merging the PR depends on generating a new version of the license list publisher - which I'm hoping to get out in the next week or so.

goneall avatar Apr 16 '25 22:04 goneall