license-list-XML non-breaking spaces present in several text versions of licenses

non-breaking spaces present in several text versions of licenses

Open jamin-aws-ospo opened this issue 1 year ago • 18 comments

I've noticed that non-breaking spaces (and other non-printable characters) are present in several of the text versions of licenses.

non-breaking space (M-BM- as output by cat -A, or hex 'A0'):

$ grep -lP '\xa0' spdx/license-list-data/text/*
spdx/license-list-data/text/APL-1.0.txt
spdx/license-list-data/text/GD.txt
spdx/license-list-data/text/IBM-pibs.txt
spdx/license-list-data/text/LAL-1.2.txt
spdx/license-list-data/text/LiLiQ-P-1.1.txt
spdx/license-list-data/text/LiLiQ-R-1.1.txt
spdx/license-list-data/text/LiLiQ-Rplus-1.1.txt
spdx/license-list-data/text/NCGL-UK-2.0.txt
spdx/license-list-data/text/NPL-1.1.txt
spdx/license-list-data/text/OCCT-PL.txt
spdx/license-list-data/text/OGL-UK-1.0.txt
spdx/license-list-data/text/SHL-2.0.txt
spdx/license-list-data/text/SHL-2.1.txt
spdx/license-list-data/text/SISSL.txt
spdx/license-list-data/text/W3C-19980720.txt

Another, unknown non-printing sequence (M-bM-^@M- as output by cat -A) is present in:

spdx/license-list-data/text/APL-1.0.txt
spdx/license-list-data/text/OGL-UK-1.0.txt
spdx/license-list-data/text/OSET-PL-2.1.txt

Mar 23 '23 20:03 jamin-aws-ospo

Are these due to some kind of encoding issue? These are copied from the test license text which are typically copy/pasted from the original license.

@jamin-aws-ospo Can you check of the origin test licenses have the same issues? That would help isolate if this is a source problem or a tooling problem.

Mar 23 '23 20:03 goneall

I'd be happy to check them, but I'm not sure what source(s) you consider to be the original.

For instance, what is the original (or official source of the) APL-1.0?

Mar 23 '23 20:03 jamin-aws-ospo

I'd be happy to check them, but I'm not sure what source(s) you consider to be the original.

I would just look at the files in this test file directory and see if you're seeing the same issue: https://github.com/spdx/license-list-XML/tree/main/test/simpleTestForGenerator I'm not sure I would actually call these file "original", but they are the source files used to populate the spdx/license-list-data/text/ directory. The text in the test directory was submitted when the license XML files were submitted. They came from a lot of different sources. It would be up to the submitter to decide what the original (or official) source is. There are cases where there is no determinate original text version of the license.

If you don't see the same issue in the test directory, I would suspect an issue in the tool that generated the license list data.

Mar 23 '23 21:03 goneall

The issue is present with each of the files listed in the referenced test file directory.

Mar 23 '23 21:03 jamin-aws-ospo

Thanks @jamin-aws-ospo for the analysis.

It looks like this is an issue with the source of the text files.

If we want to clean these up, we can just do a pull request on the source files.

To prevent future issues like this, we can add a check to the CI for this repo to check for unprintable characters.

Just as an FYI - the SPDX license matcher utility handles many of the unprintable characters by normalizing them to a space.

Mar 24 '23 03:03 goneall

I believe the text files should be cleaned/purged of non-printing characters. I also believe that (like myself) others will use the text files as direct inputs/reference for other license based tooling, rather than using the existing license matching utility.

Mar 24 '23 14:03 jamin-aws-ospo

I found this issue because of differences between what different languages consider to be whitespace. Seems that not all languages consider non-breaking space to be part of the POSIX space regular expression character class.

Mar 24 '23 15:03 jamin-aws-ospo

I believe the text files should be cleaned/purged of non-printing characters. I also believe that (like myself) others will use the text files as direct inputs/reference for other license based tooling, rather than using the existing license matching utility.

I agree - I just checked the advice for adding the text files and it does mention removing some of the special characters. So changing the existing .txt files to match this advice seems entirely appropriate.

@swinslow - any concerns with this?

Mar 24 '23 17:03 goneall

I found this issue because of differences between what different languages consider to be whitespace. Seems that not all languages consider non-breaking space to be part of the POSIX space regular expression character class.

@jamin-aws-ospo Makes sense. I had to apply custom coding to handle this (and several other non-printable characters) in the SPDX license matching tools used by the SPDX online tools.

I had to make those changes anyway since the text being compared to (as apposed to the license text) also contained these characters.

I essentially normalized the to sides of the text being compared before applying any regular expressions or token based matching algorithms.

It definitely slows down the match, but it solves the problem.

Mar 24 '23 17:03 goneall

I essentially normalized the to sides of the text being compared before applying any regular expressions or token based matching algorithms.

I'm doing a similar normalization here, and comparing (what should be) the same normalization in two different languages. Which is how I found this issue.

Mar 24 '23 17:03 jamin-aws-ospo

I'm not sure I understand what the issue is. While in some cases the non-breaking space is probably an artifact of some copy-pasting, in other cases it's completely corret. For example, in the cases where the text is in French and non-breaking space is next to guillemets. Since we don't "ASCII-fy" the text, and we leave curly quotes, different dashes, etc., I don't see why these spaces pose an issue.

I'd be more interested to learn which environments do not consider this character as a space, since this is definitely a bug.

Mar 25 '23 18:03 zvr

@swinslow - any concerns with this?

@goneall No problem from my side if someone wants to submit a PR to replace non-breaking spaces with regular space characters.

To @zvr's point, downstream tools are going to have to take into account the fact that license texts are messy -- even the rare few that in fact have a license-steward-maintained "canonical" plain text version. So while we can clean up the plain-text versions used for the test text files for license-list-XML, anyone using them should be aware of what they are and what they aren't, and will likely have to handle special characters appropriately when encountering them in practice.

Mar 26 '23 15:03 swinslow

I'd be more interested to learn which environments do not consider this character as a space, since this is definitely a bug.

Ruby's regex \s+ does not match it (which is how I found it), while [[:space:]]+ does.

Mar 27 '23 16:03 jamin-aws-ospo

To be fair, this is documented: copying from https://ruby-doc.org/3.2.1/Regexp.html

/\s/ - character class: /[ \t\r\n\f\v]/

Mar 27 '23 20:03 zvr

@zvr What do you think about updating text files to normalize the spaces?

If we do normalize the spaces, we should also update the CI to check for these characters in the test files.

Mar 27 '23 20:03 goneall

Created a pull request that removes these characters: #1901

Mar 27 '23 20:03 jamin-aws-ospo

As I wrote above

Since we don't "ASCII-fy" the text, and we leave curly quotes, different dashes, etc., I don't see why these spaces pose an issue.

Why treat these as special?

Mar 27 '23 21:03 zvr

My pitch from the sidelines would be because they are invisible to even a technical user until they discover the bugs that these invisible characters lead to. They provide no value, just cost.

Fully agreed that valuable characters, for example non-breaking spaces next to guillemets, should be kept. If any of those are being removed in this PR, please call them out.

Mar 28 '23 16:03 hyandell

license-list-XML license-list-XML copied to clipboard

non-breaking spaces present in several text versions of licenses

license-list-XML
license-list-XML copied to clipboard