Umlauts in copyrights are removed from output files

Open MMarwedel opened this issue 6 years ago • 13 comments

Hi, when scanning files with umlauts, the umlauts are converted to their non-umlaut forms. It would be better to keep them in their original form.

Sample file: https://chromium.googlesource.com/native_client/nacl-newlib/+/master/newlib/libc/time/strptime.c

Output:

  "holders": [
    {
      "value": "Kungliga Tekniska Hogskolan (Royal Institute of Technology, Stockholm, Sweden).",
      "start_line": 2,
      "end_line": 4
    }
  ],
  "copyrights": [
    {
      "value": "Copyright (c) 1999 Kungliga Tekniska Hogskolan (Royal Institute of Technology, Stockholm, Sweden).",
      "start_line": 2,
      "end_line": 4
    }
  ],

The right output would be ... Högskolan ...

MMarwedel avatar May 15 '19 12:05 MMarwedel

@MMarwedel this is indeed the case. What happens is called transliteration, that is, converting Unicode characters to a plain ASCII form. Its effect is to remove umlauts and all other diacritics. I agree this is not perfect, and I am not sure what the rationale was when the decision was made. Since we are porting to Python 3, which uses Unicode by default, it may no longer be an issue in the near future. So I will keep this open until we have completed the port so we can revisit how to fix this then. Would this work for you?
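To illustrate the effect, here is a minimal sketch of one common way to transliterate to ASCII with the Python standard library. The helper name `to_ascii_sketch` is hypothetical, and ScanCode's actual transliteration code may work differently; the sketch only shows how an umlaut ends up dropped.

```python
import unicodedata


def to_ascii_sketch(text):
    """Strip diacritics by decomposing characters and dropping non-ASCII.

    Hypothetical helper for illustration; not ScanCode's implementation.
    """
    # NFKD splits "ö" into "o" followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" drops the combining mark.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii_sketch("Kungliga Tekniska Högskolan"))
```

With this approach, "Högskolan" and "Hogskolan" become indistinguishable after the conversion, which is the behavior reported above.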

pombredanne avatar May 15 '19 16:05 pombredanne

Yes, this would work for me. As I am doing some postprocessing of the results anyway, I can fix the umlauts there for the few cases I found.

MMarwedel avatar May 16 '19 08:05 MMarwedel

If you are doing some post processing it would be best to handle this in here if possible... everyone could then benefit?

pombredanne avatar May 16 '19 09:05 pombredanne

Hmm... while I like to share code, I guess the post-processing may not be in a state that is easily usable for other projects. And my employer would have to agree, too.

MMarwedel avatar May 16 '19 09:05 MMarwedel

your call :)

pombredanne avatar May 16 '19 15:05 pombredanne

Now that the port to Python 3 has happened, what are the chances that this issue gets looked at?

I had a look into the current logic, and it seems that the parse tree breaks when using the non-ASCII string version of a line (obtained, e.g., by changing the to_ascii=True default of the prepare_text_line function in src/cluecode/copyrights.py to False).

E.g. the tree for the line

Copyright (c) 2004-2007 Gerhard Häring

changes from (with to_ascii=True)

(S
  (COPYRIGHT
    Copyright/COPY
    (c)/COPY
    (NAME-YEAR
      (NAME-YEAR
        (NAME-YEAR
          (YR-RANGE (YR-RANGE 2004-2007/YR))
          Gerhard/NNP
          Haring/NNP)))))

to (with to_ascii=False)

(S
  (COPYRIGHT
    Copyright/COPY
    (c)/COPY
    (NAME-YEAR
      (NAME-YEAR
        (NAME-YEAR (YR-RANGE (YR-RANGE 2004-2007/YR)) Gerhard/NNP))))
  Häring/NN)

I could not exactly find out what is breaking here internally but it's a lead at least.

A possible workaround would be to carry the original UTF-8 string along through the process and to allow (maybe via a flag?) writing that original string to the result.json instead of the prepared ASCII string. The internal logic would then stay intact, and umlauts and other non-ASCII characters could still be displayed properly in the result.

Ben-Thelen avatar Jun 11 '21 12:06 Ben-Thelen

With the latest switch from NLTK to Pygmars (https://github.com/nexB/pygmars/), we now have more opportunities to fix lexing and support Unicode all the way. There are several possible reasons why Haring is lexed as NNP and Häring is lexed as NN:

  1. the text may have been converted/transliterated to ASCII in pre-processing, which drops the umlaut
     • For instance, if we cannot obtain a properly decoded Unicode text, we transliterate here: https://github.com/nexB/scancode-toolkit/blob/3f7da81d6b207ac2b1d384defb83a5f2c82216f4/src/textcode/analysis.py#L81
     • As you noticed, the to_ascii transliteration is also the default in https://github.com/nexB/scancode-toolkit/blob/3f7da81d6b207ac2b1d384defb83a5f2c82216f4/src/cluecode/copyrights.py#L3305
  2. once you keep the umlauts, the regexes used for token recognition (aka lexing) at https://github.com/nexB/scancode-toolkit/blob/3f7da81d6b207ac2b1d384defb83a5f2c82216f4/src/cluecode/copyrights.py#L456 are not aware of certain characters, at two levels:
     • we ignored them when creating the initial lexing (not sure this was a conscious choice, though it was making things simpler for sure)
     • until recently we could not easily patch NLTK to support Unicode regexes anyway, but with Pygmars we could easily update https://github.com/nexB/pygmars/blob/1e63804bdb9152971fab21f802e23bb0301abaab/src/pygmars/lex.py#L94 to add a re.UNICODE flag
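The second point can be illustrated with a small sketch of how an ASCII-only character class changes the tag a word receives. The two patterns and the `tag` helper below are hypothetical and much simpler than the real rules in src/cluecode/copyrights.py; they only show the kind of mismatch that turns an NNP into an NN.

```python
import re

# Hypothetical lexer rules, for illustration only: these are NOT the actual
# patterns from src/cluecode/copyrights.py.
ASCII_PROPER_NOUN = re.compile(r"[A-Z][a-z]+")
# [^\W\d_] matches any Unicode letter for str patterns in Python 3.
UNICODE_WORD = re.compile(r"[^\W\d_]+")


def tag(word, unicode_aware=False):
    """Return "NNP" for a capitalized word, else the generic "NN" tag."""
    if unicode_aware:
        is_proper = bool(UNICODE_WORD.fullmatch(word)) and word[0].isupper()
    else:
        is_proper = bool(ASCII_PROPER_NOUN.fullmatch(word))
    return "NNP" if is_proper else "NN"


print(tag("Haring"))                      # NNP
print(tag("Häring"))                      # NN: "ä" is outside [a-z]
print(tag("Häring", unicode_aware=True))  # NNP
```

Under these assumptions, the word is only demoted to NN because the character class enumerates ASCII letters, so widening the class is what restores the NNP tag in this toy version.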

So I made a test with these fixes:

diff --git a/src/pygmars/lex.py b/src/pygmars/lex.py
index f60a9de..7fe50bc 100644
--- a/src/pygmars/lex.py
+++ b/src/pygmars/lex.py
@@ -91,7 +91,10 @@
         """
         try:
             self._matchers = [
-                (re.compile(m).match if isinstance(m, str) else m, label)
+                (
+                    re.compile(m, re.UNICODE).match if isinstance(m, str) else m,
+                    label,
+                )
                 for m, label in matchers
             ]
         except Exception as e:

and:

diff --git a/src/cluecode/copyrights.py b/src/cluecode/copyrights.py
index 74c5293..10214a4 100644
--- a/src/cluecode/copyrights.py
+++ b/src/cluecode/copyrights.py
@@ -3374,14 +3374,14 @@
 remove_man_comment_markers = re.compile(r'.\\"').sub
 
 
-def prepare_text_line(line, dedeb=True, to_ascii=True):
+def prepare_text_line(line, dedeb=True, to_ascii=False):
     """
     Prepare a text ``line`` for copyright detection.
 
     If ``dedeb`` is True, remove "Debian" <s> </s> markup tags seen in
     older copyright files.
 
-    If ``to_ascii`` convert the text to ASCiI characters.
+    If ``to_ascii`` convert the text to ASCII characters.
     """
     # remove some junk in man pages: \(co
     line = (line

and voila!

$ echo " * Copyright (c) 1999 Kungliga Tekniska Högskolan
>  * (Royal Institute of Technology, Stockholm, Sweden). 
>  * All rights reserved." > baz
$ scancode -c --json-pp - baz
Setup plugins...
Collect file inventory...
Scan files for: copyrights with 1 process(es)...
[####################] 0             
{
  "headers": [
    {
      "tool_name": "scancode-toolkit",
      "tool_version": "21.6.7",
      "options": {
        "input": [
          "baz"
        ],
        "--copyright": true,
        "--json-pp": "-"
      },
      "notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
      "start_timestamp": "2021-07-07T092932.468312",
      "end_timestamp": "2021-07-07T092932.580056",
      "duration": 0.11177682876586914,
      "message": null,
      "errors": [],
      "extra_data": {
        "files_count": 1
      }
    }
  ],
  "files": [
    {
      "path": "baz",
      "type": "file",
      "copyrights": [
        {
          "value": "Copyright (c) 1999 Kungliga Tekniska H\u00f6gskolan (Royal Institute of Technology, Stockholm, Sweden)",
          "start_line": 1,
          "end_line": 2
        }
      ],
      "holders": [
        {
          "value": "Kungliga Tekniska H\u00f6gskolan (Royal Institute of Technology, Stockholm, Sweden)",
          "start_line": 1,
          "end_line": 2
        }
      ],
      "authors": [],
      "scan_errors": []
    }
  ]
}
Scanning done.
Summary:        copyrights with 1 process(es)
Errors count:   0
Scan Speed:     9.14 files/sec. 
Initial counts: 1 resource(s): 1 file(s) and 0 directorie(s) 
Final counts:   1 resource(s): 1 file(s) and 0 directorie(s) 
Timings:
  scan_start: 2021-07-07T092932.468312
  scan_end:   2021-07-07T092932.580056
  scan: 0.11s
  total: 0.12s
Removing temporary files...done.


>>> print("Copyright (c) 1999 Kungliga Tekniska H\u00f6gskolan (Royal Institute of Technology, Stockholm, Sweden)")
Copyright (c) 1999 Kungliga Tekniska Högskolan (Royal Institute of Technology, Stockholm, Sweden)
>>> 
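The \u00f6 in the JSON output is only an escape sequence for "ö", so any JSON parser gives the umlaut back. As a small sketch using Python's standard json module (the `raw` fragment below is a shortened, made-up stand-in for the scan output):

```python
import json

# A fragment shaped like the scan output above, with the umlaut escaped.
raw = '{"value": "Kungliga Tekniska H\\u00f6gskolan"}'
data = json.loads(raw)

# Parsing decodes the escape back to the real character.
print(data["value"])

# json.dumps escapes non-ASCII characters by default ...
print(json.dumps(data))
# ... and writes them literally with ensure_ascii=False.
print(json.dumps(data, ensure_ascii=False))
```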

pombredanne avatar Jul 07 '21 09:07 pombredanne

This is still an issue with ScanCode 31.2.1. The statement "Copyright (C) 2011 Felix Geisendörfer" appears as

  "copyrights": [
    {
      "copyright": "Copyright (c) 2011 Felix Geisendorfer",
      "start_line": 1,
      "end_line": 1
    }
  ],

in the result. This is problematic, as in the worst case it might even have legal implications if both a "Felix Geisendörfer" and a "Felix Geisendorfer" exist.

sschuberth avatar Oct 19 '22 15:10 sschuberth

@sschuberth We still have not addressed this ... I have made some tests in the past in https://github.com/nexB/scancode-toolkit/commit/d35e308d0137630be2a1d349c4a10341d1d886ec but there were too many induced issues to complete this.

An alternative could be an "oe" and "ae" transliteration for German... do you think this could work out (I am not saying this could be simpler, btw)? ... but then, there are all the other languages.

pombredanne avatar Oct 20 '22 08:10 pombredanne

An alternative could be an "oe" and "ae" transliteration for German... do you think this could work out

No, I don't think so, because "oe" is not really equivalent to "ö" in the German language. It's just a workaround if you have to stick to ASCII characters, but strictly (as in, legally) speaking, e.g. "Möller", "Moeller" and "Moller" are all different (and valid) family names.

sschuberth avatar Oct 20 '22 09:10 sschuberth

As a follow-up question: In

  "copyright": "Copyright (c) 2011 Felix Geisendorfer",
  "start_line": 1,
  "end_line": 1

is "copyright" always a full line match? Because if so, we could probably do some hacky post-processing that goes over all copyright findings and re-extracts the lines from the real files to get the original string.

sschuberth avatar Oct 20 '22 09:10 sschuberth

re: https://github.com/nexB/scancode-toolkit/issues/1566#issuecomment-1285190832

is "copyright" always a full line match? Because if so, we could probably do some hacky post-processing that goes over all copyright findings and re-extracts the lines from the real files to get the original string.

No, there are cases where a statement spans multiple lines and many cases where what is before and after a copyright statement is not part of the copyright at all. That being said, we could find a way to get back to the original unprocessed text, but the difficulty is that words and letters do not align one-to-one between the original text and its transliteration. For now, we track neither positions nor offsets, be it of characters or of words.
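The alignment problem can be seen in a small sketch. It assumes a simple NFKD-based transliteration (the hypothetical `to_ascii_sketch` helper below, which is not ScanCode's actual code): some characters disappear and others expand, so character offsets in the transliterated text no longer point at the same place in the original.

```python
import unicodedata


def to_ascii_sketch(text):
    """Hypothetical NFKD-based transliteration, for illustration only."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# One input character can map to zero, one or two output characters.
for char in ("\u00f6", "\u00a9", "\ufb01", "\u2122"):
    print(repr(char), "->", repr(to_ascii_sketch(char)))

original = "\u00a9 2011 Felix Geisend\u00f6rfer"
plain = to_ascii_sketch(original)

# The same word starts at a different offset in each version.
print(original.index("Felix"), plain.index("Felix"))
```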

The general approach is roughly:

  • transliterate and/or extract strings (for binaries)
  • collect lines of text from that
  • identify regions of lines that may contain copyright/authors
  • for each region tokenize text in words
  • lex tokens to recognize and tag token sequences (such as a name, date range, copyright sign, etc.)
  • parse token sequences with a grammar to recognize actual copyright/author statements
  • do various misc post detection cleanups
  • return the statements (and holders separately)

As I said, we never track positions or offsets in the original text (which could be a binary). This would be technically possible, but tracking them carries a big overhead. We only track line numbers.
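Given that only line numbers are available, a post-processing workaround outside of ScanCode could look roughly like the sketch below. Everything in it is an assumption: `to_ascii_sketch` and `restore_words` are hypothetical helpers, the transliteration is a simple NFKD-based one rather than ScanCode's, and the matching is word by word on whitespace, so it misses words that were split, merged or cleaned up during detection.

```python
import unicodedata


def to_ascii_sketch(text):
    """Hypothetical NFKD-based transliteration, for illustration only."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def restore_words(detected, lines, start_line, end_line):
    """Swap ASCII words in ``detected`` for their original spelling.

    ``lines`` are the lines of the scanned file; ``start_line`` and
    ``end_line`` are the 1-based, inclusive line numbers from the scan.
    """
    region = lines[start_line - 1:end_line]
    originals = {}
    for line in region:
        for word in line.split():
            plain = to_ascii_sketch(word)
            # Only remember words that transliteration actually changed.
            if plain and plain != word:
                originals.setdefault(plain, word)
    return " ".join(originals.get(word, word) for word in detected.split())


lines = ["/*", " * Copyright (C) 2011 Felix Geisendörfer", " */"]
print(restore_words("Copyright (c) 2011 Felix Geisendorfer", lines, 2, 2))
```

A heuristic like this cannot tell which spelling is meant if both "Geisendörfer" and "Geisendorfer" appear in the same region, so it does not remove the need for a fix in the detection itself.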

pombredanne avatar Oct 20 '22 09:10 pombredanne

I just happened to stumble upon this when scanning https://sourceforge.net/p/docutils/code/9561/tree/trunk/docutils/docutils/parsers/commonmark_wrapper.py as well.

stefan6419846 avatar Sep 29 '25 12:09 stefan6419846