utf-8 in dep5 file results in errors on windows

Open thbde opened this issue 2 years ago • 1 comments

The following description can be reproduced via: https://github.com/thbde/reuse-utf8-dep5-issue

Assume that we use the dep5 file to declare the licensing information for a repository. Furthermore, the dep5 file contains non-ASCII characters encoded as UTF-8.

In that case, reuse will fail if we execute: reuse download --all

And the reported errors are rather confusing:

# Windows 10 64 bit
$ python --version
Python 3.10.1
$ reuse --version
reuse 0.14.0
$ python -c 'import locale;print(f"{locale.getpreferredencoding()=}")'
locale.getpreferredencoding()='cp1252'

$ reuse download --all
reuse.report - ERROR - Could not read 'call_open.py'
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 321: character maps to <undefined>
reuse.report - ERROR - Could not read 'file1.txt'
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 321: character maps to <undefined>
reuse.report - ERROR - Could not read 'file2.txt'
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 321: character maps to <undefined>
reuse.report - ERROR - Could not read 'file3.txt'
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 321: character maps to <undefined>
reuse.report - ERROR - Could not read 'output_reuse_call.txt'
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 321: character maps to <undefined>

Taken at face value, the errors claim the following:

  1. There is a UnicodeDecodeError in each file. But none of these files contains any non-ASCII characters; all of them are pure ASCII.
  2. The error is at the exact same byte position in every file. How can this be?

The root cause (after quite some debugging) is the open call for the dep5 file: https://github.com/fsfe/reuse-tool/blob/60c0986bb24a3b482ee0527e9195f7c23cadb003/src/reuse/project.py#L216-L219

  1. The file is opened via open(), which uses a platform-dependent default encoding (locale.getpreferredencoding(), which is cp1252 on Windows).
  2. Deep inside several nested function calls, the Copyright class iterates over the file object:
for element in fd:
  ...

(more precisely, here: https://salsa.debian.org/python-debian-team/python-debian/-/blob/278e016f5ed8d3ed4fa17d7c30b54c149d428808/lib/debian/deb822.py#L759 )

This iteration implicitly reads and decodes the file (via the iterator that fd implements), and that is where the decoding error is raised. You can also reproduce that part via https://github.com/thbde/reuse-utf8-dep5-issue/blob/90f323b6b2af6b94fd3a461a986898b48ec5c6c9/call_open.py
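As a minimal, self-contained sketch of that failure mode (this is not the code from the linked repository; the file name, the sample text and the helper `read_lines` are made up for illustration), the following writes a UTF-8 file containing a character whose encoding includes the byte 0x90 and then iterates over it with two different codecs:

```python
import tempfile
from pathlib import Path


def read_lines(path, encoding):
    """Iterate over the file object, like the `for element in fd` loop."""
    with open(path, encoding=encoding) as fd:
        return [line for line in fd]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "dep5"  # hypothetical stand-in for .reuse/dep5
        # U+0110 is encoded in UTF-8 as C4 90; 0x90 is undefined in cp1252.
        path.write_bytes("Copyright: 2022 \u0110or\u0111e\n".encode("utf-8"))
        try:
            read_lines(path, "cp1252")  # what open() defaults to on this Windows setup
            failed_with_cp1252 = False
        except UnicodeDecodeError:
            failed_with_cp1252 = True
        lines_with_utf8 = read_lines(path, "utf-8")
    return failed_with_cp1252, lines_with_utf8


failed_with_cp1252, lines_with_utf8 = demo()
print(failed_with_cp1252, lines_with_utf8)
```

Passing the encoding explicitly makes the result independent of the platform's locale.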

Now, this error is not reported as such due to how reuse deals with the file analysis.

The shown error comes from here: https://github.com/fsfe/reuse-tool/blob/60c0986bb24a3b482ee0527e9195f7c23cadb003/src/reuse/report.py#L207-L210

result is created by mapping a method to all files: https://github.com/fsfe/reuse-tool/blob/60c0986bb24a3b482ee0527e9195f7c23cadb003/src/reuse/report.py#L197

And that method, a call to the constructor of _MultiprocessingContainer, just swallows the full error message in a _MultiprocessingResult: https://github.com/fsfe/reuse-tool/blob/60c0986bb24a3b482ee0527e9195f7c23cadb003/src/reuse/report.py#L37-L46

which records the path of the file that is currently being analyzed.
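The effect can be sketched without any reuse code. In the following simplified illustration (all function names are hypothetical, and the real implementation uses multiprocessing, which is left out here), every per-file analysis touches the same shared, broken resource, and the wrapper stores whatever exception occurs next to the path of the file being processed:

```python
def load_shared_metadata():
    # Stand-in for parsing the dep5 file with the wrong codec:
    # 0x90 is undefined in cp1252, so this always raises UnicodeDecodeError.
    return b"\x90".decode("cp1252")


def analyze(path):
    # Each per-file analysis needs the shared metadata first.
    load_shared_metadata()
    return f"{path}: ok"


def analyze_all(paths):
    results = []
    for path in paths:
        try:
            results.append((path, analyze(path), None))
        except Exception as err:  # the error is stored under the current path
            results.append((path, None, err))
    return results


results = analyze_all(["file1.txt", "file2.txt", "file3.txt"])
for path, _, err in results:
    print(f"Could not read {path!r}: {type(err).__name__}: {err}")
```

Every file gets the identical error at the identical position, although none of the files was read at all.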

In conclusion, I see two issues here:

  1. The open() call without an encoding
    • I will send a PR to fix this
    • Maybe someone wants to consider testing with PEP 597, which is available as of Python 3.10
  2. The way error reporting is organized, which attributes all errors to the given file although the file may not be the culprit.
    • This is an architecture decision for which I cannot send a PR because I am not familiar enough with the project. I would appreciate it if this were considered for improvement, if only for the sanity of future users who need to debug errors here, such as myself.
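For the PEP 597 suggestion, a rough sketch of what such a check could look like: Python 3.10+ accepts the `-X warn_default_encoding` option, which emits an EncodingWarning when a file is opened in text mode without an explicit encoding. The helper `default_encoding_warnings` and the temporary file are made up for this example; how a project would wire this into its test suite is left open.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def default_encoding_warnings(snippet, path):
    """Run `snippet` in a child interpreter with PEP 597 warnings enabled."""
    proc = subprocess.run(
        [sys.executable, "-X", "warn_default_encoding", "-c", snippet, str(path)],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=False,
    )
    return proc.stderr


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "dep5"  # hypothetical sample file
    target.write_text("Format: example\n", encoding="utf-8")
    # open() without an encoding: expected to trigger EncodingWarning on 3.10+.
    implicit = default_encoding_warnings(
        "import sys; open(sys.argv[1]).close()", target
    )
    # open() with an explicit encoding: the call itself should not warn.
    explicit = default_encoding_warnings(
        "import sys; open(sys.argv[1], encoding='utf-8').close()", target
    )

print("implicit:", implicit.strip())
print("explicit:", explicit.strip())
```

On interpreters older than 3.10 the option is ignored and no warning appears.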

thbde avatar Jan 07 '22 11:01 thbde

With latest code using reuse spdx I still get this:

FileName: ./gradlew
SPDXID: SPDXRef-dc243f792038baeebbca36717e0d2288
FileChecksum: SHA1: 0e59ccf04f8db22729ebef7ee39517a9e3a80c9d
LicenseConcluded: NOASSERTION
LicenseInfoInFile: Apache-2.0
FileCopyrightText: <text>Copyright © 2015-2021 the original authors.
Copyright © 2015-2021 the original authors.</text>

The correct one is from the file itself and the broken one is from .reuse/dep5

Note that if I add a UTF-8 BOM to .reuse/dep5, then it fails with:

.reuse/dep5 has syntax errors
Traceback (most recent call last):
  File "c:\program files\python\lib\site-packages\reuse\project.py", line 219, in _copyright
    self._copyright_val = Copyright(fp)
  File "c:\program files\python\lib\site-packages\debian\copyright.py", line 156, in __init__
    self.__header = Header(paragraphs[0])
  File "c:\program files\python\lib\site-packages\debian\copyright.py", line 666, in __init__
    'input is not a machine-readable debian/copyright')
debian.copyright.NotMachineReadableError: input is not a machine-readable debian/copyright
.reuse/dep5 has syntax errors
Traceback (most recent call last):
  File "c:\program files\python\lib\site-packages\reuse\project.py", line 219, in _copyright
    self._copyright_val = Copyright(fp)
  File "c:\program files\python\lib\site-packages\debian\copyright.py", line 156, in __init__
    self.__header = Header(paragraphs[0])
  File "c:\program files\python\lib\site-packages\debian\copyright.py", line 666, in __init__
    'input is not a machine-readable debian/copyright')
debian.copyright.NotMachineReadableError: input is not a machine-readable debian/copyright
.reuse/dep5 has syntax errors
Traceback (most recent call last):
  File "c:\program files\python\lib\site-packages\reuse\project.py", line 219, in _copyright
    self._copyright_val = Copyright(fp)
  File "c:\program files\python\lib\site-packages\debian\copyright.py", line 156, in __init__
    self.__header = Header(paragraphs[0])
  File "c:\program files\python\lib\site-packages\debian\copyright.py", line 666, in __init__
    'input is not a machine-readable debian/copyright')
debian.copyright.NotMachineReadableError: input is not a machine-readable debian/copyright
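One plausible explanation for the BOM failure (an assumption, not verified against python-debian) is that the BOM bytes end up glued to the first field name, so the parser no longer finds the header field it expects. The sketch below only shows what the first field name looks like under different codecs; the helper `first_field_name` and the sample header line are made up for illustration.

```python
# A BOM (EF BB BF) followed by a typical first header line of a dep5 file.
BOM_PLUS_HEADER = (
    b"\xef\xbb\xbf"
    b"Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/\n"
)


def first_field_name(raw, encoding):
    """Decode the bytes and return the text before the first colon."""
    first_line = raw.decode(encoding).splitlines()[0]
    return first_line.split(":", 1)[0]


names = {
    enc: first_field_name(BOM_PLUS_HEADER, enc)
    for enc in ("cp1252", "utf-8", "utf-8-sig")
}
for enc, name in names.items():
    print(f"{enc:10} -> {name!r}")
```

Only the "utf-8-sig" codec strips the BOM, so it is the only one of the three that yields a clean "Format" field name.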

ale5000-git avatar May 12 '22 17:05 ale5000-git