griffe

Always use `encoding="utf-8-sig"` when reading text files

Open john-hen opened this issue 7 months ago • 3 comments

Changed the encoding from `utf8` to `utf-8-sig` when reading files, in order to ignore a possible byte-order mark (a.k.a. BOM, code point U+FEFF) at the start of the file.

As per the Python documentation:

> In some areas, it is also convention to use a “BOM” at the start of UTF-8 encoded files; the name is misleading since UTF-8 is not byte-order dependent. The mark simply announces that the file is encoded in UTF-8. For reading such files, use the ‘utf-8-sig’ codec to automatically skip the mark if present.

https://docs.python.org/3/howto/unicode.html#reading-and-writing-unicode-data

So this change won't affect reading UTF8-encoded files without a BOM.
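To illustrate the claim, here is a minimal sketch using only the standard library. It does not touch griffe's code; the variable names are made up for the example. It decodes the same source bytes with and without a BOM, and shows that `utf-8-sig` strips the mark when present and is a no-op otherwise.

```python
import codecs

source = "x = 1\n"

# The same content, once with a leading UTF-8 BOM and once without.
with_bom = codecs.BOM_UTF8 + source.encode("utf-8")
without_bom = source.encode("utf-8")

# Plain UTF-8 keeps the BOM as a leading U+FEFF character.
plain_decoded = with_bom.decode("utf-8")

# utf-8-sig drops the BOM if there is one...
sig_decoded = with_bom.decode("utf-8-sig")

# ...and leaves BOM-less input untouched.
unchanged = without_bom.decode("utf-8-sig")
```

The same codec name can be passed as `encoding="utf-8-sig"` to `open()` or `Path.read_text()`.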

Fixes #386.

john-hen avatar Jun 08 '25 15:06 john-hen

I don't think this PR has anything to do with the reported Mypy errors, unless I'm missing something. (I only ran pytest before submitting.)

john-hen avatar Jun 08 '25 15:06 john-hen

Thanks! You can rebase on main to get rid of the mypy warnings :+1:

pawamoy avatar Jun 08 '25 16:06 pawamoy

Oh, can you please add a test that runs on Windows only, asserting the fix works? It should check that trying to load a BOM'd module with UTF8 raises a LoadingError, while it works with UTF8-SIG.
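A rough, standard-library-only sketch of the behavior such a test would exercise is shown below. It is a stand-in, not the requested test: it uses `ast.parse` instead of griffe's loader, so the failure shows up as a `SyntaxError` rather than a `LoadingError`, and it is not restricted to Windows. The helper names `parse_module` and `bom_roundtrip` are hypothetical.

```python
import ast
import codecs
import tempfile
from pathlib import Path


def parse_module(path, encoding):
    """Read `path` with the given encoding and parse it as Python source."""
    return ast.parse(Path(path).read_text(encoding=encoding))


def bom_roundtrip():
    """Write a BOM-prefixed module, then try both encodings on it."""
    with tempfile.TemporaryDirectory() as tmp:
        module = Path(tmp) / "bom_module.py"
        module.write_bytes(codecs.BOM_UTF8 + b"x = 1\n")

        # With plain UTF-8 the BOM survives as U+FEFF and breaks parsing.
        try:
            parse_module(module, "utf-8")
            plain_failed = False
        except SyntaxError:
            plain_failed = True

        # With utf-8-sig the BOM is skipped and the module parses.
        tree = parse_module(module, "utf-8-sig")
        return plain_failed, len(tree.body)
```

In an actual test, the two branches would presumably become a `pytest.raises(...)` block and a plain assertion against the project's own loading function.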

pawamoy avatar Jun 08 '25 17:06 pawamoy