dnfile
fix: allow UTF-16 surrogates to be passed through
The UserStrings can be UTF-16-LE encoded values with "odd" surrogate code points. Per the Wikipedia page on UTF-16:
The official Unicode standard says that no UTF forms, including UTF-16, can encode the surrogate code points. Since these will never be assigned a character, there should be no reason to encode them. However, Windows allows unpaired surrogates in filenames and other places, which generally means they have to be supported by software in spite of their exclusion from the Unicode standard.
With this change we at least get such strings back as valid Python str values, rather than the string being omitted.
Can you share a file that contains such a string? I want to add a test case.
I do not think surrogatepass is the right choice for everyone. What about replace or backslashreplace? I would rather give people the option to choose an error handler. What would you think of changing your PR to add an errors="strict" argument to UserString.__init__() and to UserStringHeap.get(), and propagating it to the line that you changed instead of hardcoding surrogatepass? This would preserve the current behavior and still let you use surrogatepass in your application.
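For reference, the handlers mentioned above can be compared with a minimal standalone sketch. It uses only Python's built-in codec machinery and the bytes from the round-trip example later in this thread; try_decode is a hypothetical helper, not part of dnfile.

```python
# Bytes from the round-trip example in this thread: an unpaired high
# surrogate (U+D8B3) followed by one ordinary character (U+F9B5).
raw = b"\xb3\xd8\xb5\xf9"


def try_decode(data: bytes, errors: str) -> str:
    """Decode UTF-16-LE bytes, reporting a failure instead of raising."""
    try:
        return data.decode("utf-16-le", errors=errors)
    except UnicodeDecodeError as exc:
        return f"<failed: {exc.reason}>"


for handler in ("strict", "replace", "backslashreplace", "surrogatepass"):
    # repr() keeps the lone surrogate printable as an escape sequence.
    print(f"{handler:>16}: {try_decode(raw, handler)!r}")
```

Only surrogatepass keeps the original code point. replace substitutes U+FFFD for the two offending bytes, and backslashreplace turns them into literal \xNN escapes, so neither of those can be encoded back to the original bytes.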
OK, I added an error_handler: str = "strict" parameter to the UserString constructor as well as to UserStringHeap#get.
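A rough sketch of how such a parameter could be threaded through, assuming a simplified heap that maps offsets to raw UTF-16-LE bytes. HeapSketch and decode_user_string are hypothetical names for illustration and do not reflect dnfile's actual classes; returning None on failure is an assumption standing in for "omitting the string".

```python
from typing import Dict, Optional


def decode_user_string(raw: bytes, error_handler: str = "strict") -> Optional[str]:
    """Decode raw UTF-16-LE bytes; return None if decoding fails (assumed behavior)."""
    try:
        return raw.decode("utf-16-le", errors=error_handler)
    except UnicodeDecodeError:
        return None


class HeapSketch:
    """Hypothetical stand-in for a user-string heap keyed by offset."""

    def __init__(self, items: Dict[int, bytes]) -> None:
        self._items = items

    def get(self, index: int, error_handler: str = "strict") -> Optional[str]:
        # The caller's choice is passed straight to the decoder,
        # so the default stays "strict".
        return decode_user_string(self._items[index], error_handler)


heap = HeapSketch({1: b"\xb3\xd8\xb5\xf9"})
print(heap.get(1))                                       # default handler
print(repr(heap.get(1, error_handler="surrogatepass")))  # opt-in handler
```

With the default left at "strict", existing callers see no change, and an application that needs the unpaired surrogate opts in per call.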
I'm trying to add test coverage to match, but I'm unsure how the exe fixtures are generated.
Here's an example of a Unicode string with an unpaired surrogate: "\ud8b3例"
Here's a round-tripping example:
>>> "\ud8b3例"
'\ud8b3例'
>>> "\ud8b3例".encode('utf-16-le')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-le' codec can't encode character '\ud8b3' in position 0: surrogates not allowed
encoding with 'utf-16-le' codec failed
>>> "\ud8b3例".encode('utf-16-le', errors='surrogatepass')
b'\xb3\xd8\xb5\xf9'
>>> "\ud8b3例".encode('utf-16-le', errors='surrogatepass').decode('utf-16-le')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dbushong/.local/share/uv/python/cpython-3.12.9-macos-aarch64-none/lib/python3.12/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 0-1: illegal UTF-16 surrogate
decoding with 'utf-16-le' codec failed
>>> "\ud8b3例".encode('utf-16-le', errors='surrogatepass').decode('utf-16-le', errors='surrogatepass')
'\ud8b3例'