Python bindings: use filesystem encoding instead of default encoding for pathnames
In python, there are two encodings to worry about when converting between str and bytes.
- The "default encoding" is utf-8 with strict error handling (it may be something other than utf-8 sometimes, but I couldn't find examples).
- The filesystem encoding has special handling that varies between Windows and non-Windows systems. It's intended for handling pathnames and environment variables.
Currently, libtorrent's python bindings use the default encoding everywhere when translating between str and the underlying std::string / char *. The default encoding fits the majority of cases, but isn't 100% correct on any system.
We should use the filesystem encoding instead, in at least the following places:
- [ ]
torrent_info(pathname) - [ ]
add_torrent_params.save_path - [ ]
set_piece_hashes(..., pathname) - [ ]
add_files(..., pathname) - [ ]
torrent_handle.move_storage(pathname) - [ ]
torrent_handle.set_ssl_certificate(cert_path, privkey_path, dhparams_path)(I think these are pathnames, anyway) - [ ]
torrent_status.save_path
(I'm finished writing new unit tests and won't add any more items to this issue)
In python, the filesystem encoding is available via os.fsencode() and os.fsdecode().
The usage impact will be that some data fields will now accept or return values that don't work with the default encoding. Code like this may now fail after this change, for edge-case pathnames on some systems:
try:
my_add_torrent_params.save_path = user_input_path
except UnicodeEncodeError:
show_error_message()
return
do_something(user_input_path.encode()) # raises UnicodeEncodeError, where it didn't before
This is a fringe case, but it technically makes this a breaking change.
What's the filesystem encoding?
Python's filesystem functions can accept or return either str or bytes, on any system.
open("foo.txt")is the same asopen(b"foo.txt")os.listdir(".")returns a list ofstrnamesos.listdir(b".")returns a list ofbytesnames
On non-Windows systems, pathnames are just byte strings, and are encoding-agnostic. In order to work with str, python attempts to decode pathnames with the default encoding (usually utf-8), but any non-decodable bytes are mapped to surrogates. So if a directory contains a file named b"\xff.txt", then os.listdir(".") returns ["\udcff.txt"]. This decoding is reversed elsewhere, so that open("\udcff.txt") will look for a file named b"\xff.txt".
Python's default strict error handling treats any surrogate codepoints as an error. "\udcff".encode() will raise UnicodeEncodeError.
On Windows, since 3.6, python internally uses the *W (not *A) APIs. str is a sequence of codepoints, so str pathnames are converted directly to/from the native unicode representation. bytes pathnames are accepted and returned as utf-8 (not utf-16). Here, utf-8 is unconditional, the "default encoding" is ignored. There is one subtlety: Windows doesn't validate surrogates,. so it's possible to have a file named "\udcff.txt". It's important to use the filesystem encoding even on Windows.
I updated this, and some other bugs, after some consideration.
I've been fuzzy on exactly where we should expect pathnames to be sanitized, where we should expect "unsanitized" pathnames from torrent files, and where should expect filesystem paths with semantics specific to the local system. There's not always specific documentation so I've been trying to piece it together from existing behavior and discussions.
My current understanding is:
torrent_info,file_storageandcreate_torrent(should) always sanitize their inputs. All strings are sanitized to utf-8. Pathnames are specially sanitized in a platform-agnostic way, except that on windows\is used as a path separator, otherwise it's/.- There's no access to unsanitized paths from torrent files, except
bdecode(torrent_info.metadata())or similar. - Pathnames elsewhere (notably
add_torrent_params.save_pathortorrent_status.save_path) are not sanitized. They're filesystem path strings, with system-specific semantics.
Is this accurate? Or, should it be accurate?
If torrent_info and friends always sanitize inputs to utf-8, the python bindings should probably not allow bytes as input.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
unstale
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
unstale
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
re-reading this ticket. I would expect that all strings on the python side are str and always contain unicode code points (in some encoding). Just because that string happens to represent a filename shouldn't really matter until you pass it to a function that interacts with the filesystem.
In your examples, the strings are being passed to libtorrent, which expects UTF-8 strings. It's then libtorrent's concern to make sure that it either uses a unicode or UTF-8 enabled API to interact with files, or converts the strings to the native code-page before accessing files.
I don't see how this would affect the python binding.
Here's an example of the problem:
A Linux user wants to use a save_path of /mnt/André.
But part of the pathname was created on a latin-1 computer. The é is really 0xE9. The path is not valid utf-8.
Python supports manipulating paths as str. On Linux, it does this by escaping the non-utf-8 parts as unicode surrogates. For the above example, the str looks like this:
>>> import os
>>> os.listdir("/mnt") # mapping happens transparently in fs functions
['Andr\udce9']
>>> os.fsdecode(b"/mnt/Andr\xE9") # explicitly translating strings
'/mnt/Andr\udce9'
This exploits the fact that a str is not a utf-8 string, nor a sequence of characters. It's a sequence of codepoints. The surrogates don't represent characters, they're used for utf-16 encoding.
But naked surrogates aren't allowed to exist in utf-8. This string can't be converted directly.
>>> "/mnt/Andr\udce9".encode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 9: surrogates not allowed
Boost.python is transparently trying to do the above conversion when assigning a str:
>>> import libtorrent
>>> atp = libtorrent.add_torrent_params()
>>> atp.save_path = "/mnt/Andr\udce9"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 9: surrogates not allowed
Assigning the bytes form works, but then boost.python tries to do utf-8 decoding when accessing the value:
>>> atp.save_path = b"/mnt/Andr\xe9"
>>> atp.save_path
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 9: unexpected end of data
This is why the mapping must be changed in the python bindings.
(This problem also exists on Windows; the "unicode APIs" have caveats so the mapping from pathnames to str is not plain utf-8 conversion)
My understanding is that, since python 3, all strings are Unicode strings. Set aside encoding, it's a sequence of code points. Is that an incorrect understanding?
Whether python encodes these as UTF-8, UTF-16 or UTF-32 internally doesn't really matter, right? That's an implementation detail. (Unless you're concerned about code points that cannot be represented by UTF-16, but it doesn't sound like that's the case).
You mention surrogates, which is a way of encoding certain code points in UTF-16. I don't understand how surrogates can leak out from the abstraction of the Unicode string. Any surrogate pair in UTF-16 can be represented in UTF-8 and UTF-32.
My understanding is that, since python 3, all strings are Unicode strings. Set aside encoding, it's a sequence of code points. Is that an incorrect understanding?
Whether python encodes these as UTF-8, UTF-16 or UTF-32 internally doesn't really matter, right? That's an implementation detail.
A str/PyUnicodeObject is a sequence of code points. Its internal representation is opaque.
The issue isn't the internal representation, but the converted output.
When converting from PyUnicodeObject to std::string, you must choose an encoding. There are two functions to care about: PyUnicode_AsUTF8String and PyUnicode_EncodeFSDefault. Currently boost.python always uses PyUnicode_AsUTF8String.
The right encoding to use depends on the context of the string. If its usage is a filesystem path or environment variable, then it will have been created with the filesystem encoding, so you must use the same encoding when converting to bytes.
I don't understand how surrogates can leak out from the abstraction of the Unicode string
A str is a sequence of code points. Not characters. Any code point is valid, including surrogates like U+DCE9.
The filesystem encoding will produce such strings. As I wrote above:
>>> import os
>>> os.listdir("/mnt") # result may contain surrogates, due to fs encoding
['Andr\udce9']
A str is a sequence of code points. Not characters. Any code point is valid, including surrogates like U+DCE9.
DCE9 is a high surrogate code unit in UTF-16 encoding. If that integer appears as a code point it falls in a private use code page.
The DCE9 code point can also be represented in UTF-8. The problem is that it's private use, so nobody knows what it means.
As I understand it, python ascribes its own meaning to (at least some) of these private use code points to allow round-tripping of byte sequences that are not valid UTF-8. Wikipedia links to this PEP.
In your example, though, the filename is a string that can be represented by unicode without loss. Presumably this issue would only come up if you truly take advantage of linux's feature of "filenames are just a bunch of bytes". Most software still interprets those bytes as UTF-8 and I would expect that to work well the vast majority of time.
Certainly libtorrent's internal representation of paths expects filenames to be representable by unicode.
In your example, though, the filename is a string that can be represented by unicode without loss.
My premise is the case where that's not true. The save_path is b"/mnt/Andr\xE9". Not valid UTF-8.
Presumably this issue would only come up if you truly take advantage of linux's feature of "filenames are just a bunch of bytes". Most software still interprets those bytes as UTF-8 and I would expect that to work well the vast majority of time.
I don't understand what you mean by "take advantage of". Paths on Linux are encoding-agnostic. It's a fact that can't be bypassed.
Certainly libtorrent's internal representation of paths expects filenames to be representable by unicode.
Are you talking about paths within a torrent? The topic is add_torrent_params::save_path, which refers to the filesystem, not torrent data. I can't find any place where libtorrent requires this to be unicode (see my example above where I set it to a non-utf-8 bytes value without error, but fail to access the value afterward).
My premise is the case where that's not true. The save_path is b"/mnt/Andr\xE9". Not valid UTF-8.
It's fine to not be valid UTF-8 as long as it is representable by unicode. Latin-1 "é" can be represented in unicode. So, if the system's locale is Latin-1, I would expect that filename to be converted to unicode without any problems.
I don't understand what you mean by "take advantage of".
I mean create a file or directory with a name that is not representable in unicode
I can't find any place where libtorrent requires this to be unicode
in libtorrent, the save_path is concatenated with filenames from the torrent file and then converted either to UTF-16 (on windows) or the system's locale (most commonly UTF-8) before being passed to the file functions. This is done with the convert_to_native()-family of functions.
in libtorrent, the
save_pathis concatenated with filenames from the torrent file and then converted either to UTF-16 (on windows) or the system's locale (most commonly UTF-8) before being passed to the file functions.
Ah, I didn't know about this.
So libtorrent expects pathnames (like save_path) to be in UTF-8, and transparently converts them to the system locale? Do I have that right?
It's really surprising for any library to do this. I would expect locale conversion should always be handled at the app level. Transparent conversion means the app might display one path to the user, but the library munges it to something different.
It's doubly surprising since I don't see this anywhere in libtorrent's documentation. This seems important to know.
What's the expected behavior if the app passes a save_path that's not UTF-8?
How should the app refer to an existing path that is invalid in the current locale? For example if the system locale is UTF-8, but the user wants to load an existing torrent rom b"/mnt/Andr\xe9".
It looks like this is the default behavior in Linux. Is that right?
I tried saving a torrent to b"/tmp/Andr\xe9" with a python script, using the pypi libtorrent build. On my system it did create the path b"/tmp/Andr\xe9", though I expected some conversion failure. Is this expected behavior?
Python script and output
import libtorrent as lt
import os
import os.path
import hashlib
def main() -> int:
data = b"\0" * 1024
fs = lt.file_storage()
fs.add_file("test.txt", 1024)
ct = lt.create_torrent(fs)
ct.set_hash(0, hashlib.sha1(data).digest())
atp = lt.add_torrent_params()
atp.save_path = b"/tmp/Andr\xe9"
atp.ti = lt.torrent_info(ct.generate())
session = lt.session()
handle = session.add_torrent(atp)
while True:
status = handle.status()
if status.state not in (
lt.torrent_status.states.checking_resume_data,
lt.torrent_status.states.checking_files,
):
break
if status.errc.value() != 0:
print(status.errc.message())
return 1
handle.add_piece(0, data, 0)
while True:
if handle.status().state == lt.torrent_status.seeding:
break
for name in os.listdir(b"/tmp"):
if name.startswith(b"Andr"):
print("libtorrent created: ", os.path.join(b"/tmp", name))
return 0
if __name__ == "__main__":
raise SystemExit(main())
Output on my system:
$ python t.py
libtorrent created: b'/tmp/Andr\xe9'
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.