Python bindings: deprecate some usage of str / bytes
Currently, the python bindings generally allow str as input where std::string or char * is expected internally. It attempts to convert str to bytes with the default encoding.
In some cases, I think this does more harm than good. I think we should deprecate this functionality (and fire DeprecationWarning).
- [x]
bencode(...)should not allowstrin its input, includingdictkeys: this causes unstable round-trips (v != bdecode(bencode(v))) and I can't think of any benefit from it - [x]
torrent_info({"info": ...})should not allowstrin its input: as above - [x]
torrent_handle.add_piece(i, "value")should not acceptstr: any use of this is likely unintentional - [x]
sha1_hash("value")should not acceptstr: any use of this is likely unintentional - [x]
sha1_hash.to_string(): it's hard to imagine this was ever useful, outside unit tests - [x]
generate_fingerprint()should be deprecated, in favor of a version that returnsbytes: I think it's sensible to acceptstrhere, but the return value is intended for use in binary protocols, and client code should handle it as such - [x]
session.dht_get_mutable_item("key", "salt")should not acceptstr - [x]
session.dht_put_mutable_item("private", "public", "data", "salt")should not acceptstr - [x]
torrent_handle.rename_file(..., b"path")should not acceptbytes - ~~
torrent_handle.set_ssl_certificate(..., passphrase="passphrase")should not acceptstr~~ - [x]
file_storage.add_file(b"path", ...)should not acceptbytes - [x]
file_storage.add_file(..., linkpath=b"path")should not acceptbytes - [x]
dht_mutable_item_alert.saltisstr, and should just be changed tobytes. All other members of this alert are broken currently so I can't imagine it would marginally break anyone to just change the binding tobytes - [x]
dht_put_item_alert.saltisstr, and should bebytes, as above
I'm finished writing new unit tests and won't add any more items to this issue.
There's a solid argument that lots of str accessors should be deprecated, like file_storage.file_path() or even torrent_info.name(). These functions access encoding-agnostic byte strings, which may be created with unknown encodings on other systems, so it's usually a mistake to access them as str
However there are cases where str access is appropriate, such as an app creating torrent_info with synthetic data and filenames (not sourced from the filesystem). If the data was sourced from python str, it's totally fair to access it as str. Deprecating str access would be pretty unfriendly to such a case.
I am on the fence about whether we should be unfriendly to a fringe use case, in favor of preventing the huge majority of use cases from doing something wrong.
when libtorrent loads a torrent, it sanitizes file paths/names as well as comment and creator strings. They should be valid UTF-8 strings (possibly with replacement characters for invalid sequences).
Is this not your experience?
Perhaps I'm a bit too ignorant of python, but str is always a unicode string, right? or is there a new type for that?
I agree that sha1_hash(), add_piece(), 'generate_fingerprint()' and sha1_hash.to_string() should not use str.
How come str inputs to bencode() causes round-trip issues? You mean because when you decode it, all strings are bytes?
I'm not sure I'm convinced of that. If you have a unicode string it seems convenient to not have to encode it explicitly.
when libtorrent loads a torrent, it sanitizes file paths/names as well as comment and creator strings. They should be valid UTF-8 strings (possibly with replacement characters for invalid sequences).
Is this not your experience?
Ahh, I didn't think about the fact that the underyling torrent_info always does sanitization. I was focused on the fact that I can create a lot of non-utf8 data by instantiating file_storage() / create_torrent() directly.
So I'm wrong; I guess there isn't a case where torrent_info.name() is "wrongly" translating encoding-agnostic strings. The underlying c++ code already munged it to unicode, so it's appropriate for the python bindings to always present that as str.
You're right, python str is unicode.
How come
strinputs tobencode()causes round-trip issues? You mean because when you decode it, all strings are bytes? I'm not sure I'm convinced of that. If you have a unicode string it seems convenient to not have to encode it explicitly.
There's a particular danger of allowing str inputs to bencode(); the strings will be pathnames in many cases. If they came from the filesystem then they use python's filesystem encoding, and handling them with strict utf-8 encoding is inappropriate, but will work most of the time and so be a good source of bugs. I talk about this more in #5984.
For this reason and others, there has been steady momentum over the years to eliminate implicit encoding, or mixing str and bytes. I don't think it's too surprising to a python dev to see bencode() requires bytes. In the standard library today and the popular third-party libraries, I can't think of any cases where bytes/str can be mixed freely. There are some "templated" functions and data structures, but I can't think of a way to apply that to bencode()/bdecode().
Given all that, I think the burden is to come up with a good use case where mixing str/bytes provides value, and I haven't been able to come up with this. Most of my usage of bencode() is for creating synthetic data for unit tests, where it's as easy to write b"test.txt" as "test.txt". Otherwise I do some manipulation of resume data.
If I want to craft new torrents, I think create_torrent()/file_storage() is a fine high-level interface, where str is more appropriate. It's certainly more convenient than handcrafting an info-dict.
Deluge uses a custom bencode function, not libtorrent's. Not sure why.
Responding here from #6125
What kind of test do you envision for the
strpatch? that all valid "entry" objects round-trip throughbencode()->bdecode()?
I think the tests for str should be the same as the tests for other deprecated inputs -- that they bencode() to expected values, but generate warnings.
Test-wise, I think testing round-trip stability (v == bdecode(bencode(v))) makes tests readable, but it's not really a guarantee of the API especially in the face of "preformatted"-type inputs.
I do think round-trip stability is best, to improve clarity and reduce usage bugs. But unit tests should test guarantees, not design goals.
I actually think we should add tests that bdecode(bencode("abc")) == b"abc" to ensure we don't change functionality while deprecating, since the point of deprecating is to keep functionality for a while.
I changed my mind about torrent_handle.set_ssl_certificate(..., passphrase="str"). Passphrases are ultimately bytes, but I think it's probably fine if some app only supports them in a str context, so we should just leave it allowed. Removed it from the list.