libtorrent Python bindings: deprecate some usage of str / bytes

Currently, the python bindings generally allow str as input where std::string or char * is expected internally. It attempts to convert str to bytes with the default encoding.

In some cases, I think this does more harm than good. I think we should deprecate this functionality (and fire DeprecationWarning).

[x] bencode(...) should not allow str in its input, including dict keys: this causes unstable round-trips (v != bdecode(bencode(v))) and I can't think of any benefit from it
[x] torrent_info({"info": ...}) should not allow str in its input: as above
[x] torrent_handle.add_piece(i, "value") should not accept str: any use of this is likely unintentional
[x] sha1_hash("value") should not accept str: any use of this is likely unintentional
[x] sha1_hash.to_string(): it's hard to imagine this was ever useful, outside unit tests
[x] generate_fingerprint() should be deprecated, in favor of a version that returns bytes: I think it's sensible to accept str here, but the return value is intended for use in binary protocols, and client code should handle it as such
[x] session.dht_get_mutable_item("key", "salt") should not accept str
[x] session.dht_put_mutable_item("private", "public", "data", "salt") should not accept str
[x] torrent_handle.rename_file(..., b"path") should not accept bytes
~~torrent_handle.set_ssl_certificate(..., passphrase="passphrase") should not accept str~~
[x] file_storage.add_file(b"path", ...) should not accept bytes
[x] file_storage.add_file(..., linkpath=b"path") should not accept bytes
[x] dht_mutable_item_alert.salt is str, and should just be changed to bytes. All other members of this alert are broken currently so I can't imagine it would marginally break anyone to just change the binding to bytes
[x] dht_put_item_alert.salt is str, and should be bytes, as above

I'm finished writing new unit tests and won't add any more items to this issue.

Feb 18 '21 01:02 AllSeeingEyeTolledEweSew

There's a solid argument that lots of str accessors should be deprecated, like file_storage.file_path() or even torrent_info.name(). These functions access encoding-agnostic byte strings, which may be created with unknown encodings on other systems, so it's usually a mistake to access them as str

However there are cases where str access is appropriate, such as an app creating torrent_info with synthetic data and filenames (not sourced from the filesystem). If the data was sourced from python str, it's totally fair to access it as str. Deprecating str access would be pretty unfriendly to such a case.

I am on the fence about whether we should be unfriendly to a fringe use case, in favor of preventing the huge majority of use cases from doing something wrong.

Feb 18 '21 02:02 AllSeeingEyeTolledEweSew

when libtorrent loads a torrent, it sanitizes file paths/names as well as comment and creator strings. They should be valid UTF-8 strings (possibly with replacement characters for invalid sequences).

Is this not your experience?

Perhaps I'm a bit too ignorant of python, but str is always a unicode string, right? or is there a new type for that?

I agree that sha1_hash(), add_piece(), 'generate_fingerprint()' and sha1_hash.to_string() should not use str.

How come str inputs to bencode() causes round-trip issues? You mean because when you decode it, all strings are bytes? I'm not sure I'm convinced of that. If you have a unicode string it seems convenient to not have to encode it explicitly.

Apr 10 '21 20:04 arvidn

when libtorrent loads a torrent, it sanitizes file paths/names as well as comment and creator strings. They should be valid UTF-8 strings (possibly with replacement characters for invalid sequences).

Is this not your experience?

Ahh, I didn't think about the fact that the underyling torrent_info always does sanitization. I was focused on the fact that I can create a lot of non-utf8 data by instantiating file_storage() / create_torrent() directly.

So I'm wrong; I guess there isn't a case where torrent_info.name() is "wrongly" translating encoding-agnostic strings. The underlying c++ code already munged it to unicode, so it's appropriate for the python bindings to always present that as str.

You're right, python str is unicode.

How come str inputs to bencode() causes round-trip issues? You mean because when you decode it, all strings are bytes? I'm not sure I'm convinced of that. If you have a unicode string it seems convenient to not have to encode it explicitly.

There's a particular danger of allowing str inputs to bencode(); the strings will be pathnames in many cases. If they came from the filesystem then they use python's filesystem encoding, and handling them with strict utf-8 encoding is inappropriate, but will work most of the time and so be a good source of bugs. I talk about this more in #5984.

For this reason and others, there has been steady momentum over the years to eliminate implicit encoding, or mixing str and bytes. I don't think it's too surprising to a python dev to see bencode() requires bytes. In the standard library today and the popular third-party libraries, I can't think of any cases where bytes/str can be mixed freely. There are some "templated" functions and data structures, but I can't think of a way to apply that to bencode()/bdecode().

Given all that, I think the burden is to come up with a good use case where mixing str/bytes provides value, and I haven't been able to come up with this. Most of my usage of bencode() is for creating synthetic data for unit tests, where it's as easy to write b"test.txt" as "test.txt". Otherwise I do some manipulation of resume data.

If I want to craft new torrents, I think create_torrent()/file_storage() is a fine high-level interface, where str is more appropriate. It's certainly more convenient than handcrafting an info-dict.

Deluge uses a custom bencode function, not libtorrent's. Not sure why.

Apr 10 '21 22:04 AllSeeingEyeTolledEweSew

Responding here from #6125

What kind of test do you envision for the str patch? that all valid "entry" objects round-trip through bencode() -> bdecode()?

I think the tests for str should be the same as the tests for other deprecated inputs -- that they bencode() to expected values, but generate warnings.

Test-wise, I think testing round-trip stability (v == bdecode(bencode(v))) makes tests readable, but it's not really a guarantee of the API especially in the face of "preformatted"-type inputs.

I do think round-trip stability is best, to improve clarity and reduce usage bugs. But unit tests should test guarantees, not design goals.

I actually think we should add tests that bdecode(bencode("abc")) == b"abc" to ensure we don't change functionality while deprecating, since the point of deprecating is to keep functionality for a while.

Apr 11 '21 16:04 AllSeeingEyeTolledEweSew

I changed my mind about torrent_handle.set_ssl_certificate(..., passphrase="str"). Passphrases are ultimately bytes, but I think it's probably fine if some app only supports them in a str context, so we should just leave it allowed. Removed it from the list.

Aug 02 '21 18:08 AllSeeingEyeTolledEweSew