libtorrent Python bindings: various apparent gaps in type signature mappings

There are a few functions in the python bindings that seem like they're intended to work given the source code, but invoking the bindings doesn't work or doesn't produce the expected result.

[x] create_torrent.add_url_seed("str arg"): produces unexpected results in output of create_torrent.generate(). maybe the input got mapped incorrectly?
[x] create_torrent.add_http_seed("str arg"): produces unexpected results in output of create_torrent.generate(). maybe the input got mapped incorrectly?
[x] create_torrent.add_collection("str arg"): produces unexpected results in output of create_torrent.generate(). maybe the input got mapped incorrectly?
[x] create_torrent.set_root_cert("str arg"): produces unexpected results in output of create_torrent.generate(). maybe the input got mapped incorrectly? I'm not sure if this function should be deprecated, or if it should be allowed, so as to accept PEM data as a str.
[x] fingerprint.name: accessing the attribute raises TypeError
[x] ip_filter.export_filter(): raises TypeError
~~torrent_info(bytearray(bencode({b"info": {...}}))): raises RuntimeError("torrent file is not a dictionary"), though the source code implies this should work~~
~~torrent_info(memoryview(bencode({b"info": {...}}))): raises RuntimeError("torrent file is not a dictionary"), though the source code implies this should work~~
[x] torrent_info(info_hash_t(sha1_hash(b"a" * 20))): raises RuntimeError("torrent file is not a dictionary"), though this should probably work
[x] torrent_info.add_tracker("str arg") does not match a type signature, though source code indicates it should allow the other arguments to be defaults
[x] torrent_info.add_url_seed("str arg") does not match a type signature, though source code indicates it should allow the other arguments to be defaults
[x] torrent_info.add_url_seed("str arg", "str arg", []) does not match a type signature, though source code indicates it should work
[x] torrent_info.add_http_seed("str arg") does not match a type signature, though source code indicates it should allow the other arguments to be defaults
[x] torrent_info.add_http_seed("str arg", "str arg", []) does not match a type signature, though source code indicates it should work
[x] torrent_info.file_at_offset(0) raises TypeError
[x] session_status.active_requests raises TypeError
[x] add_torrent_params.unfinished_pieces = {} does not match a type signature
[x] add_torrent_params.renamed_files = {} does not match a type signature
[x] add_torrent_params.merkle_tree = [] does not match a type signature
[x] session.dht_announce(sha1_hash(b"a"*20), 0, 0) does not match a type signature
[x] announce_flags argument (for dht_announce()) doesn't seem to be mapped anywhere, unless I missed it
~~session.find_torrent(info_hash_t(...)) isn't bound, but seems to exist in C++ code~~
[x] session.add_port_mapping(...) raises TypeError apparently due to being unable to type-map the return value
[x] torrent_handle.file_progress() does not match a type signature
[x] peer_disconnected_alert.reason does not match a type signature
[x] picker_log_alert.picker_flags does not match a type signature
[x] picker_log_alert.blocks() does not match a type signature
[x] dht_log_alert.module does not match a type signature
[x] dht_immutable_item_alert.item raises "invalid type requested from entry"
[x] dht_mutable_item_alert.key raises "no Python class registered"
[x] dht_mutable_item_alert.item raises "invalid type requested from entry"
[x] dht_mutable_item_alert.signature raises "no Python class registered"
[x] dht_put_item_alert.public_key raises "no Python class registered"
[x] dht_put_item_alert.signature raises "no Python class registered"

I'm finished writing new unit tests and won't add any more items to this issue.

Feb 18 '21 05:02 AllSeeingEyeTolledEweSew

there should be unit tests for these in test.py. when you say "unexpected output" what do you mean exactly? Is the string not treated as unicode? in .torrent files, strings are supposed to be encoded as utf-8 if they are meant to be human readable, and raw-bytes otherwise.

So, piece hashes are binary, name, filenames, creator and comment are utf-8.

It's not obvious to me that the character strings functions should accept bytes as well, but perhaps there's something I'm not considering.

Mar 21 '21 10:03 arvidn

there should be unit tests for these in test.py. when you say "unexpected output" what do you mean exactly?

In master on Linux:

>>> import libtorrent as lt
>>> fs = lt.file_storage()
>>> fs.add_file("test.txt", 1024)
>>> ct = lt.create_torrent(fs)
>>> ct.set_hash(0, b"a" * 1024)
>>> ct.add_url_seed("http://example.com")
>>> ct.generate()
{b'creation date': 1617038247, b'info': {b'length': 1024, b'name': b'test.txt', b'piece length': 16384, b'pieces': b'aaaaaaaaaaaaaaaaaaaa'}, b'url-list': b'h\x00\x00\x00t\x00\x00\x00t\x00\x00\x00p\x00\x00\x00:\x00\x00\x00/\x00\x00\x00/\x00\x00\x00e\x00\x00\x00x\x00\x00\x00a\x00\x00\x00m\x00\x00\x00p\x00\x00\x00l\x00\x00\x00e\x00\x00\x00.\x00\x00\x00c\x00\x00\x00o\x00\x00\x00m\x00\x00\x00'}

in .torrent files, strings are supposed to be encoded as utf-8 if they are meant to be human readable, and raw-bytes otherwise.

So, piece hashes are binary, name, filenames, creator and comment are utf-8.

It's not obvious to me that the character strings functions should accept bytes as well, but perhaps there's something I'm not considering.

utf-8 is ideal, but not enforced. In the wild I find a lot of cp1252 (or subsets or lookalikes, it's impossible to know), including a significant portion of "blessed" torrents on private trackers.

Even a careful, BEP-aware torrent creator may want to use strings of non-utf-8 or unknown encoding:

A tool that modifies or re-encodes an existing torrent may want to pass-through existing strings of unknown encoding, or reuse them elsewhere, especially collections
A tool that creates torrents intending that they will be decoded correctly by a target implementation/platform may want to intentionally encode everything as cp1252
Posix platforms treat filenames as opaque binary. Tools may want to pass through the opaque data, or generate non-utf-8 filenames for use by some downstream software.

I think strings with unknown encoding are here to stay, because

BEP 52 says that utf-8 MUST be used for "human-readable" data. This appears to only apply to filenames, but for those it says that utf-8 is only used IF the encoding is "known at creation time", implying arbitrary data MAY be used even for filenames.
I can't find any BEP that explicitly calls out any string as "human-readable" in any case. It seems fair for an implementation to decide that collections or creator may be binary hash values. In the real world, filenames are read by software as well as humans, and "human-readabe" is not the lowest common denominator between the two.
As above, filenames on posix platforms never have "known" encoding (except if the torrent creator also created the files themselves). They can't be reliably converted to utf-8.
This is not "solved" on windows, as its wide-char filenames can include invalid surrogate pairs, which also can't be converted to utf-8.
There is no reliable test for whether a string is utf-8. It has no explicit signature, and can be ambiguous with other encodings. Even if an implementation thinks a string is probably utf-8 but not explicitly marked, the careful approach to treat it as a string of unknown encoding. So a careful implementation must treat all strings in a BEP-compliant torrent as unknown encoding, even if it thinks the creator would have used utf-8 when "known".
Even a careful, BEP-compliant implementation can't explicitly mark that utf-8 was used in torrent creation. name.utf-8 and path.utf-8 exist, but appear to be nonstandard, and I can't see that the pattern has been used for other strings like comment. When the knowledge of a string's encoding is lost, a new string of unknown encoding is created.
Even if a new BEP solved encoding problems, new software can't reject invalid or ambiguously encoded torrents, as there are too many in the wild. Private trackers and "the scene" have arcane practices around fixing data; I have seen them refuse to fix e.g. obvious typo's in filenames, if the typo is part of a scene group's official release
Guessing or clobbering encodings and filenames is common practice with bittorrent software today, as a result of these and other issues. But this creates compatibility hell, where software must maintain its guesses and clobbers in future releases, and be compatible with those of others. It's far better to guess and clobber as little as possible, and handle strings as having unknown encoding if encoding isn't explicitly specified.
Even if guessing and clobbering works seamlessly 99% of the time, my overwhelming experience is that the other 1% of cases has their impact amplified by distorted expectations. Either developers miss test cases since they're so narrow, or users expect it to always work since it usually works.

Zooming out, I've tried to research prior art to follow patterns for how the python bindings should deal with string encodings here, but I think correct solutions are always going to be awkward because the BEPs have impedence mismatch with python.

Since 3.0 python brightened the line between str and bytes, i.e. character strings of explicit encoding vs opaque binary data. The BEPs' approach to encodings seems to make more sense in the C/C++ world, where string decoding is more rare, more ad-hoc, and is often done just for user presentation, so the user can correct errors visually or specify a different encoding.

Unfortunately I think the BEPs are so loosely written that it's pretty inevitable that implementations will continue to create a lot of weird data, whether for good reasons or ambiguity. So a feature-complete bittorrent implementation has a lot of work to do to support the lowest common denominator.

Mar 29 '21 20:03 AllSeeingEyeTolledEweSew

>>> import libtorrent as lt
>>> fs = lt.file_storage()
>>> fs.add_file("test.txt", 1024)
>>> ct = lt.create_torrent(fs)
>>> ct.set_hash(0, b"a" * 1024)
>>> ct.add_url_seed("http://example.com")
>>> ct.generate()

The unexpected part is that the strings are bytes? I think that makes sense, especially given all the arguments you made in favor of being encoding agnostic. But also, at the bencoding layer, you don't know whether a string of bytes should be interpreted as bytes are a utf-8 string.

Mar 29 '21 21:03 arvidn

It's this part: b'url-list': b'h\x00\x00\x00t\x00\x00\x00t\x00\x00\x00p.... It's like each character got cast to an int32 or something.

Mar 29 '21 21:03 AllSeeingEyeTolledEweSew

ah, yes. That doesn't look right

Mar 30 '21 08:03 arvidn

I believe this would fix ~~most~~ some of these issues. All the ones where a function takes a string_view at least.

https://github.com/arvidn/libtorrent/pull/6126

Apr 10 '21 17:04 arvidn

torrent_info.file_at_offset() is deprecated and returns an internal (also deprecated) type. I don't think it would make sense to make this work. Since it doesn't work now, likely, barely anybody is using it. I'd like to keep it that way so I think just removing it from the binding is the best option.

Apr 11 '21 14:04 arvidn

same thing goes for add_http_seed() actually. It's also deprecated in recent versions of libtorrent, and if it isn't working now, I don't think it should be fixed.

Apr 11 '21 14:04 arvidn

torrent_info.file_at_offset() is deprecated and returns an internal (also deprecated) type. I don't think it would make sense to make this work. Since it doesn't work now, likely, barely anybody is using it. I'd like to keep it that way so I think just removing it from the binding is the best option.

same thing goes for add_http_seed() actually. It's also deprecated in recent versions of libtorrent, and if it isn't working now, I don't think it should be fixed.

Seems good, especially if these were also deprecated in RC_1_2

Apr 11 '21 16:04 AllSeeingEyeTolledEweSew

https://github.com/arvidn/libtorrent/pull/6129 fixes the ip_filter issue

Apr 11 '21 17:04 arvidn

fingerprint.name is deprecated in 1.2.x, so rather than fixing it I will remove it

Apr 11 '21 17:04 arvidn

https://github.com/arvidn/libtorrent/pull/6130

Apr 11 '21 17:04 arvidn

I'm not convinced about set_creator() and set_comment() taking bytes. that seems wrong to me.

Apr 11 '21 17:04 arvidn

I suggested it to be consistent with add_collection() which takes bytes.

I think it's also reasonable to deprecate add_collection() taking bytes for consistency, if the intent of create_torrent is to enforce standards like utf-8 everywhere.

Or we could do neither, but even in this case I thought it's a good idea to have an intentional plan, and some unit tests

Apr 11 '21 20:04 AllSeeingEyeTolledEweSew

https://github.com/arvidn/libtorrent/pull/6149

Apr 15 '21 22:04 arvidn

https://github.com/arvidn/libtorrent/pull/6150

Apr 16 '21 00:04 arvidn

@arvidn I realized that #6126 may break more than you intend. torrent_info(bytes) had previously treated the input as a filename, but now it treats it as bencoded data.

I actually think it's a really good feature (it's good for bdecoding to happen in C++, rather than python).

But this may break some valid usage today. Python has good support of using bytes pathnames. Pathnames always can be converted to str with os.fsdecode but callers would need to do this explicitly. (See #5984 for more info)

Options I can think of:

My preference: keep torrent_info(bytes) as a constructor from pathname, and add a factory function. A common pattern in python is torrent_info.from_bencoded(bytes), not sure if boost.python makes that easy or not. Otherwise a top-level function would be fine. It would also be nice if torrent_info.from_bencoded(memoryview) worked: memoryview is like void* in python, and allows zero-copy data passing. On the cpython side, there's a "buffer protocol" for zero-copy access of a buffer-like python object. It would be nice if torrent_info.from_bencoded(...) took any argument that supported the buffer protocol
Use torrent_info(bytearray) and torrent_info(memoryview) as constructors from bencoded data, and torrent_info(bytes) as a constructor from pathname.

Apr 17 '21 18:04 AllSeeingEyeTolledEweSew

Also, #6126 (or something else) broke create_torrent.set_root_cert(bytes). Was this intended? I haven't figured out what the right test case is meant to be

Apr 17 '21 18:04 AllSeeingEyeTolledEweSew

create_torrent.add_collection(bytes) is also broken. Maybe it should be removed per https://github.com/arvidn/libtorrent/issues/5984#issuecomment-820767114 , but let's deprecate it first

Apr 17 '21 18:04 AllSeeingEyeTolledEweSew

I noticed in #6129, export_filter() doesn't emit the flags. Is this intended?

Apr 17 '21 21:04 AllSeeingEyeTolledEweSew

there is no session_handle::find_torrent() that takes an info_hash_t. Maybe there should be one, but I don't think it belongs on this list

Apr 23 '21 11:04 arvidn

This is an attempt to fix the DHT alert fields. It's not so easy to write a test for these, since you can't construct the alerts. Could you test it?

https://github.com/arvidn/libtorrent/pull/6166

Apr 23 '21 14:04 arvidn

I don't know what's going on with dht_mutable_item_alert.item

Apr 23 '21 14:04 arvidn

there is no session_handle::find_torrent() that takes an info_hash_t. Maybe there should be one, but I don't think it belongs on this list

ahh, I'd mistaken session_interface for session_handle. Removed from the list

This is an attempt to fix the DHT alert fields. It's not so easy to write a test for these, since you can't construct the alerts. Could you test it?

#6166

Responded in the PR; some stuff is still broken. Updated the list.

Apr 25 '21 08:04 AllSeeingEyeTolledEweSew

FYI I'm done writing new unit tests for now. I'll wrap them up into a PR, or several. I won't add any more items to this issue.

Apr 25 '21 08:04 AllSeeingEyeTolledEweSew

deluge files tab is affected by torrent_handle.file_progress

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/deluge/ui/gtk3/files_tab.py", line 296, in update
    component.get('SessionProxy').get_torrent_status(
  File "/usr/lib/python3.9/site-packages/deluge/ui/sessionproxy.py", line 147, in get_torrent_status
    d = client.core.get_torrent_status(torrent_id, keys_to_get, True)
  File "/usr/lib/python3.9/site-packages/deluge/ui/client.py", line 551, in __call__
    return self.daemon.call(self.base, *args, **kwargs)
  File "/usr/lib/python3.9/site-packages/deluge/ui/client.py", line 500, in call
    return defer.maybeDeferred(m, *copy.deepcopy(args), **copy.deepcopy(kwargs))
--- <exception caught here> ---
  File "/usr/lib/python3.9/site-packages/twisted/internet/defer.py", line 167, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/lib/python3.9/site-packages/deluge/core/core.py", line 762, in get_torrent_status
    return self.create_torrent_status(
  File "/usr/lib/python3.9/site-packages/deluge/core/core.py", line 742, in create_torrent_status
    status = self.torrentmanager[torrent_id].get_status(
  File "/usr/lib/python3.9/site-packages/deluge/core/torrent.py", line 1003, in get_status
    status_dict[key] = self.status_funcs[key]()
  File "/usr/lib/python3.9/site-packages/deluge/core/torrent.py", line 873, in get_file_progress
    self.handle.file_progress(), self.torrent_info.files()
Boost.Python.ArgumentError: Python argument types in
    torrent_handle.file_progress(torrent_handle)
did not match C++ signature:
    file_progress(libtorrent::torrent_handle {lvalue}, libtorrent::flags::bitfield_flag<unsigned char, libtorrent::file_progress_flags_tag, void> flags=0)

13:39:55 [CRITICAL][deluge.log                        :90  ] twisted.internet.defer 
[Failure instance: Traceback: <class 'Boost.Python.ArgumentError'>: Python argument types in
    torrent_handle.file_progress(torrent_handle)
did not match C++ signature:
    file_progress(libtorrent::torrent_handle {lvalue}, libtorrent::flags::bitfield_flag<unsigned char, libtorrent::file_progress_flags_tag, void> flags=0)
/usr/lib/python3.9/site-packages/deluge/ui/gtk3/files_tab.py:296:update
/usr/lib/python3.9/site-packages/deluge/ui/sessionproxy.py:147:get_torrent_status
/usr/lib/python3.9/site-packages/deluge/ui/client.py:551:__call__
/usr/lib/python3.9/site-packages/deluge/ui/client.py:500:call

Jun 16 '21 12:06 leigh123linux

It looks like we removed support for torrent_info() constructors from bytearray and memoryview.

That's fine, I expect no one ever used them. They were only on the list because it appeared this might have been supported at one time.

I removed them from the list

Aug 01 '21 22:08 AllSeeingEyeTolledEweSew

@arvidn Is this in the wrong milestone?

Deluge really needs the file_progress_flags fix #6369 for lt 2.0

Sep 26 '21 09:09 cas--