Python bindings: various apparent gaps in type signature mappings
There are a few functions in the python bindings that seem like they're intended to work given the source code, but invoking the bindings doesn't work or doesn't produce the expected result.
- [x]
create_torrent.add_url_seed("str arg"): produces unexpected results in output ofcreate_torrent.generate(). maybe the input got mapped incorrectly? - [x]
create_torrent.add_http_seed("str arg"): produces unexpected results in output ofcreate_torrent.generate(). maybe the input got mapped incorrectly? - [x]
create_torrent.add_collection("str arg"): produces unexpected results in output ofcreate_torrent.generate(). maybe the input got mapped incorrectly? - [x]
create_torrent.set_root_cert("str arg"): produces unexpected results in output ofcreate_torrent.generate(). maybe the input got mapped incorrectly? I'm not sure if this function should be deprecated, or if it should be allowed, so as to accept PEM data as astr. - [x]
fingerprint.name: accessing the attribute raisesTypeError - [x]
ip_filter.export_filter(): raisesTypeError - ~~
torrent_info(bytearray(bencode({b"info": {...}}))): raisesRuntimeError("torrent file is not a dictionary"), though the source code implies this should work~~ - ~~
torrent_info(memoryview(bencode({b"info": {...}}))): raisesRuntimeError("torrent file is not a dictionary"), though the source code implies this should work~~ - [x]
torrent_info(info_hash_t(sha1_hash(b"a" * 20))): raisesRuntimeError("torrent file is not a dictionary"), though this should probably work - [x]
torrent_info.add_tracker("str arg")does not match a type signature, though source code indicates it should allow the other arguments to be defaults - [x]
torrent_info.add_url_seed("str arg")does not match a type signature, though source code indicates it should allow the other arguments to be defaults - [x]
torrent_info.add_url_seed("str arg", "str arg", [])does not match a type signature, though source code indicates it should work - [x]
torrent_info.add_http_seed("str arg")does not match a type signature, though source code indicates it should allow the other arguments to be defaults - [x]
torrent_info.add_http_seed("str arg", "str arg", [])does not match a type signature, though source code indicates it should work - [x]
torrent_info.file_at_offset(0)raisesTypeError - [x]
session_status.active_requestsraisesTypeError - [x]
add_torrent_params.unfinished_pieces = {}does not match a type signature - [x]
add_torrent_params.renamed_files = {}does not match a type signature - [x]
add_torrent_params.merkle_tree = []does not match a type signature - [x]
session.dht_announce(sha1_hash(b"a"*20), 0, 0)does not match a type signature - [x]
announce_flagsargument (fordht_announce()) doesn't seem to be mapped anywhere, unless I missed it - ~~
session.find_torrent(info_hash_t(...))isn't bound, but seems to exist in C++ code~~ - [x]
session.add_port_mapping(...)raisesTypeErrorapparently due to being unable to type-map the return value - [x]
torrent_handle.file_progress()does not match a type signature - [x]
peer_disconnected_alert.reasondoes not match a type signature - [x]
picker_log_alert.picker_flagsdoes not match a type signature - [x]
picker_log_alert.blocks()does not match a type signature - [x]
dht_log_alert.moduledoes not match a type signature - [x]
dht_immutable_item_alert.itemraises "invalid type requested from entry" - [x]
dht_mutable_item_alert.keyraises "no Python class registered" - [x]
dht_mutable_item_alert.itemraises "invalid type requested from entry" - [x]
dht_mutable_item_alert.signatureraises "no Python class registered" - [x]
dht_put_item_alert.public_keyraises "no Python class registered" - [x]
dht_put_item_alert.signatureraises "no Python class registered"
I'm finished writing new unit tests and won't add any more items to this issue.
there should be unit tests for these in test.py. when you say "unexpected output" what do you mean exactly?
Is the string not treated as unicode?
in .torrent files, strings are supposed to be encoded as utf-8 if they are meant to be human readable, and raw-bytes otherwise.
So, piece hashes are binary, name, filenames, creator and comment are utf-8.
It's not obvious to me that the character strings functions should accept bytes as well, but perhaps there's something I'm not considering.
there should be unit tests for these in
test.py. when you say "unexpected output" what do you mean exactly?
In master on Linux:
>>> import libtorrent as lt
>>> fs = lt.file_storage()
>>> fs.add_file("test.txt", 1024)
>>> ct = lt.create_torrent(fs)
>>> ct.set_hash(0, b"a" * 1024)
>>> ct.add_url_seed("http://example.com")
>>> ct.generate()
{b'creation date': 1617038247, b'info': {b'length': 1024, b'name': b'test.txt', b'piece length': 16384, b'pieces': b'aaaaaaaaaaaaaaaaaaaa'}, b'url-list': b'h\x00\x00\x00t\x00\x00\x00t\x00\x00\x00p\x00\x00\x00:\x00\x00\x00/\x00\x00\x00/\x00\x00\x00e\x00\x00\x00x\x00\x00\x00a\x00\x00\x00m\x00\x00\x00p\x00\x00\x00l\x00\x00\x00e\x00\x00\x00.\x00\x00\x00c\x00\x00\x00o\x00\x00\x00m\x00\x00\x00'}
in .torrent files, strings are supposed to be encoded as utf-8 if they are meant to be human readable, and raw-bytes otherwise.
So, piece hashes are binary, name, filenames, creator and comment are utf-8.
It's not obvious to me that the character strings functions should accept bytes as well, but perhaps there's something I'm not considering.
utf-8 is ideal, but not enforced. In the wild I find a lot of cp1252 (or subsets or lookalikes, it's impossible to know), including a significant portion of "blessed" torrents on private trackers.
Even a careful, BEP-aware torrent creator may want to use strings of non-utf-8 or unknown encoding:
- A tool that modifies or re-encodes an existing torrent may want to pass-through existing strings of unknown encoding, or reuse them elsewhere, especially
collections - A tool that creates torrents intending that they will be decoded correctly by a target implementation/platform may want to intentionally encode everything as cp1252
- Posix platforms treat filenames as opaque binary. Tools may want to pass through the opaque data, or generate non-utf-8 filenames for use by some downstream software.
I think strings with unknown encoding are here to stay, because
- BEP 52 says that utf-8 MUST be used for "human-readable" data. This appears to only apply to filenames, but for those it says that utf-8 is only used IF the encoding is "known at creation time", implying arbitrary data MAY be used even for filenames.
- I can't find any BEP that explicitly calls out any string as "human-readable" in any case. It seems fair for an implementation to decide that
collectionsorcreatormay be binary hash values. In the real world, filenames are read by software as well as humans, and "human-readabe" is not the lowest common denominator between the two. - As above, filenames on posix platforms never have "known" encoding (except if the torrent creator also created the files themselves). They can't be reliably converted to utf-8.
- This is not "solved" on windows, as its wide-char filenames can include invalid surrogate pairs, which also can't be converted to utf-8.
- There is no reliable test for whether a string is utf-8. It has no explicit signature, and can be ambiguous with other encodings. Even if an implementation thinks a string is probably utf-8 but not explicitly marked, the careful approach to treat it as a string of unknown encoding. So a careful implementation must treat all strings in a BEP-compliant torrent as unknown encoding, even if it thinks the creator would have used utf-8 when "known".
- Even a careful, BEP-compliant implementation can't explicitly mark that utf-8 was used in torrent creation.
name.utf-8andpath.utf-8exist, but appear to be nonstandard, and I can't see that the pattern has been used for other strings likecomment. When the knowledge of a string's encoding is lost, a new string of unknown encoding is created. - Even if a new BEP solved encoding problems, new software can't reject invalid or ambiguously encoded torrents, as there are too many in the wild. Private trackers and "the scene" have arcane practices around fixing data; I have seen them refuse to fix e.g. obvious typo's in filenames, if the typo is part of a scene group's official release
- Guessing or clobbering encodings and filenames is common practice with bittorrent software today, as a result of these and other issues. But this creates compatibility hell, where software must maintain its guesses and clobbers in future releases, and be compatible with those of others. It's far better to guess and clobber as little as possible, and handle strings as having unknown encoding if encoding isn't explicitly specified.
- Even if guessing and clobbering works seamlessly 99% of the time, my overwhelming experience is that the other 1% of cases has their impact amplified by distorted expectations. Either developers miss test cases since they're so narrow, or users expect it to always work since it usually works.
Zooming out, I've tried to research prior art to follow patterns for how the python bindings should deal with string encodings here, but I think correct solutions are always going to be awkward because the BEPs have impedence mismatch with python.
Since 3.0 python brightened the line between str and bytes, i.e. character strings of explicit encoding vs opaque binary data. The BEPs' approach to encodings seems to make more sense in the C/C++ world, where string decoding is more rare, more ad-hoc, and is often done just for user presentation, so the user can correct errors visually or specify a different encoding.
Unfortunately I think the BEPs are so loosely written that it's pretty inevitable that implementations will continue to create a lot of weird data, whether for good reasons or ambiguity. So a feature-complete bittorrent implementation has a lot of work to do to support the lowest common denominator.
>>> import libtorrent as lt
>>> fs = lt.file_storage()
>>> fs.add_file("test.txt", 1024)
>>> ct = lt.create_torrent(fs)
>>> ct.set_hash(0, b"a" * 1024)
>>> ct.add_url_seed("http://example.com")
>>> ct.generate()
The unexpected part is that the strings are bytes? I think that makes sense, especially given all the arguments you made in favor of being encoding agnostic. But also, at the bencoding layer, you don't know whether a string of bytes should be interpreted as bytes are a utf-8 string.
It's this part: b'url-list': b'h\x00\x00\x00t\x00\x00\x00t\x00\x00\x00p.... It's like each character got cast to an int32 or something.
ah, yes. That doesn't look right
I believe this would fix ~~most~~ some of these issues. All the ones where a function takes a string_view at least.
https://github.com/arvidn/libtorrent/pull/6126
torrent_info.file_at_offset() is deprecated and returns an internal (also deprecated) type. I don't think it would make sense to make this work. Since it doesn't work now, likely, barely anybody is using it. I'd like to keep it that way so I think just removing it from the binding is the best option.
same thing goes for add_http_seed() actually. It's also deprecated in recent versions of libtorrent, and if it isn't working now, I don't think it should be fixed.
torrent_info.file_at_offset()is deprecated and returns an internal (also deprecated) type. I don't think it would make sense to make this work. Since it doesn't work now, likely, barely anybody is using it. I'd like to keep it that way so I think just removing it from the binding is the best option.
same thing goes for
add_http_seed()actually. It's also deprecated in recent versions of libtorrent, and if it isn't working now, I don't think it should be fixed.
Seems good, especially if these were also deprecated in RC_1_2
https://github.com/arvidn/libtorrent/pull/6129 fixes the ip_filter issue
fingerprint.name is deprecated in 1.2.x, so rather than fixing it I will remove it
https://github.com/arvidn/libtorrent/pull/6130
I'm not convinced about set_creator() and set_comment() taking bytes. that seems wrong to me.
I suggested it to be consistent with add_collection() which takes bytes.
I think it's also reasonable to deprecate add_collection() taking bytes for consistency, if the intent of create_torrent is to enforce standards like utf-8 everywhere.
Or we could do neither, but even in this case I thought it's a good idea to have an intentional plan, and some unit tests
https://github.com/arvidn/libtorrent/pull/6149
https://github.com/arvidn/libtorrent/pull/6150
@arvidn I realized that #6126 may break more than you intend. torrent_info(bytes) had previously treated the input as a filename, but now it treats it as bencoded data.
I actually think it's a really good feature (it's good for bdecoding to happen in C++, rather than python).
But this may break some valid usage today. Python has good support of using bytes pathnames. Pathnames always can be converted to str with os.fsdecode but callers would need to do this explicitly. (See #5984 for more info)
Options I can think of:
- My preference: keep
torrent_info(bytes)as a constructor from pathname, and add a factory function. A common pattern in python istorrent_info.from_bencoded(bytes), not sure if boost.python makes that easy or not. Otherwise a top-level function would be fine. It would also be nice iftorrent_info.from_bencoded(memoryview)worked:memoryviewis likevoid*in python, and allows zero-copy data passing. On the cpython side, there's a "buffer protocol" for zero-copy access of a buffer-like python object. It would be nice iftorrent_info.from_bencoded(...)took any argument that supported the buffer protocol - Use
torrent_info(bytearray)andtorrent_info(memoryview)as constructors from bencoded data, andtorrent_info(bytes)as a constructor from pathname.
Also, #6126 (or something else) broke create_torrent.set_root_cert(bytes). Was this intended? I haven't figured out what the right test case is meant to be
create_torrent.add_collection(bytes) is also broken. Maybe it should be removed per https://github.com/arvidn/libtorrent/issues/5984#issuecomment-820767114 , but let's deprecate it first
I noticed in #6129, export_filter() doesn't emit the flags. Is this intended?
there is no session_handle::find_torrent() that takes an info_hash_t. Maybe there should be one, but I don't think it belongs on this list
This is an attempt to fix the DHT alert fields. It's not so easy to write a test for these, since you can't construct the alerts. Could you test it?
https://github.com/arvidn/libtorrent/pull/6166
I don't know what's going on with dht_mutable_item_alert.item
there is no
session_handle::find_torrent()that takes aninfo_hash_t. Maybe there should be one, but I don't think it belongs on this list
ahh, I'd mistaken session_interface for session_handle. Removed from the list
This is an attempt to fix the DHT alert fields. It's not so easy to write a test for these, since you can't construct the alerts. Could you test it?
#6166
Responded in the PR; some stuff is still broken. Updated the list.
FYI I'm done writing new unit tests for now. I'll wrap them up into a PR, or several. I won't add any more items to this issue.
deluge files tab is affected by torrent_handle.file_progress
Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/deluge/ui/gtk3/files_tab.py", line 296, in update
component.get('SessionProxy').get_torrent_status(
File "/usr/lib/python3.9/site-packages/deluge/ui/sessionproxy.py", line 147, in get_torrent_status
d = client.core.get_torrent_status(torrent_id, keys_to_get, True)
File "/usr/lib/python3.9/site-packages/deluge/ui/client.py", line 551, in __call__
return self.daemon.call(self.base, *args, **kwargs)
File "/usr/lib/python3.9/site-packages/deluge/ui/client.py", line 500, in call
return defer.maybeDeferred(m, *copy.deepcopy(args), **copy.deepcopy(kwargs))
--- <exception caught here> ---
File "/usr/lib/python3.9/site-packages/twisted/internet/defer.py", line 167, in maybeDeferred
result = f(*args, **kw)
File "/usr/lib/python3.9/site-packages/deluge/core/core.py", line 762, in get_torrent_status
return self.create_torrent_status(
File "/usr/lib/python3.9/site-packages/deluge/core/core.py", line 742, in create_torrent_status
status = self.torrentmanager[torrent_id].get_status(
File "/usr/lib/python3.9/site-packages/deluge/core/torrent.py", line 1003, in get_status
status_dict[key] = self.status_funcs[key]()
File "/usr/lib/python3.9/site-packages/deluge/core/torrent.py", line 873, in get_file_progress
self.handle.file_progress(), self.torrent_info.files()
Boost.Python.ArgumentError: Python argument types in
torrent_handle.file_progress(torrent_handle)
did not match C++ signature:
file_progress(libtorrent::torrent_handle {lvalue}, libtorrent::flags::bitfield_flag<unsigned char, libtorrent::file_progress_flags_tag, void> flags=0)
13:39:55 [CRITICAL][deluge.log :90 ] twisted.internet.defer
[Failure instance: Traceback: <class 'Boost.Python.ArgumentError'>: Python argument types in
torrent_handle.file_progress(torrent_handle)
did not match C++ signature:
file_progress(libtorrent::torrent_handle {lvalue}, libtorrent::flags::bitfield_flag<unsigned char, libtorrent::file_progress_flags_tag, void> flags=0)
/usr/lib/python3.9/site-packages/deluge/ui/gtk3/files_tab.py:296:update
/usr/lib/python3.9/site-packages/deluge/ui/sessionproxy.py:147:get_torrent_status
/usr/lib/python3.9/site-packages/deluge/ui/client.py:551:__call__
/usr/lib/python3.9/site-packages/deluge/ui/client.py:500:call
It looks like we removed support for torrent_info() constructors from bytearray and memoryview.
That's fine, I expect no one ever used them. They were only on the list because it appeared this might have been supported at one time.
I removed them from the list
@arvidn Is this in the wrong milestone?
Deluge really needs the file_progress_flags fix #6369 for lt 2.0