libtorrent Python bindings: add bytes accessors for data

A number of fields in the torrent protocol (or libtorrent) are encoding-agnostic byte strings.

We should ensure there's always a way to access bytes, if bytes may be passed as input, or if the underlying data is encoding-agnostic.

One pattern in python for this is "runtime templating", where the type of output changes with the type of input. For instance, os.listdir(".") returns a list of str, while os.listdir(b".") returns a list of bytes. However, the python bindings have historically allowed either bytes or str as input for char * or std::string, but tend to always return str. If we make output type vary with input, we'll probably break a lot of clients.

The only alternative I can think of is have *_bytes versions of functions.

Here's a proposed list of functions returning str, which should have pair functions returning bytes:

EDIT: I changed my mind about most of these, realizing that the underlying C++ logic munges inputs and presents utf-8 strings. generate_fingerprint() is the only one that seems relevant now.

~~file_storage.symlink(): symlink_bytes()~~
~~file_storage.file_path(): file_path_bytes()~~
~~file_storage.file_name(): file_name_bytes()~~
~~file_storage.name(): name_bytes()~~
[x] generate_fingerprint(): generate_fingerprint_bytes()
~~torrent_info.web_seeds(): web_seeds_bytes() (the "auth" field may be binary data)~~
~~torrent_info.name(): name_bytes()~~
~~torrent_info.creator(): creator_bytes()~~
~~torrent_info.comment(): comment_bytes()~~
~~torrent_info.collections(): collections_bytes()~~

I'll add more as I find them.

Adding these new functions should be a non-breaking change.

Thoughts on file_storage

EDIT: never mind

~~There's an argument that the fields of file_storage should be given "pathname treatment", and use the filesystem encoding as detailed in #5984. This would make sense when creating torrents off the local filesystem.~~

~~However, currently file_storage is used in lots of different contexts. A torrent found randomly on the internet and converted into a torrent_info has a file_storage whose bytes may represent an encoding from some other system.~~

~~So I think the best thing is to add *_bytes() accessors to file_storage, and clients can apply whatever decoding (or guesses, with fallbacks) as necessary.~~

~~Notably, that means that the proper usage for creating files from the filesystem is a little funky:~~

fs = lt.file_storage()
lt.add_files(fs, my_root_dir)
for i in fs.num_files():
    # file_storage is abstract, but we know the encoding came from the filesystem
    print(os.fsdecode(fs.file_path_bytes(i)))

~~The only alternative I can think of here is that file_storage could hold a flag to remember how its inputs are encoded, such that fs.file_path(i) is the same as os.fsdecode(fs.file_path_bytes(i)) if fs was populated from the filesystem, but I can't see any way to add this feature without making it confusing and error prone.~~

Feb 17 '21 20:02 AllSeeingEyeTolledEweSew

Also: URIs and HTTP header names and values are meant to be limited to ASCII by various underlying protocols, so IMO it's not important to make these data accessible as bytes. We can do this for API consistency if we want, though.

Feb 18 '21 01:02 AllSeeingEyeTolledEweSew

If my perspective in https://github.com/arvidn/libtorrent/issues/5984#issuecomment-820767114 is right, I'll close this.

generate_fingerprint() returning bytes is the only ask that would still be legitimate, but I don't feel strongly about it

Apr 16 '21 18:04 AllSeeingEyeTolledEweSew

It seems right to return bytes there. The tests should also make sure bytes could be used to set the peer id

Apr 17 '21 13:04 arvidn

Per a lot of discussion, I stripped this down to just add a bytes version of generate_fingerprint(). I did that in #6349

Jul 30 '21 16:07 AllSeeingEyeTolledEweSew