intermodal [Possible Bug] imdl fails to deserialize torrents with .utf-8 key variants

Hi,

imdl fails to deserialize torrents that contain the .utf-8 key variants name.utf-8 and path.utf-8.

They are not explicitly defined in BEP3 but seem to have been introduced by BitTorrent Inc. and are used in uTorrent; also see this.

To explain it briefly:

Torrents of this type hold, for example, both a name key and a name.utf-8 key. The values of these two dict entries are encoded differently, so there are two variants of the name (I think the plain key is ASCII or the system default encoding, and the .utf-8 key is UTF-8). BEP3 says strings should always be UTF-8. The same pattern is common in the entries of the files list, which carry path and path.utf-8 in the same way as name and name.utf-8.

torf accepts these files without problems (they are simply treated as valid).

This can be fixed by rewriting the torrent, e.g. with torf: copy the values of the .utf-8 fields into the regular fields and then remove the .utf-8 variants.

Because this does not seem to violate BEP3 directly and is used in practice, I guess it's best to behave like clients do: if the .utf-8 variant exists, prefer it.
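To make the "prefer the .utf-8 variant" rule concrete, here is a minimal Python sketch. It assumes the torrent has already been decoded into dicts with bytes keys, as a raw bencode decoder would produce. The helper name preferred_text and the sample values are hypothetical and are not part of imdl or torf.

```python
def preferred_text(entry, key):
    """Return the text for `key`, preferring the `<key>.utf-8` variant.

    `entry` is a dict with bytes keys. Values are either a byte string
    (like name) or a list of byte strings (like path).
    Hypothetical helper for illustration only.
    """
    variant = key + b".utf-8"
    if variant in entry:
        value, errors = entry[variant], "strict"
    else:
        # No variant present: fall back to the plain key and replace
        # undecodable bytes instead of failing.
        value, errors = entry[key], "replace"
    if isinstance(value, list):
        return [part.decode("utf-8", errors=errors) for part in value]
    return value.decode("utf-8", errors=errors)


# Illustrative info dict: the plain key uses a legacy encoding,
# the .utf-8 key holds the same text as UTF-8.
info = {
    b"name": "Künstler".encode("cp1252"),
    b"name.utf-8": "Künstler".encode("utf-8"),
}
print(preferred_text(info, b"name"))
```

This only affects how the values are read; the torrent itself is left untouched, so the info hash stays the same.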

I could provide a sample from the wild via Discord.

Best Regards

Mar 06 '24 18:03 DerBunteBall

Thanks for the report! I'd be fine if someone created a PR supporting this, although it might, in practice, be quite a messy PR. imdl relies on serde for serializing and deserializing bencode, and I think it would be a bit hard to make it accept both and fall back to the .utf-8 fields.

Mar 06 '24 20:03 casey

I think maybe the best way to support this would be to have a fix command, which could fix common problems with torrents and avoid using serde. I.e., it uses raw bencode deserialization, and upon encountering a field that ends with .utf-8, replaces the corresponding non-.utf-8 field.
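As a rough illustration of that idea (in Python rather than Rust, and not how imdl is implemented), the sketch below decodes raw bencode without any typed model, replaces each plain field with the value of its .utf-8 variant, and drops the variant. The functions bdecode, bencode and promote_utf8 are hypothetical names written for this example, and the sample metainfo is made up.

```python
SUFFIX = b".utf-8"


def bdecode(data, i=0):
    """Decode one bencoded value starting at data[i]; return (value, next_index)."""
    c = data[i:i + 1]
    if c == b"i":  # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c == b"l":  # list: l<items>e
        i += 1
        items = []
        while data[i:i + 1] != b"e":
            item, i = bdecode(data, i)
            items.append(item)
        return items, i + 1
    if c == b"d":  # dict: d<key><value>...e
        i += 1
        result = {}
        while data[i:i + 1] != b"e":
            key, i = bdecode(data, i)
            value, i = bdecode(data, i)
            result[key] = value
        return result, i + 1
    # byte string: <length>:<bytes>
    colon = data.index(b":", i)
    start = colon + 1
    length = int(data[i:colon])
    return data[start:start + length], start + length


def bencode(value):
    """Encode ints, bytes, lists and dicts (keys sorted as raw bytes)."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        return b"d" + b"".join(bencode(k) + bencode(value[k]) for k in sorted(value)) + b"e"
    raise TypeError("cannot bencode %r" % type(value))


def promote_utf8(value):
    """Recursively replace `key` with the value of `key.utf-8` and drop the variant."""
    if isinstance(value, list):
        return [promote_utf8(v) for v in value]
    if isinstance(value, dict):
        out = {k: promote_utf8(v) for k, v in value.items() if not k.endswith(SUFFIX)}
        for k, v in value.items():
            if k.endswith(SUFFIX):
                out[k[:-len(SUFFIX)]] = promote_utf8(v)
        return out
    return value


# Made-up metainfo with legacy-encoded plain keys and UTF-8 variants.
raw = bencode({
    b"info": {
        b"files": [{
            b"length": 5,
            b"path": [b"K\xfcnstler.txt"],
            b"path.utf-8": ["Künstler.txt".encode("utf-8")],
        }],
        b"name": b"K\xfcnstler",
        b"name.utf-8": "Künstler".encode("utf-8"),
        b"piece length": 16384,
        b"pieces": b"\x00" * 20,
    }
})
meta, _ = bdecode(raw)
fixed = promote_utf8(meta)
print(fixed[b"info"][b"name"].decode("utf-8"))
```

Because this rewrites the info dict, the resulting torrent would have a different info hash than the input.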

Mar 06 '24 20:03 casey

The torrent dump command outputs them normally, but show and all other commands fail. I think the dump command uses plain bencode.

It can be fixed by putting the values of .utf-8 fields into the normal ones.

In torf something simple like this did it for me:

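#!/usr/bin/env python3
"""Copy the values of the .utf-8 key variants into the regular keys, then drop the variants."""

import sys
from torf import Torrent

def main(argv=None):
    # Torrent.read() is a classmethod, so no instance has to be created first.
    t = Torrent.read("my_torrent.torrent")
    info = t.metainfo["info"]
    # "files" only exists in multi-file torrents.
    for my_file in info.get("files", []):
        if "path.utf-8" in my_file:
            my_file["path"] = my_file.pop("path.utf-8")
    if "name.utf-8" in info:
        info["name"] = info.pop("name.utf-8")
    # Writing the file changes the info hash, because the info dict changed.
    t.write("my_torrent_fixed.torrent")

if __name__ == "__main__":
    main(sys.argv)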

Mar 06 '24 20:03 DerBunteBall

Just removing the .utf-8 keys doesn't help, by the way. In this case the name field can be in any encoding, so this leads to the same error. UTF-8 seems to be strictly expected by serde.
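A small Python sketch of why dropping the variant alone is not enough. It assumes the plain name field holds cp1252 bytes, which is only one possible legacy encoding, and it mimics strict UTF-8 decoding in general rather than imdl's actual code path.

```python
# Illustrative info dict: plain key in a legacy encoding, variant in UTF-8.
info = {
    b"name": "Künstler".encode("cp1252"),       # b'K\xfcnstler'
    b"name.utf-8": "Künstler".encode("utf-8"),
}

# "Fix" the torrent by only removing the variant.
del info[b"name.utf-8"]

# A strict UTF-8 decode of the remaining plain field still fails,
# because 0xFC is not a valid byte in UTF-8.
try:
    info[b"name"].decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print("plain name decodes as strict UTF-8:", strict_ok)
```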

Mar 06 '24 20:03 DerBunteBall

imdl should be able to handle these things natively.

The reason is that files modified like this get new info hashes. This is true both for torrents where the optional md5sum keys contain invalid data (not plain strings) and for torrents with .utf-8 key variants. That's at least the state of the implementation for now. I guess it's possible to do it another way, but that could violate the specification.

So the fix above leads to a torrent with a different info hash than the original, but its content can still be verified because the piece hashes aren't touched. torf creates a new info hash when writing out the file by hashing the info dict. I guess there is no real specification for torrent modification, so it would be possible to just modify the torrent without changing the info hash (just modify the bencode).

BUT the change of the info hash will be confusing, because in the case where only the bencode is modified, the hash wouldn't be valid for the info dict.
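The effect on the info hash can be shown with a short Python sketch. The v1 info hash is the SHA-1 of the bencoded info dict, so any change to the name keys changes it, while the pieces value stays byte-for-byte the same. The bencode and info_hash helpers and the sample info dict are written for this example only.

```python
import hashlib


def bencode(value):
    """Encode ints, bytes, lists and dicts (keys sorted as raw bytes)."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        return b"d" + b"".join(bencode(k) + bencode(value[k]) for k in sorted(value)) + b"e"
    raise TypeError("cannot bencode %r" % type(value))


def info_hash(info):
    """SHA-1 over the bencoded info dict (BitTorrent v1 info hash)."""
    return hashlib.sha1(bencode(info)).hexdigest()


# Made-up single-file info dict with a legacy name and a UTF-8 variant.
original = {
    b"length": 5,
    b"name": b"K\xfcnstler",
    b"name.utf-8": "Künstler".encode("utf-8"),
    b"piece length": 16384,
    b"pieces": hashlib.sha1(b"hello").digest(),
}

# Apply the fix: move the UTF-8 value into the plain key.
fixed = dict(original)
fixed[b"name"] = fixed.pop(b"name.utf-8")

print("info hash changed:", info_hash(original) != info_hash(fixed))
print("piece hashes unchanged:", original[b"pieces"] == fixed[b"pieces"])
```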

Mar 11 '24 12:03 DerBunteBall

I think this would just be too complex for the implementation. imdl uses serde, and I can't think of a simple way to substitute values from .utf-8 keys when they are present.

May 14 '24 21:05 casey