rqbit
Remove info hash hacks and misc
I ran into this same problem in my client https://github.com/anacrolix/torrent; there I solved it by implementing the equivalent of a deserializer into bytes (essentially it just type-checks the bencode) (https://github.com/anacrolix/torrent/blob/master/metainfo/metainfo.go#L16). I don't quite know enough Rust, but I think you either want to (in the same way as my implementation) store the info bytes directly in the metainfo so you can hash them at will (and do a subsequent deserialize if you want to turn them into an info you can work with), or, as I've done here, store info as a BencodeValue, which can be deterministically turned into bytes for hashing, or deserialized further into an Info. I did notice you use HashMap for BencodeValue::Dict. That should allow serializing back into a different order, but I couldn't trigger that behaviour with the unit test; I'm not sure how best to trigger it in Rust. The solution would be to have an ordered mapping type, or have Dict use a Vec internally and expose some kind of mapping interface. BitTorrent does specify that dict keys must be ordered like memcmp (though not in the original BEP 3), so client support is flaky, and that's why you want to deserialize info into something that retains the original ordering, at least if you intend to generate infohashes.
Moving the infohashing out of bencode moved the sha1w dep to librqbit_core. I didn't retain the features; I assume they aren't in use (except in the main bin, perhaps), but if they were I can put them back, mapping the features to core instead.
There's no way to deserialize from BencodeValue to a Rust type. I'm not well versed in serde, would that require a custom deserializer? Instead I serialized back into bytes then ran the bytes deserializer. It should still be correct, unless the Dict nondeterministic ordering strikes. At least for infohashing, there's no intermediate allocation as it can serialize straight into the hasher.
How does one deserialize into bytes with serde? If it were possible to have info: ByteBuf in the metainfo, that could be optimal for hashing after deserializing. With the BencodeValue route I went, I had to use BencodeValueOwned: I couldn't work out the trait and lifetime requirements to have it be BencodeValue<ByteBuf>. Do you know why that might be? Perhaps knowing whether BencodeValue or a raw ByteBuf type could be used in the metainfo to avoid allocations could decide the best choice?
I tried to add a std::io::Write impl for any ISha1, but Rust didn't like that, so I made a helper type, then noticed you had an identical private one in a function somewhere.
I have been toying with the idea of a Rust implementation of my torrent client for a while, but there are a few rough spots in my knowledge of Rust. If it works for you, I'll make some more contributions like this if you have the capacity for it.
I found something like the raw value that could work, https://docs.rs/serde_json/latest/serde_json/value/struct.RawValue.html. I imagine it could be done with some kind of reliance on ByteBuf.
I can take a closer look later (need to see it locally with an IDE), but high-level comments so far.
First of all, thanks for the contribution. Removing hacks is good, learning Rust along the way even better!
I don't quite know enough Rust, but I think you either want to (in the same way as my implementation) store the info bytes directly in the metainfo so you can hash them at will (and do a subsequent deserialize if you want to turn them into an info you can work with), or, as I've done here, store info as a BencodeValue, which can be deterministically turned into bytes for hashing, or deserialized further into an Info.
Not exactly. There are 3 ways an info_hash can be created for use in rqbit:
- Reading an existing .torrent file or its bytes - and the subsequent deserializer hack that you are trying to remove.
- Reading an existing info_hash from a magnet link
- The create_torrent_file() function, which is only used in tests so far.
Let's focus on 1 and 2 as the actual production usages. What rqbit cares about:
- frequent access to info_hash for all kinds of purposes
- frequent access to info and fields in it like filenames, piece hashes etc.
What it does not care about:
- serializing an already deserialized torrent
- thus no need to preserve serialization order (implied from above)
- thus no need to store fields that we don't care about, because we don't understand them.
I did notice you use HashMap for BencodeValue::Dict. That should allow serializing back into a different order, but I couldn't trigger that behaviour with the unit test; I'm not sure how best to trigger it in Rust. The solution would be to have an ordered mapping type, or have Dict use a Vec internally and expose some kind of mapping interface. BitTorrent does specify that dict keys must be ordered like memcmp (though not in the original BEP 3), so client support is flaky, and that's why you want to deserialize info into something that retains the original ordering, at least if you intend to generate infohashes.
If order is important BTreeMap should be used. As you can see, it was never important previously, as there's no need to serialize info.
and that's why you want to deserialize info into something that retains the original ordering, at least if you intend to generate infohashes.
Another option (which I probably would prefer) is to deserialize into both:
- a custom struct that has all the fields we know how to work with
- something that lets you dump it back preserving the order. The easiest thing here is just plain bytes - not even BencodeValue.
So if I were to do anything about it, I would store 3 things:
- The info_hash for frequent access. Computing it on every call is expensive and unnecessary - it never changes.
- The TorrentMetaV1Info - for all the fields we care about.
- The raw bytes of TorrentMetaV1Info - in case we ever need to restore the original in its pure form (e.g. if an extended peer is asking us to give them the content of the torrent info, this feature doesn't exist today).
Moving the infohashing out of bencode moved the sha1w dep to librqbit_core. I didn't retain the features; I assume they aren't in use (except in the main bin, perhaps), but if they were I can put them back, mapping the features to core instead.
Great, let it burn! The --sha1* features are from the old times when I was measuring perf and picking the best option. But the one used right now is the clear winner, so all this cruft can be removed from everywhere.
There's no way to deserialize from BencodeValue to a Rust type. I'm not well versed in serde, would that require a custom deserializer? Instead I serialized back into bytes then ran the bytes deserializer. It should still be correct, unless the Dict nondeterministic ordering strikes. At least for infohashing, there's no intermediate allocation as it can serialize straight into the hasher.
There's no way to deserialize from BencodeValue to a Rust type: that would require a custom Serializer and Deserializer for BencodeValue, which, although it can be done, is overkill for the task at hand.
Instead I serialized back into bytes then ran the bytes deserializer: this is a fine "hack", but with the above recommendation I don't think we need it.
How does one deserialize into bytes with serde?
Code: e.g. you create a Vec<u8> (possibly wrapped into a BufWriter for performance), then call bencode_serialize_to_writer on it. Vec<u8> implements std::io::Write.
If it were possible to have info: ByteBuf in the metainfo, that could be optimal for hashing after deserializing. With the BencodeValue route I went, I had to use BencodeValueOwned: I couldn't work out the trait and lifetime requirements to have it be BencodeValue<ByteBuf>. Do you know why that might be? Perhaps knowing whether BencodeValue or a raw ByteBuf type could be used in the metainfo to avoid allocations could decide the best choice?
It should be possible. Something like serde_json's RawValue, which you pointed out, can do this. But of course it would need to be implemented for rqbit's deserializer; we can't reuse serde_json's, as it is specific to that crate.
That would require adding other "hacks" to the serializer and deserializer. In fact serde_json's "RawValue" is very similar to what we were doing with "info_hash" - just that our hack wasn't reusable and only served one purpose.
I would imagine it would be possible to write something like this:
```rust
struct WithRawValueAndHash<T, ByteBuf> {
    deserialized_value: T,
    raw_value: ByteBuf,
    hash: Id20,
}

struct TorrentMetaV1<ByteBuf> {
    ...
    info: WithRawValueAndHash<TorrentMetaV1Info<ByteBuf>, ByteBuf>,
    ...
}
```
And then somehow (not sure how, need to try) make it work automatically with BencodeSerializer and BencodeDeserializer, in a similar way to serde_json::RawValue.
I tried to add a std::io::Write impl for any ISha1, but Rust didn't like that, so I made a helper type, then noticed you had an identical private one in a function somewhere.
If you're cleaning this up anyway, we can just get rid of the ISha1 generic, all the related features, and the multiple implementations. We can even get rid of the whole sha1w crate and just have one "Sha1" struct in librqbit_core.
I have been toying with the idea of a Rust implementation of my torrent client for a while, but there are a few rough spots in my knowledge of Rust. If it works for you, I'll make some more contributions like this if you have the capacity for it.
Sounds good, happy to help!
Overall, I don't hate it, but I don't love it.
The little hack was pretty innocent but made things simpler, I think - it was all structured types, without converting back and forth between bytes and deserialized values.
I'm not sure it's worth it without creating a general abstraction like "WithRawValueAndHash" above. I'm not sure it's worth creating that abstraction either though - hard to tell without trying.
Let me know what you think and if you agree with anything I said :)
I'm not sure it's worth it without creating a general abstraction like "WithRawValueAndHash" above. I'm not sure it's worth creating that abstraction either though - hard to tell without trying.
I just tried this in the test-bencode-ser branch for fun. It's cool that it works, but I'm still not sure it's worth the complexity. Maybe it is.
Let's focus on 1 and 2 as the actual production usages. What rqbit cares about:
- frequent access to info_hash for all kinds of purposes
- frequent access to info and fields in it like filenames, piece hashes etc.
Agreed. If it wasn't clear, I want to expose the info_hash by doing the hashing just once. I'm thinking of a wrapper around metainfo parsing that returns the metainfo, the info, and the infohash.
What it does not care about:
- serializing an already deserialized torrent
- thus no need to preserve serialization order (implied from above)
- thus no need to store fields that we don't care about, because we don't understand them.
I agree. I did 1 just as a hack. I think it's clear that extracting the bytes directly for hashing is probably optimal. Then the order never matters, as you say.
I did notice you use HashMap for BencodeValue::Dict. That should allow serializing back into a different order, but I couldn't trigger that behaviour with the unit test; I'm not sure how best to trigger it in Rust. The solution would be to have an ordered mapping type, or have Dict use a Vec internally and expose some kind of mapping interface. BitTorrent does specify that dict keys must be ordered like memcmp (though not in the original BEP 3), so client support is flaky, and that's why you want to deserialize info into something that retains the original ordering, at least if you intend to generate infohashes.
If order is important BTreeMap should be used. As you can see, it was never important previously, as there's no need to serialize info.
Turns out there was a BTreeMap in the serializer, which is why I couldn't trigger a bug with bencode generated from rqbit.
and that's why you want to deserialize info into something that retains the original ordering, at least if you intend to generate infohashes.
Another option (which I probably would prefer) is to deserialize into both:
- a custom struct that has all the fields we know how to work with
- something that lets you dump it back preserving the order. The easiest thing here is just plain bytes - not even BencodeValue.
I suspect you can't deserialize into both an info with borrowed byte buffers and a reference to the region of bytes the info was deserialized from. At least not without some RefCell magic or other shenanigans.
So if I were to do anything about it, I would store 3 things:
- The info_hash for frequent access. Computing it on every call is expensive and unnecessary - it never changes.
- The TorrentMetaV1Info - for all the fields we care about.
- The raw bytes of TorrentMetaV1Info - in case we ever need to restore the original in its pure form (e.g. if an extended peer is asking us to give them the content of the torrent info, this feature doesn't exist today).
3 is very forward thinking of you. I guess to that end, it would be okay to deserialize and end up with a copy of the bytes that the info was deserialized from. Whatever the deserializer read from probably isn't sticking around.
Code: e.g. you create a Vec<u8> (possibly wrapped into a BufWriter for performance), then call bencode_serialize_to_writer on it. Vec<u8> implements std::io::Write.
Ah, no, I meant something like serde_json's RawValue. Essentially type-check the input and end up with a reference to it in the output value, without doing any allocation.
I would imagine it would be possible to write something like this:
```rust
struct WithRawValueAndHash<T, ByteBuf> {
    deserialized_value: T,
    raw_value: ByteBuf,
    hash: Id20,
}

struct TorrentMetaV1<ByteBuf> {
    ...
    info: WithRawValueAndHash<TorrentMetaV1Info<ByteBuf>, ByteBuf>,
    ...
}
```
And then somehow (not sure how, need to try) to make it work automatically with BencodeSerializer and BencodeDeserializer in a similar way to serde_json::RawValue
An interesting challenge, but it leaves you with fields that are irrelevant to a serializer. Hence why I think it's more accurate to have info_bytes inside the metainfo, and then deserialize that again to get the info (and vice versa: to serialize a metainfo, you serialize the info and then put it in the info field).
Sounds good, happy to help!
Thanks!
I'm not sure it's worth it without creating a general abstraction like "WithRawValueAndHash" above. I'm not sure it's worth creating that abstraction either though - hard to tell without trying.
I just tried this in the test-bencode-ser branch for fun. It's cool that it works, but I'm still not sure it's worth the complexity. Maybe it is.
Nice. I think this approach might work, but without creating the object too, for the reason I gave about not having stuff specific to the serializer or deserializer. Essentially RawValue would let you walk across bencode values without doing any allocation.
Turns out there was a BTreeMap in the serializer
Ah, right, I remember doing it now. I did it that way so that it doesn't matter whether the passed-in object is a HashMap or a BTreeMap (since we can't distinguish them in the serializer, I guess).
I suspect you can't deserialize into both an info with borrowed byte buffers and a reference to the region of bytes the info was deserialized from.
If you mean self-referential structs, then yes, it's impossible to do in safe Rust. E.g.

```rust
struct Torrent {
    info_bytes: Vec<u8>,
    info: TorrentMetaInfo<... this can't reference "info_bytes">,
}
```
However, there are 2 things that could be done theoretically, I'm not saying we need it:
- For borrowed structs (that reference the original bencode binary), it's not a problem at all, as you can have an &[u8] in both "info_bytes" and "info".
- For owned structs, one could use smth like the "bytes" crate, which exposes a reference-counted blob of bytes that you can slice.
I guess to that end, it would be okay to deserialize and end up with a copy of the bytes that the info was deserialized from. Whatever the deserializer read from probably isn't sticking around.
Either a copy (which isn't a huge problem), or smth like "bytes" above.
Essentially type check the input and end up with a reference to it in the output value without doing any allocation.
That's why everything is generic over "BufT", so that you can have both owned and borrowed versions. But the owned one doesn't do reference-counted byte blobs.
Essentially RawValue would let you walk across bencode values without doing any allocation.
You could even today, but it would be useless because we can't store it. E.g. if we stored &[u8] or Vec<u8>
I'm not sure it's worth it without creating a general abstraction like "WithRawValueAndHash" above. I'm not sure it's worth creating that abstraction either though - hard to tell without trying.
I just tried this in the test-bencode-ser branch for fun. It's cool that it works, but I'm still not sure it's worth the complexity. Maybe it is.
Thanks for this. Your implementation is much better than JSON's, I don't know why theirs is so convoluted. I learned a lot from it.
So after thinking about it for a bit here's what I see:
- I think the existing "f**ing hack" as I called it in the code is alright, not too bad.
- Replacing it with another hack (storing a BencodeValue, then serializing and deserializing it again) doesn't seem any better.
- "WithRawBytes" seems less of a hack, so it could be used. But the sheer amount of boilerplate and clever stuff needed to make serde fill it in for you makes it questionable - why is this needed at all?
So I see 3 options on the table:
- Do nothing.
- Merge your PR. What does it give us?
- Finish WithRawBytes and merge that. What does it give us? Only future-proofing for features that would need the ability to restore the original bytes as they were without relying on serializing code being 100% byte-to-byte reproducible.
If we do something and end up storing 2 copies, we might as well see if we can use the "bytes" crate for reference-counted byte blobs (which can get pretty big for large torrents with many pieces).
I'm leaning to either 1 or 3. I'm not super against 2 though. Wdyt?
I'm leaning toward 3 but without holding an intermediate value. I will update the PR in a day or two.
Sounds good @anacrolix , I'll leave it up to you.
I made a good mess that I want to clean up once I can prove it will compile. I mentioned where I'm stuck above.
The other difficulty I can see is backward compatibility with session states from old versions: how would you ensure that the info field can be deserialized from the form it's currently in on disk (base64 of JSON, or whatever it is)? It should probably be ignored when it fails to deserialize, and then when the state is reserialized, include the info in its raw byte form. I got an error running with a session from an older version of rqbit: the torrent didn't show up in the UI, but the app continued to work, and when I added the torrent again it serialized it correctly. I'm not sure how you would want to handle upgrading from old session states, or if you would just flag it as a breaking change in the release notes.
A few things:
I don't suppose you could check this branch out in your IDE and help me solve the single compile error I have left?
There you go https://github.com/ikatson/rqbit/commit/b345dc5933dcae0ffeb19ca5c97c9ab5dc47876e
how would you go about ensuring that the info field can be deserialized from the form it's currently in on disk
It uses custom functions to serialize/deserialize today - serialize_torrent/deserialize_torrent. You can just update them to try both formats (new and old). Let's not break backwards compat for no good reason.
Will put some more comments into the PR itself
A similar thing was merged in recently, so this is no longer needed. We now store reference-counted bytes: both the full info and the deserialized one.