borsh-rs icon indicating copy to clipboard operation
borsh-rs copied to clipboard

Enforce canonicity for HashMap and BinaryHeap

Open matklad opened this issue 2 years ago • 7 comments

Binary heap serialization is very wrong:

https://github.com/near/borsh-rs/blob/219be37d92eb5402fc597a4b57d64f2af2c076cf/borsh/src/ser/mod.rs#L277-L279

As per documentation,

Returns an iterator visiting all values in the underlying vector, in arbitrary order.

This is exactly the thing that borsh should make sure never happens, it absolutely must guarantee serialization in a specific order. Fixing this would be a breaking change for serialization format for binary heaps :( Feels we need to do it though -- I doubt anyone really uses binary heaps with borsh. Actually, I think it's better to remove impl for BinaryHeap altogether -- it's not the kind of data strcuture you should be using for serialization anyway, and by elevating this to a compile time error we make sure that we won't subtly break anyone.

HashMap deserialization is subtly wrong:

https://github.com/near/borsh-rs/blob/219be37d92eb5402fc597a4b57d64f2af2c076cf/borsh/src/de/mod.rs#L365-L376

When we serialize a map, we sort the key-value sequence. We should check on deserialization that the sequence is sorted and emit an error if that's no the case. That's technically a breaking bug fix, though shouldn't affect things in practice, as we do sort stuff on serialization.

matklad avatar Dec 22 '21 15:12 matklad

@austinabell am I correct in my assessment that no one actually tries to borsh-serialize binary heaps?

matklad avatar Dec 22 '21 15:12 matklad

@austinabell am I correct in my assessment that no one actually tries to borsh-serialize binary heaps?

Not that I am aware of, but that doesn't mean it isn't possible that there is usage in the community. I don't think it's a huge concern though because if they absolutely need this type to be serialized, they can just wrap it and create their own implementation that mimics this old behaviour or something.

As for HashMap, it's actually used a lot around the SDK and even in community unfortunately because it used to be promoted as the way to store small maps (still even used in our examples). If the deserialization order is checked, this will break a lot of things if we update. The issue also is just that within the NFT standard, HashMap is used as part of the NFT standard trait.

I did try to purge all HashMap usages from SDK before, can't remember why I didn't end up PRing, possibly the breaking change to NFT, but it's probably worth the breaking change

austinabell avatar Dec 22 '21 16:12 austinabell

If the deserialization order is checked, this will break a lot of things if we update.

Why would this be the case? It'll only break cases where you manually construct a Vec<u8> with a representation that is different from what you get when you serialize map. We do sort on serialization.

matklad avatar Dec 22 '21 16:12 matklad

If the deserialization order is checked, this will break a lot of things if we update.

Why would this be the case? It'll only break cases where you manually construct a Vec<u8> with a representation that is different from what you get when you serialize map. We do sort on serialization.

ohhh my apologies, you are correct that this wouldn't break. I'm not quite sure there why I assumed this sorting didn't happen, and skipped over that detail in the description.

Do I have it correctly that you would like to remove the impl for binary heap and add the check in deserialization of HashMap and HashSet?

A negative I see is that there will be extra costs to verifying order on deserialization that isn't necessary in most cases, especially if a contract developer is certain of the ordering.

Also, I can see there being an issue with the fact that serialized bytes that were constructed in the wrong order will be asymmetric where the bytes after deserialize/serialize aren't equal, but there are also cases where one might want to serialize a vec of elements and then deserialize as a set/map to just use unique values. In this case it would be breaking and is a pattern that might be expected to work since it does with other serialization protocols and the ordering constraint might not be clear.

I'm for this change by the way, just want to make sure we consider the counter points.

austinabell avatar Dec 22 '21 18:12 austinabell

A negative I see is that there will be extra costs to verifying order on deserialization that isn't necessary in most cases, especially if a contract developer is certain of the ordering.

Yeah, that's true -- if you are certain of the ordering, and you would love to avoid incurring the costs, you need to use Vec<_> directly.

Also, I can see there being an issue with the fact that serialized bytes that were constructed in the wrong order will be asymmetric where the bytes after deserialize/serialize aren't equal

Yeah, that's the positive motivation for this change. Specifically, the current behavior might cause pretty bad bugs when you require to sign/encrypt messages, and also want to enforce idempotence: you can encode the same message in two ways, and get different signatures.

In this case it would be breaking and is a pattern that might be expected to work since it does with other serialization protocols and the ordering constraint might not be clear.

Yeah, that's true -- you'd have to write your own type to deserialize with duplicate elimination. But that's the reason why borsh exists -- in a sense, borsh is "bincode + canonicity", if you don't need canonicity, you can use bincode.

matklad avatar Dec 23 '21 15:12 matklad

I have implemented that BinaryHeap when I was adding BTreeMap support in #6. I should not have done that. I vote for removing support for BinaryHeap since indeed I don't think anyone would actually use it.

frol avatar Dec 23 '21 16:12 frol

Yeah, that's true -- if you are certain of the ordering, and you would love to avoid incurring the costs, you need to use Vec<_> directly.

The only part to this that I do want to stress is that people currently using any map or set in storage in the SDK, or otherwise, will be incurring more costs subtly without any change on their end.

I wonder if it's worth considering having this order checked deserialization as a feature. The negative to having it as a default is just that you can't actually hit the case of asymmetric (de)ser unless you use another type like Vec. This adds overhead for every usage of these, which maybe the canonical attribute is imperative, but seems like it's more of a reason to check the ordering only when you are accepting external input like the case of signatures on a set.

austinabell avatar Dec 23 '21 18:12 austinabell

taking this issue

dj8yfo avatar Jun 23 '23 13:06 dj8yfo