borsh-rs feat: add BigDecimal support

feat: add BigDecimal support

Open telezhnaya opened this issue 2 years ago • 14 comments

I need it for our new version of Indexer https://github.com/telezhnaya/near-lake-flows-into-sql-base

Mar 22 '22 17:03 telezhnaya

@matklad @frol thanks a lot for the reviews! Please have a look again

Mar 23 '22 13:03 telezhnaya

This moves in the right direction, but I think it still doesn’t enforce canonicity/consistency/1-to-1

Consistency means there is a bijective mapping between objects and their binary representations. There is no two binary representations that deserialize into the same object. This is extremely useful for applications that use binary representation to compute hash;

in other words, we care about two properties:

if a == b, then serialized(a) == serialized(b). This one we care super strongly about.

if deserialize(a) == deserialize(b), then a == b. This one we care about strongly, but there are a couple of open bugs about that.

The current implementation seems to violate both. In particular, if we look at the impl for Eq for big decimal, we see that it canonicalizes the representation on comparison:

https://github.com/akubera/bigdecimal-rs/blob/272123eddc02aad5d1a331ded8eb6605ed2bea96/src/lib.rs#L797

so it seems that we need to do some extra legwork on outside to achieve canonicity. I think what we want to achieve is to use to_le_bytes serialization, and enforce that both the first and the last serializes bytes are non-zero.

Mar 23 '22 14:03 matklad

@matklad

In particular, if we look at the impl for Eq for big decimal, we see that it canonicalizes the representation on comparison: ...

Am I right that eq could be simplified by invoking normalized on both parameters and then just simply comparing the fields?

to use to_le_bytes serialization

You mean to_bytes_le, right?

and enforce that both the first and the last serializes bytes are non-zero.

Sorry I can't catch this idea. Do we need non-zero the first and the last bytes because we need to know the borders? What if we have zero byte in the middle of the vector?

Mar 23 '22 18:03 telezhnaya

oh oops, forgot to note the canonical form of these. Let me look more in depth now

Sorry I can't catch this idea. Do we need non-zero the first and the last bytes because we need to know the borders? What if we have zero byte in the middle of the vector?

The issue is just if there are zeros padding the right side of the le bytes. [1, 0] is the same as [1] and [1, 0, 0]

You can't assert the first byte is non-zero because 256 in le bytes is [0, 1]

Mar 23 '22 20:03 austinabell

Am I right that eq could be simplified by invoking normalized on both parameters and then just simply comparing the fields?

yup, that would be logically correct but (I think) would be slower than the current approach (level on confidence: 0.6)

Mar 23 '22 20:03 matklad

Please have a look again

Mar 24 '22 15:03 telezhnaya

I added the check you wanted

Mar 25 '22 06:03 telezhnaya

Current state doesn't ensure canonicity for bigints -- there are two ways to encode 0.

I've added a fuzz-test to uncover this case automatically, but it hit overflow in the underlying bigdecimal library.

See the repros in

https://github.com/matklad/borsh-rs/commit/a6e50f57883848874c5b9daf017b5f3c3bb9442c

Mar 29 '22 13:03 matklad

I just want to sum it up for the future: we decided that we don't need BigDecimal support in the indexer, and thus this PR stuck with some of the concerns from the reviewers not being addressed. If someone else needs it in the future, feel free to take it from here.

Aug 04 '22 11:08 frol

I just want to sum it up for the future: we decided that we don't need BigDecimal support in the indexer, and thus this PR stuck with some of the concerns from the reviewers not being addressed. If someone else needs it in the future, feel free to take it from here.

I think it's worth doing, seems like a common use case that someone would want to use BigIntegers for contracts.

my plan for if I do (make a comment if you disagree with anything):

Make canonical BigUInt 0 integer be serialized as [] ([0,0,0,0] in borsh because of unfortunate u32 length serialization), reject padding
Split feature into num_bigint and bigdecimal and add support for BigUInt
Explore maximums that can be serialized with borsh; consider making it error gracefully if it hits the cap
- Seems like the max of the bigint is usize::MAX*32 bits, but the max that can be serialized is u32::MAX*8 bits (edge case I know but https://github.com/rust-num/num-bigint/blob/63d96021248d6f96e31c53ff69875f7a40a90ea6/src/biguint/convert.rs#L584 where they max the buffer size without erroring seems like the overflow Aleksey hit above)

Aug 04 '22 15:08 austinabell

Okay, so the changes for switching to varints are the commits 21d4857 (#91) and bb625ee (#91)

It's not clear to me the complexity is worth it, and we could compact the number of bytes even further if we wanted (using a full byte for the sign, which should require 2 bits max and maybe could condense the actual value encoding). However, it seems like a clear win for optimizing encoding size. I don't hold opinions on this, so please share any reasons why we shouldn't do this.

Oct 13 '22 03:10 austinabell

@austinabell could you write a public doc comment somewhere which specifies how exactly we serialize things? This is two kill two birds with one stone:

make it easier to review code, by spelling out what it intends to do
borsh is language-independent format, so what we are doing here is not adding support for bigint library to borsh-rs crate, but rather adding support for arbitrary large integers to the borsh format. For the latter, I think it's useful to have docs, so that, eg, js version of borsh can add support for bigints.

I do think that we want to support this capability in the format.

Oct 13 '22 09:10 matklad

@austinabell could you write a public doc comment somewhere which specifies how exactly we serialize things? This is two kill two birds with one stone:

make it easier to review code, by spelling out what it intends to do

borsh is language-independent format, so what we are doing here is not adding support for bigint library to borsh-rs crate, but rather adding support for arbitrary large integers to the borsh format. For the latter, I think it's useful to have docs, so that, eg, js version of borsh can add support for bigints.

I do think that we want to support this capability in the format.

My thinking is that it would only be worth it for big int and not big decimal, would you agree? BigDecimal is just an abstraction on top of existing types, so maybe makes sense for it to live outside of borsh implementations?

Also, because it seems there are bugs in the bigdecimal impl (at least https://github.com/akubera/bigdecimal-rs/issues/94 https://github.com/akubera/bigdecimal-rs/issues/93, but I see others) and it's an opinionated implementation, I'd be a proponent of just removing this from the PR. Technically the same "decimal" number can have multiple serialization formats (since the integer can just be multiplied/divided by 10 for every exponent change), so it feels like we should just avoid this can of worms altogether. One could just add the borsh impls to that library or create their own type for this.

As for the BigUint implementation, it is very similar to bytes deserialization except that the LE bytes length is varint rather than u32 (which bytes probably should have been, to begin with). I can add that specification to https://github.com/near/borsh once it's decided if we want to keep/include bigdecimal.

Oct 13 '22 17:10 austinabell

:+1:on removing bigdecimal from this PR.

Oct 14 '22 08:10 matklad

BigDecimal is opinionated and not defined as borsh type yet, so I will close this PR for now until we absolutely need it (same as #109)

May 25 '23 09:05 frol

borsh-rs borsh-rs copied to clipboard

feat: add BigDecimal support

borsh-rs
borsh-rs copied to clipboard