tmkms icon indicating copy to clipboard operation
tmkms copied to clipboard

Protobuf: buffer underflow

Open activenodes opened this issue 1 year ago • 65 comments

Chains with:

Every ~40 sigratures (softsign due to low time blocks) connection go down with this error:

  • protocol error: malformed message packet: failed to decode Protobuf message: buffer underflow

Let me know if you need any more details.

Thanks @tony-iqlusion

activenodes avatar May 26 '23 20:05 activenodes

What is on the other side of the privval connection? You're the first to report an error like this in many years of TMKMS existing, so it seems like the other side is the likely culprit.

tony-iqlusion avatar May 26 '23 20:05 tony-iqlusion

It's a testnet (Sei Network). Other 3/4 guys can confirm the same issue (like @mkaczanowski)

activenodes avatar May 26 '23 20:05 activenodes

Hi, I can confirm that we have the same problem, afaik "buffer underflow" indicates an attempt to read a message that is larger than the amount of data available in the buffer (on tmkms side)

imperator-co avatar May 26 '23 23:05 imperator-co

It indicates the data is misframed. It seems very unlikely this is a TMKMS bug, and something wrong with the specific app you are trying to run.

tony-iqlusion avatar May 27 '23 04:05 tony-iqlusion

It indicates the data is misframed. It seems very unlikely this is a TMKMS bug, and something wrong with the specific app you are trying to run.

I'm not a Horcrux user, but it works fine in this chain too. What can we provide you to analyze the cause?

activenodes avatar May 27 '23 08:05 activenodes

I need a much better description of the problem than this, and instructions on how to reproduce it. Ideally it would be nice to have a dump of the incoming Protobuf which TMKMS can't parse.

What chain are you running?

How did you install it?

How are you running it?

Which message is causing the problem? Can you link to the Protobuf Schema for it?

tony-iqlusion avatar May 27 '23 13:05 tony-iqlusion

Expanding a bit on this issue, we, too, faced this issue on Sei testnet (atlantic-2).

From what I understand, Sei has its own fork of tendermint - https://github.com/sei-protocol/sei-tendermint. So I believe the schema is https://github.com/sei-protocol/sei-tendermint/tree/main/proto, but I'm not 100% sure.

As for us, we were running tmkms as usual, as a separate process, and pointing seid (https://github.com/sei-protocol/sei-chain) to it via config parameter (which sets unix socket). The mode was softsign.

qezz avatar May 30 '23 17:05 qezz

I'm curious if they're expecting to use gRPC for the privval connection: https://github.com/sei-protocol/sei-tendermint/tree/main/privval

We don't support that yet: #73. We've been waiting for it to be added to upstream Tendermint, since it was removed in v0.35.

What version of Horcrux is it working with?

tony-iqlusion avatar May 30 '23 17:05 tony-iqlusion

My idea is to prepare you an environment with everything configured (chain/validator/tmkms) and give you access. Can you be interested?

activenodes avatar May 30 '23 18:05 activenodes

That would be very helpful, thank you

tony-iqlusion avatar May 30 '23 18:05 tony-iqlusion

That would be very helpful, thank you

I'm getting that server ready for you. Where I can send you logins (SSH)? If you want to contact me on Discord (Simo | Active Nodes#3233) or give me an email. Thanks!

activenodes avatar May 31 '23 08:05 activenodes

I will attempt to contact you via Discord

tony-iqlusion avatar May 31 '23 20:05 tony-iqlusion

I will attempt to contact you via Discord

Changed permissions... Could you try again with Discord (Simo | Active Nodes#3233)? I'm not using Signal... Thanks

activenodes avatar May 31 '23 20:05 activenodes

Request sent

tony-iqlusion avatar May 31 '23 21:05 tony-iqlusion

Hey @tony-iqlusion , Any update on this issue ? Let me know how I can help to deal with, thanks

IbrarMakaveli avatar Aug 14 '23 23:08 IbrarMakaveli

Note. regarding previous tests and SSH access, now devnet is suspended. So we have to use testnet or mainnet chain

activenodes avatar Aug 16 '23 10:08 activenodes

I haven't had time to look, sorry. Note that there's a Hub upgrade today and a Stride upgrade tomorrow so I don't have much time this week either.

What would be very, very helpful is if someone could log the message that TMKMS can't parse for inspection.

tony-iqlusion avatar Aug 16 '23 11:08 tony-iqlusion

What would be very, very helpful is if someone could log the message that TMKMS can't parse for inspection.

Should we increase the debug level? Or how to collect as many logs as possible?

Thanks

activenodes avatar Aug 17 '23 14:08 activenodes

You can try adding a dbg!(&msg) here: https://github.com/iqlusioninc/tmkms/blob/b6ce617/src/rpc.rs#L36

Out of curiosity, do you have protocol_version = "0.34" in the config?

tony-iqlusion avatar Aug 17 '23 15:08 tony-iqlusion

You can try adding a dbg!(&msg) here: https://github.com/iqlusioninc/tmkms/blob/b6ce617/src/rpc.rs#L36

Thanks, I will try later in testnet

Out of curiosity, do you have protocol_version = "0.34" in the config?

Yes protocol_version = "v0.34" (with the "v") ... I think I have all the chains like this. is it right?

activenodes avatar Aug 17 '23 15:08 activenodes

Here: https://pastebin.com/4wDMbwP3

ERROR tmkms::client: [atlantic-2@tcp://127.0.0.1:26658] protocol error: malformed message packet: failed to decode Protobuf message: buffer underflow`

activenodes avatar Aug 17 '23 15:08 activenodes

Thanks, I believe this should be the message which is causing the issue encoded in hex:

ba2b2ab72b0aa82b08201091a6cc0d20ffffffffffffffffff012a480a20a33e4740e0b974cf1e010e6794ff620487e52ad63718d77e46ba3a62e1adc406122408011220cf19b022e389f5cd701b17b800caf505171bb98ae967349da72513b2138c1820320c08f1fbf8a60610a7bab0920242220a20bb985439aebaf357fad861849528d82e763031abd79fcc87c3a305bc1485388442220a2039825ada697b2d6959d75fa711636c7627c450c87ed8fe1aeaba4f20e28d435042220a20d9c22318e69463191372f92f66e5bdbf859c9b0e52020753785df5f53ea4bacb42220a20fc7ae857c36f4ea2f56285281e283ccdd600f4934183574bf5fd0ebb4c278cf942220a20ab49231edaa3fec0921d42b85cf90ba61d5251144980fa16f62a842a201755ad42220a20a7a17a651bfc02228490723a752a96e9f8ac3d64d261377726e2cdb0ccc646e342220a203178335d0ebe7e0ce2905df5b23aa05b9295e09caf80ba45472d2f61ccc9fb2f42220a20887cddac11b90e7a71888fefec183600dd79b94e9c8c43fb24d66038970eb91242220a202bfaf4b02410e8ba736e0443d63543e855cb4fb1eb99522a66d764802392700e42220a209a3e9332f29019d03097bcf290f377939b6bf8ee6a9d1952920cb9433515ca9542220a20dd148712e5a43ca35a0ed3560d6e753598da4e2ff71bd16d9dd553eddc4adcde42220a207e1df69adb7f32259154e375e20e98cf0a1c95b9f78ef1349df3ab539ab615be42220a2070a1e82287b3775ab287f7fe4e803c9bce2b03cbb4d0dba83b86104a114cb6e542220a207783770f1b6f8bdb347a8a8f113ecc67842783b4afbebb608aced8a51bb29bce42220a2008f7c95794aec249d2eba33161199c28574bc77f13accff28e5d70c23c3ebbd642220a20b53a7d779ebd7d0778aa9bb66e4565fe80903cfac23268ec016f2cf00818be9042220a20fa2f7fe3daa69c0cdd44058a44734bb6d44fce3e0c4db585e77a5190869137a142220a20f26cd81b78fa41bd46df409302c976f3451b3f0065a0e1e3e8dcb9947fd7d07c42220a20d002698eebbfd05a20492eaae6ed769068fac6cf0f3a55b6869ae8cacf5b92bf42220a20c1a848f44dfc893511262c49dd48072eb50472d49398c44b27044a028c76113b42220a2089c0affd2d4c2151cc654eebec6f0d29302ed611664965616e534e3b854eb43f42220a2084df4847e6d5998ac091ff0d6905400c3c7264855277e88cf06a184b81d2676b42220a2010e5cb10974a839fdd1584f1efc256d20fbf9d710c40ac4b968e23cb4927125b42220a20b12308f12fac8ecdc13b84feba49bf617ae0fb084b8d5fa8e8c975ee7f38c7954a0052ad200890a6cc0d1a480a203dee68aaa35b98433eba6c0416ec13403e50e3a2afb6ed1beaca515bbd7a98fc

I'll try to take a look and see if I can figure out what it's supposed to be

tony-iqlusion avatar Aug 17 '23 17:08 tony-iqlusion

If I plug that string into: https://protobuf-decoder.netlify.app/

...it doesn't appear to have a valid field number, which would need to match one of these field types (1 - 8):

https://docs.rs/tendermint-proto/latest/src/tendermint_proto/prost/v0_38/tendermint.privval.rs.html#79-96

...but the decoder says if that is a protobuf, its field number is 695, which would be a very unusual field number for a proto (they start at 1 and generally don't much higher than 30 typically)

So I have no idea what that message is supposed to be or how/why Horcrux would support it because it doesn't seem to be encoded correctly as a proto as far as I can tell.

It's possible Horcrux simply ignores messages it doesn't understand, whereas TMKMS considers it an error.

tony-iqlusion avatar Aug 17 '23 17:08 tony-iqlusion

Does anyone happen to know what that message might be, or know where to ask about it?

tony-iqlusion avatar Aug 17 '23 20:08 tony-iqlusion

I don't really know too, if you could ignore this message, and deploy the code on a new branch, I could test and compile the tmkms on a machine and do the test on the testnet. We could ask Sei dev, but I'm not sure they know what it means either.

IbrarMakaveli avatar Aug 19 '23 23:08 IbrarMakaveli

I found the issue and our (Chorus One) testnet node runs with the patched / hacked tmkms version and everything seems to be fine.

TL;DR

The issue is that the serialized block proposal (SIGNED_MSG_TYPE_PROPOSAL) for weights from 4000 bytes to 6000 (eyeballed logs with additional log messages). The buffer defined below is 1024 bytes only: https://github.com/iqlusioninc/tmkms/blob/main/src/rpc.rs#L197

resulting in message being always cut, thus malformed.

Proof of concept fix

I simply bumped the buffer size to:

let mut buf = vec![0; DATA_MAX_SIZE * 10];

Proper fix

Now I wonder whether this limit is actually correct in this context: https://docs.rs/tendermint-p2p/latest/tendermint_p2p/secret_connection/constant.DATA_MAX_SIZE.html

because this is not strictly p2p message (between peers in tendermint network), but rather a node <-> signer relationship.

Now the question is how to fix it, shall I make the buffer len configurable or hardcode higher value, or does anyone has a better idea?

another option to fix it, could be simply using dynamic allocation via Vec::new(): https://doc.rust-lang.org/std/io/trait.Read.html#method.read_to_end

@tony-iqlusion could you share your thoughts here?

mkaczanowski avatar Sep 17 '23 01:09 mkaczanowski

@mkaczanowski sounds like a bug/issue in the tendermint-p2p crate. I would suggest opening an upstream issue on tendermint-rs

tony-iqlusion avatar Sep 18 '23 15:09 tony-iqlusion

issue created: https://github.com/informalsystems/tendermint-rs/issues/1356

mkaczanowski avatar Sep 18 '23 16:09 mkaczanowski

funny enough this value comes from tmkms: https://github.com/informalsystems/tendermint-rs/pull/696/files

(git blame)

mkaczanowski avatar Sep 18 '23 16:09 mkaczanowski

Yeah, all of tendermint-rs was originally extracted from TMKMS.

I don't know the origins of this particular constant but regardless tendermint-rs is the right place for a fix.

Thanks for filing the issue.

tony-iqlusion avatar Sep 18 '23 16:09 tony-iqlusion