tmkms
tmkms copied to clipboard
Protobuf: buffer underflow
Chains with:
-
github.com/tendermint/[email protected]
- blocks every
~350ms
Every ~40 sigratures (softsign due to low time blocks) connection go down with this error:
-
protocol error: malformed message packet: failed to decode Protobuf message: buffer underflow
Let me know if you need any more details.
Thanks @tony-iqlusion
What is on the other side of the privval connection? You're the first to report an error like this in many years of TMKMS existing, so it seems like the other side is the likely culprit.
It's a testnet (Sei Network). Other 3/4 guys can confirm the same issue (like @mkaczanowski)
Hi, I can confirm that we have the same problem, afaik "buffer underflow" indicates an attempt to read a message that is larger than the amount of data available in the buffer (on tmkms side)
It indicates the data is misframed. It seems very unlikely this is a TMKMS bug, and something wrong with the specific app you are trying to run.
It indicates the data is misframed. It seems very unlikely this is a TMKMS bug, and something wrong with the specific app you are trying to run.
I'm not a Horcrux user, but it works fine in this chain too. What can we provide you to analyze the cause?
I need a much better description of the problem than this, and instructions on how to reproduce it. Ideally it would be nice to have a dump of the incoming Protobuf which TMKMS can't parse.
What chain are you running?
How did you install it?
How are you running it?
Which message is causing the problem? Can you link to the Protobuf Schema for it?
Expanding a bit on this issue, we, too, faced this issue on Sei testnet (atlantic-2
).
From what I understand, Sei has its own fork of tendermint - https://github.com/sei-protocol/sei-tendermint. So I believe the schema is https://github.com/sei-protocol/sei-tendermint/tree/main/proto, but I'm not 100% sure.
As for us, we were running tmkms as usual, as a separate process, and pointing seid
(https://github.com/sei-protocol/sei-chain) to it via config parameter (which sets unix socket). The mode was softsign
.
I'm curious if they're expecting to use gRPC for the privval
connection: https://github.com/sei-protocol/sei-tendermint/tree/main/privval
We don't support that yet: #73. We've been waiting for it to be added to upstream Tendermint, since it was removed in v0.35.
What version of Horcrux is it working with?
My idea is to prepare you an environment with everything configured (chain/validator/tmkms) and give you access. Can you be interested?
That would be very helpful, thank you
That would be very helpful, thank you
I'm getting that server ready for you. Where I can send you logins (SSH)? If you want to contact me on Discord (Simo | Active Nodes#3233) or give me an email. Thanks!
I will attempt to contact you via Discord
I will attempt to contact you via Discord
Changed permissions... Could you try again with Discord (Simo | Active Nodes#3233
)?
I'm not using Signal...
Thanks
Request sent
Hey @tony-iqlusion , Any update on this issue ? Let me know how I can help to deal with, thanks
Note. regarding previous tests and SSH access, now devnet is suspended. So we have to use testnet or mainnet chain
I haven't had time to look, sorry. Note that there's a Hub upgrade today and a Stride upgrade tomorrow so I don't have much time this week either.
What would be very, very helpful is if someone could log the message that TMKMS can't parse for inspection.
What would be very, very helpful is if someone could log the message that TMKMS can't parse for inspection.
Should we increase the debug level? Or how to collect as many logs as possible?
Thanks
You can try adding a dbg!(&msg)
here: https://github.com/iqlusioninc/tmkms/blob/b6ce617/src/rpc.rs#L36
Out of curiosity, do you have protocol_version = "0.34"
in the config?
You can try adding a
dbg!(&msg)
here: https://github.com/iqlusioninc/tmkms/blob/b6ce617/src/rpc.rs#L36
Thanks, I will try later in testnet
Out of curiosity, do you have
protocol_version = "0.34"
in the config?
Yes protocol_version = "v0.34"
(with the "v") ... I think I have all the chains like this. is it right?
Here: https://pastebin.com/4wDMbwP3
ERROR tmkms::client: [atlantic-2@tcp://127.0.0.1:26658] protocol error: malformed message packet: failed to decode Protobuf message: buffer underflow`
Thanks, I believe this should be the message which is causing the issue encoded in hex:
ba2b2ab72b0aa82b08201091a6cc0d20ffffffffffffffffff012a480a20a33e4740e0b974cf1e010e6794ff620487e52ad63718d77e46ba3a62e1adc406122408011220cf19b022e389f5cd701b17b800caf505171bb98ae967349da72513b2138c1820320c08f1fbf8a60610a7bab0920242220a20bb985439aebaf357fad861849528d82e763031abd79fcc87c3a305bc1485388442220a2039825ada697b2d6959d75fa711636c7627c450c87ed8fe1aeaba4f20e28d435042220a20d9c22318e69463191372f92f66e5bdbf859c9b0e52020753785df5f53ea4bacb42220a20fc7ae857c36f4ea2f56285281e283ccdd600f4934183574bf5fd0ebb4c278cf942220a20ab49231edaa3fec0921d42b85cf90ba61d5251144980fa16f62a842a201755ad42220a20a7a17a651bfc02228490723a752a96e9f8ac3d64d261377726e2cdb0ccc646e342220a203178335d0ebe7e0ce2905df5b23aa05b9295e09caf80ba45472d2f61ccc9fb2f42220a20887cddac11b90e7a71888fefec183600dd79b94e9c8c43fb24d66038970eb91242220a202bfaf4b02410e8ba736e0443d63543e855cb4fb1eb99522a66d764802392700e42220a209a3e9332f29019d03097bcf290f377939b6bf8ee6a9d1952920cb9433515ca9542220a20dd148712e5a43ca35a0ed3560d6e753598da4e2ff71bd16d9dd553eddc4adcde42220a207e1df69adb7f32259154e375e20e98cf0a1c95b9f78ef1349df3ab539ab615be42220a2070a1e82287b3775ab287f7fe4e803c9bce2b03cbb4d0dba83b86104a114cb6e542220a207783770f1b6f8bdb347a8a8f113ecc67842783b4afbebb608aced8a51bb29bce42220a2008f7c95794aec249d2eba33161199c28574bc77f13accff28e5d70c23c3ebbd642220a20b53a7d779ebd7d0778aa9bb66e4565fe80903cfac23268ec016f2cf00818be9042220a20fa2f7fe3daa69c0cdd44058a44734bb6d44fce3e0c4db585e77a5190869137a142220a20f26cd81b78fa41bd46df409302c976f3451b3f0065a0e1e3e8dcb9947fd7d07c42220a20d002698eebbfd05a20492eaae6ed769068fac6cf0f3a55b6869ae8cacf5b92bf42220a20c1a848f44dfc893511262c49dd48072eb50472d49398c44b27044a028c76113b42220a2089c0affd2d4c2151cc654eebec6f0d29302ed611664965616e534e3b854eb43f42220a2084df4847e6d5998ac091ff0d6905400c3c7264855277e88cf06a184b81d2676b42220a2010e5cb10974a839fdd1584f1efc256d20fbf9d710c40ac4b968e23cb4927125b42220a20b12308f12fac8ecdc13b84feba49bf617ae0fb084b8d5fa8e8c975ee7f38c7954a0052ad200890a6cc0d1a480a203dee68aaa35b98433eba6c0416ec13403e50e3a2afb6ed1beaca515bbd7a98fc
I'll try to take a look and see if I can figure out what it's supposed to be
If I plug that string into: https://protobuf-decoder.netlify.app/
...it doesn't appear to have a valid field number, which would need to match one of these field types (1 - 8):
https://docs.rs/tendermint-proto/latest/src/tendermint_proto/prost/v0_38/tendermint.privval.rs.html#79-96
...but the decoder says if that is a protobuf, its field number is 695, which would be a very unusual field number for a proto (they start at 1 and generally don't much higher than 30 typically)
So I have no idea what that message is supposed to be or how/why Horcrux would support it because it doesn't seem to be encoded correctly as a proto as far as I can tell.
It's possible Horcrux simply ignores messages it doesn't understand, whereas TMKMS considers it an error.
Does anyone happen to know what that message might be, or know where to ask about it?
I don't really know too, if you could ignore this message, and deploy the code on a new branch, I could test and compile the tmkms on a machine and do the test on the testnet. We could ask Sei dev, but I'm not sure they know what it means either.
I found the issue and our (Chorus One) testnet node runs with the patched / hacked tmkms version and everything seems to be fine.
TL;DR
The issue is that the serialized block proposal (SIGNED_MSG_TYPE_PROPOSAL
) for weights from 4000
bytes to 6000
(eyeballed logs with additional log messages). The buffer defined below is 1024
bytes only:
https://github.com/iqlusioninc/tmkms/blob/main/src/rpc.rs#L197
resulting in message being always cut, thus malformed.
Proof of concept fix
I simply bumped the buffer size to:
let mut buf = vec![0; DATA_MAX_SIZE * 10];
Proper fix
Now I wonder whether this limit is actually correct in this context: https://docs.rs/tendermint-p2p/latest/tendermint_p2p/secret_connection/constant.DATA_MAX_SIZE.html
because this is not strictly p2p message (between peers in tendermint network), but rather a node <-> signer
relationship.
Now the question is how to fix it, shall I make the buffer len configurable or hardcode higher value, or does anyone has a better idea?
another option to fix it, could be simply using dynamic allocation via Vec::new()
:
https://doc.rust-lang.org/std/io/trait.Read.html#method.read_to_end
@tony-iqlusion could you share your thoughts here?
@mkaczanowski sounds like a bug/issue in the tendermint-p2p
crate. I would suggest opening an upstream issue on tendermint-rs
issue created: https://github.com/informalsystems/tendermint-rs/issues/1356
funny enough this value comes from tmkms: https://github.com/informalsystems/tendermint-rs/pull/696/files
(git blame)
Yeah, all of tendermint-rs was originally extracted from TMKMS.
I don't know the origins of this particular constant but regardless tendermint-rs is the right place for a fix.
Thanks for filing the issue.