bug: request response protocol outbound failure, UnexpectedEof
Summary
We have a server that is connected to thousands of peers and uses the request-response protocol to send messages to them.
We are getting a lot of OutboundFailure events from the request-response protocol with the following error:
Io(Custom { kind: UnexpectedEof, error: Eof { name: "array", expect: Small(1) } })
At high loads, more than half of our messages get this error.
The maximum number of concurrent streams is set to 20000 in the behaviour config.
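For context, this is roughly how the behaviour is wired up on our side (a simplified sketch, not our actual code; the request/response types and the protocol string are placeholders):

use std::iter;

use libp2p::{request_response, StreamProtocol};
use serde::{Deserialize, Serialize};

// placeholder request/response types; the real ones are cbor-serializable structs
#[derive(Debug, Serialize, Deserialize)]
struct MyRequest(Vec<u8>);
#[derive(Debug, Serialize, Deserialize)]
struct MyResponse(Vec<u8>);

fn build_reqres() -> request_response::cbor::Behaviour<MyRequest, MyResponse> {
    request_response::cbor::Behaviour::new(
        iter::once((
            StreamProtocol::new("/placeholder-app/0.5"), // placeholder protocol string
            request_response::ProtocolSupport::Full,
        )),
        // 20000 max concurrent streams, as mentioned above
        request_response::Config::default().with_max_concurrent_streams(20_000),
    )
}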
Any ideas what may be the cause of this?
Expected behavior
We should not get EoF errors, as noted in the summary section.
Actual behavior
We get EoF errors, as noted in the summary section.
Relevant log output
[2025-05-27T11:33:15Z ERROR dkn_rpc::node::reqres::task] Task 01970504-d4f6-7000-8ac0-299950153a46/01970505-32b4-7043-80ac-9fc3cd707371 from 16Uiu2HAmVAfDDmVSK8ScJf1yrrWiYSVPVy7QBbNcJEhLv5iXQomM failed without a response: IO error on outbound stream: Eof { name: "array", expect: Small(1) }
Possible Solution
Version
0.55.0
Would you like to work on fixing this bug?
No
EoF errors can be caused, e.g., by requests that are larger than the maximum request size (see request_response::cbor or request_response::json), or by non-matching protocol strings (see #5815). Can you double check that neither is the case?
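For the protocol-string part: the string has to match byte-for-byte on both sides. A common pattern is a single shared constant, e.g. (sketch with a placeholder string):

use libp2p::StreamProtocol;

// shared by requester and responder; if the two sides register different strings
// (e.g. "/app/0.5" vs "/app/0.5.1"), negotiation fails and the outbound stream is
// closed early, which surfaces as an EoF error
pub const REQRES_PROTOCOL: StreamProtocol = StreamProtocol::new("/app/0.5");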
Hey @elenaf9 !
I have checked both:
- I'm using cbor by the way, and a message that we have seen fail was just 700 bytes, well below the 1 MiB limit for a request.
- The protocol strings are definitely matching; we are using a common crate to get the string for both the requester and responder in our case.
I will keep looking to see if I'm mistaken on any of these, but it doesn't seem to be the case. Thank you, looking forward to more ideas!
Also, I couldn't see how I can change the request / response size limits while creating the behaviour. 🤔
Hmm okay. Can you please provide some more logs before and after the error appears?
Also, I couldn't see how I can change the request / response size limits while creating the behaviour. 🤔
Unfortunately that's not released yet. But if you change your dependency to the current master, it can be done with
let codec = cbor::Codec::default().set_request_size_maximum(...);
let behavior = Behaviour::with_codec(codec, ...);
Note that there is also a size limit on responses.
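Expanded a little, the whole thing would look roughly like this (a sketch against current master as in the snippet above; MyRequest/MyResponse, the limits, and the protocol string are placeholders):

// placeholder limits: 10 MiB for requests, 1 MiB for responses
let codec = request_response::cbor::Codec::<MyRequest, MyResponse>::default()
    .set_request_size_maximum(10 * 1024 * 1024)
    .set_response_size_maximum(1024 * 1024);

let behaviour = request_response::Behaviour::with_codec(
    codec,
    std::iter::once((
        StreamProtocol::new("/app/0.5"), // must match the remote side exactly
        request_response::ProtocolSupport::Full,
    )),
    request_response::Config::default(),
);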
The thing is, I'm unable to reproduce this locally so far; it only happens on our production machine, and that one has a lot of peers connected to it, so it would be hard to print any logs.
The server code itself is closed, the client is not but I can provide the following info:
- Both server and client are using this for their protocol strings, so if their major & minor versions match they should definitely be using the same protocol. And we are sure that they match because clients discover the servers via this. Furthermore, if a client with a different version tries to connect, we detect it within our identify event handler and check whether they match:
identify::Event::Received { info, peer_id, .. } => {
if info.protocol_version != self.config.protocol.identity {
// protocols do not match, disconnect this bad peer
log::warn!(
"Identify: Peer {peer_id} has different Identify protocol: (them {}, you {})",
info.protocol_version,
self.config.protocol.identity
);
// we disconnect here, but removal happens in the connection closed event
if self.swarm.disconnect_peer_id(peer_id).is_ok() {
// returns Ok if the peer was connected before; otherwise we don't need to log anything
log::warn!("Disconnected peer {peer_id} due to protocol mismatch");
}
} else {
// ensure the key is from secp256k1, and obtain its address from it
let pk = libsecp256k1::PublicKey::parse_compressed(
&info
.public_key
.try_into_secp256k1()
.wrap_err("expected secp256k1 public key")?
.to_bytes(),
)
.wrap_err("could not parse public key")?;
// convert the public key to an address, and encode as hex string
// the hex is always lowercase via `hex::encode`
let address = public_key_to_address(&pk);
let address_hex = hex::encode(address);
self.peers.add_peer(peer_id, address_hex);
}
}
- We have a handler so that if the connection to a server is disrupted, the client first tries to dial it again, and if that fails as well it will get a new address (a rough sketch of the fallback follows this list). So what most likely happens is that the server gets the EoF when sending a request, while the clients have no issues sending requests to that server.
- I will try to come up with a reproducible example in a smaller repo, using the same behaviour etc., to see if I can find any leads.
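The fallback itself looks roughly like this (heavily simplified sketch; pick_new_rpc_address is a placeholder for the discovery-service call, not our actual code):

// in the client's swarm event loop: the re-dial to the previous RPC failed,
// so ask the discovery service for a fresh RPC address and dial that instead
SwarmEvent::OutgoingConnectionError { peer_id, error, .. } => {
    log::warn!("Dial to {peer_id:?} failed: {error}");
    if let Some(addr) = self.pick_new_rpc_address().await {
        let _ = self.swarm.dial(addr);
    }
}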
Thank you!
would like to work on this
The thing is, I'm unable to reproduce this locally so far; it only happens on our production machine, and that one has a lot of peers connected to it, so it would be hard to print any logs.
Just to be really sure, does the network contain peers other than DKN clients? For example sharing bootstrapping peers would lead to a non-homogeneous network.
Just to be really sure, does the network contain peers other than DKN clients? For example sharing bootstrapping peers would lead to a non-homogeneous network.
We use a discovery service to pick the servers (we call them RPC nodes) (see https://mainnet.dkn.dria.co/discovery/v0/available-nodes/0.5), and then the compute nodes pick the first RPC there with the fewest peers to connect to. They dial that RPC explicitly (see here) and then check their connectivity every once in a while. If the connection is closed for some reason, they will get another RPC to dial from this service.
Our nodes also handle disconnects with cause (see here). They re-dial if the connection was closed for some reason. We noticed that when the connection is closed by the server with an explicit disconnect call, the cause in that event is None. This is why the nodes only re-dial if there is Some cause.
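Concretely, that decision is based on the cause field, roughly like this (simplified sketch, not the exact code):

SwarmEvent::ConnectionClosed { peer_id, cause, .. } => {
    match cause {
        // closed for an actual reason (error, timeout, ...): try re-dialing the same RPC
        Some(err) => {
            log::warn!("Connection to {peer_id} closed with cause: {err}; re-dialing");
            let _ = self.swarm.dial(peer_id);
        }
        // cause is None when the server disconnected us explicitly,
        // so we do not re-dial in that case
        None => {}
    }
}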
Nevertheless, the nodes check the identify events and ensure all protocols are matching, so we can assume all nodes belong to DKN.
All servers basically form a star topology with the clients they are connected to; two different servers do not talk to each other, and likewise different clients do not interact with each other.
Would love to hear any ideas, really; this error alone is killing our entire network. Approximately 90 out of every 100 requests fail with the EoF error. We can't reproduce it with local machines or dummy machines on AWS.
Anything that comes to your mind about networking stuff that may cause this? We have a lot of people running nodes from various VPS providers like Hetzner, Contabo etc.; could they require a setting or such?
Anything that comes to your mind about networking stuff that may cause this?
The only thing related to request-response that comes to my mind is #5858. The problem was introduced in libp2p-swarm==0.46.0, and later fixed in libp2p-swarm==0.46.1.
The only thing related to request-response that comes to my mind is #5858. The problem was introduced in libp2p-swarm==0.46.0, and later fixed in libp2p-swarm==0.46.1.
Thanks, will check!
Would cbor vs json make any difference btw?
Would cbor vs json make any difference btw?
What do you mean by difference? The issue (#5858) is related to substream negotiation; I don't think changing the codec will make any difference. And in general I personally don't think the codec is involved in this issue (#6038).
My guess was unrelated to the issue you linked above; instead I was suspicious of this line: https://github.com/libp2p/rust-libp2p/blob/master/protocols/request-response/src/cbor.rs#L129, which then leads to https://github.com/libp2p/rust-libp2p/blob/master/protocols/request-response/src/cbor.rs#L190.