
bug: request response protocol outbound failure, UnexpectedEof

Open erhant opened this issue 6 months ago • 12 comments

Summary

We have a server that is connected to thousands of peers and uses the request-response protocol to send messages to them.

We are seeing many `OutboundFailure` events from the request-response protocol with the following error:

Io(Custom { kind: UnexpectedEof, error: Eof { name: "array", expect: Small(1) } })

At high loads, more than half of our messages fail with this error.
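For context on the error itself: `UnexpectedEof` is the standard-library `std::io::ErrorKind` raised when a reader reaches end-of-stream while the caller still expects more bytes — here the CBOR decoder expected an array header, but the substream had already ended. A minimal stdlib-only sketch of that condition (no libp2p involved):

```rust
use std::io::{Cursor, ErrorKind, Read};

fn main() {
    // The decoder expects at least one more byte (e.g. a CBOR array header),
    // but the peer has already closed its side of the stream.
    let mut closed_stream = Cursor::new(Vec::<u8>::new());
    let mut byte = [0u8; 1];
    let err = closed_stream.read_exact(&mut byte).unwrap_err();
    assert_eq!(err.kind(), ErrorKind::UnexpectedEof);
}
```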

The maximum number of concurrent streams is set to 20000 in the behaviour config.
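For reference, the "max concurrent streams" setting mentioned above would be configured roughly like this (a sketch assuming the `request_response::Config` builder API from recent releases; the timeout value is illustrative and not from the actual server code):

```rust
use std::time::Duration;
use libp2p::request_response;

// Sketch only: configure the behaviour with a high concurrent-stream limit.
// `with_max_concurrent_streams` / `with_request_timeout` are assumed from the
// request_response::Config builder API; the timeout value is made up here.
let config = request_response::Config::default()
    .with_max_concurrent_streams(20_000)
    .with_request_timeout(Duration::from_secs(60));
```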

Any ideas what may be the cause of this?

Expected behavior

Should not get EOF errors, as noted in the summary section.

Actual behavior

Requests fail with EOF errors, as noted in the summary section.

Relevant log output

[2025-05-27T11:33:15Z ERROR dkn_rpc::node::reqres::task] Task 01970504-d4f6-7000-8ac0-299950153a46/01970505-32b4-7043-80ac-9fc3cd707371 from 16Uiu2HAmVAfDDmVSK8ScJf1yrrWiYSVPVy7QBbNcJEhLv5iXQomM failed without a response: IO error on outbound stream: Eof { name: "array", expect: Small(1) }

Possible Solution

Version

0.55.0

Would you like to work on fixing this bug?

No

erhant avatar May 27 '25 12:05 erhant

EoF errors can be caused, e.g., by requests that are larger than the maximum request size (see request_response::cbor or request_response::json), or by non-matching protocol strings (see #5815). Can you double check that neither is the case?
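To illustrate the first cause: the codecs read each message through a length-limited reader, so a message above the limit looks to the decoder like a stream that ends early. A stdlib-only sketch of that truncation effect (the 4-byte limit is hypothetical, standing in for the codec's size maximum):

```rust
use std::io::{ErrorKind, Read};

fn main() {
    // Hypothetical 4-byte size limit applied to an 8-byte payload:
    // the reader is cut off mid-message, so decoding sees an early EOF.
    let payload = [1u8, 2, 3, 4, 5, 6, 7, 8];
    let mut limited = payload.as_slice().take(4);
    let mut buf = [0u8; 8];
    let err = limited.read_exact(&mut buf).unwrap_err();
    assert_eq!(err.kind(), ErrorKind::UnexpectedEof);
}
```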

elenaf9 avatar May 28 '25 12:05 elenaf9

Hey @elenaf9 !

I have checked both,

  • I'm using cbor, by the way, and a message that we have seen fail is just 700 bytes, well below the 1 MiB limit for a request.
  • The protocol strings definitely match; we use a common crate to get the string for both the requester and the responder.

I will keep looking to see if I'm mistaken on any of these, but it doesn't seem to be the case. Thank you, looking forward to more ideas!

Also, I couldn't see how I can change the request / response size limits while creating the behaviour. 🤔

erhant avatar May 28 '25 13:05 erhant

Hmm okay. Can you please provide some more logs before and after the error appears?

Also, I couldn't see how I can change the request / response size limits while creating the behaviour. 🤔

Unfortunately that's not released yet. But if you change your dependency to the current master, it can be done with

let codec = cbor::Codec::default().set_request_size_maximum(...);
let behaviour = Behaviour::with_codec(codec, ...);

Note that there is also a size limit on responses.

elenaf9 avatar May 28 '25 15:05 elenaf9

The thing is, I'm unable to reproduce this locally so far. It only happens on our production machine, and that one has a lot of peers connected to it, so it would be hard to print any logs.

The server code itself is closed, the client is not but I can provide the following info:

  • Both server and client use this for their protocol strings, so if their major & minor versions match they are definitely using the same protocol. And we are sure that they match, because clients discover the servers via this. Furthermore, if for some reason a client with a different version wants to connect, we detect it within our identify event handler and check whether they match:
            identify::Event::Received { info, peer_id, .. } => {
                if info.protocol_version != self.config.protocol.identity {
                    // protocols do not match, disconnect this bad peer
                    log::warn!(
                        "Identify: Peer {peer_id} has different Identify protocol: (them {}, you {})",
                        info.protocol_version,
                        self.config.protocol.identity
                    );

                    // we disconnect here, but removal happens in the connection-closed event
                    if self.swarm.disconnect_peer_id(peer_id).is_ok() {
                        // returns Ok if the peer was connected before; otherwise we don't need to log anything
                        log::warn!("Disconnected peer {peer_id} due to protocol mismatch");
                    }
                } else {
                    // ensure the key is from secp256k1, and obtain its address from it
                    let pk = libsecp256k1::PublicKey::parse_compressed(
                        &info
                            .public_key
                            .try_into_secp256k1()
                            .wrap_err("expected secp256k1 public key")?
                            .to_bytes(),
                    )
                    .wrap_err("could not parse public key")?;

                    // convert the public key to an address, and encode it as a hex string
                    // the hex is always lowercase via `hex::encode`
                    let address = public_key_to_address(&pk);
                    let address_hex = hex::encode(address);

                    self.peers.add_peer(peer_id, address_hex);
                }
            }
  • We have a handler so that if a server connection is disrupted, the client first tries to dial it again, and if that fails as well it gets a new address. So most likely the server is getting the EOF when sending a request, while the clients have no issues sending requests to that server.

  • I will try to come up with a reproducible example on a smaller repo, using the same behaviour etc. to see if I can find any leads.

Thank you!

erhant avatar May 28 '25 18:05 erhant

would like to work on this

rose2221 avatar May 29 '25 19:05 rose2221

The thing is, I'm unable to reproduce this locally so far. It only happens on our production machine, and that one has a lot of peers connected to it, so it would be hard to print any logs.

Just to be really sure, does the network contain peers other than DKN clients? For example, sharing bootstrap peers would lead to a non-homogeneous network.

drHuangMHT avatar May 30 '25 11:05 drHuangMHT

Just to be really sure, does the network contain peers other than DKN clients? For example sharing bootstrapping peers would lead to a non-homogeneous network.

We use a discovery service to pick the servers (we call them RPC nodes; see https://mainnet.dkn.dria.co/discovery/v0/available-nodes/0.5), and the compute nodes pick the RPC with the fewest peers to connect to. They dial that RPC explicitly (see here) and then check their connectivity every once in a while. If the connection is closed for some reason, they get another RPC to dial from this service.

Our nodes also handle disconnects with a cause (see here): they re-dial if the connection was closed for some reason. We noticed that when the connection is closed by the server with an explicit disconnect call, the cause in that event is None. This is why the nodes only re-dial if there is Some cause.

Nevertheless, the nodes check the identify events and ensure all protocols are matching, so we can assume all nodes belong to DKN.

All servers basically form a star topology with the clients that they are connected to, two different servers do not talk to each other, and likewise different clients do not interact with each other.

erhant avatar May 30 '25 12:05 erhant

Would love to hear any ideas, really; this error alone is killing our entire network. Approximately 90 out of every 100 requests fail with this EOF error. We can't reproduce it on local machines or dummy machines on AWS.

Anything that comes to your mind about networking stuff that may cause this? We have a lot of people running nodes with various VPS providers like Hetzner, Contabo, etc.; could they require a particular setting?

erhant avatar May 30 '25 19:05 erhant

Anything that comes to your mind about networking stuff that may cause this?

The only thing related to request-response that comes to my mind is #5858. The problem was introduced in libp2p-swarm==0.46.0, and later fixed in libp2p-swarm==0.46.1.

drHuangMHT avatar May 31 '25 10:05 drHuangMHT

The only thing related to request-response that comes to my mind is #5858. The problem was introduced in libp2p-swarm==0.46.0, and later fixed in libp2p-swarm==0.46.1.

Thanks, will check!

Would cbor vs json make any difference btw?

erhant avatar Jun 01 '25 13:06 erhant

Would cbor vs json make any difference btw?

What do you mean by difference? The issue (#5858) is related to substream negotiation; I don't think changing the codec will make any difference. And in general, I personally don't think the codec is involved in this issue (#6038).

drHuangMHT avatar Jun 02 '25 06:06 drHuangMHT

My guess is unrelated to the issue you linked above; instead I was suspicious of this line: https://github.com/libp2p/rust-libp2p/blob/master/protocols/request-response/src/cbor.rs#L129 which then leads to https://github.com/libp2p/rust-libp2p/blob/master/protocols/request-response/src/cbor.rs#L190

erhant avatar Jun 02 '25 07:06 erhant