Intermittent Gossip Data Propagation Issues in libp2p Network (Rust-libp2p 0.54)
Summary
We're experiencing intermittent issues with gossip data propagation in our libp2p network using Rust-libp2p 0.54. The problem occurs on both local development machines and test servers, where nodes sometimes fail to receive gossiped messages despite appearing to be connected.
Symptoms:
- Nodes sometimes fail to receive gossiped data
- Bootstrap connections occasionally fail with HandshakeTimedOut errors
- Gossipsub mesh reports needing more peers even when nodes are connected
- Kademlia bootstrap queries complete but don't always result in stable connections
Configuration:
// Network setup
let mut config = libp2p::quic::Config::new(&keypair.unwrap().clone());
config.max_idle_timeout = 10 * 1000; // 10 seconds
config.keep_alive_interval = Duration::from_secs(5);

// Gossipsub config
let gossipsub_config = gossipsub::ConfigBuilder::default()
    .heartbeat_interval(Duration::from_secs(HEARTBEAT_INTERVAL)) // 5 seconds
    .validation_mode(gossipsub::ValidationMode::Strict)
    .duplicate_cache_time(Duration::from_secs(DUPLICATE_CACHE_DURATION)) // 10 seconds
    .max_transmit_size(1_000_000)
    .message_id_fn(message_id_fn)
    .max_messages_per_rpc(Some(MAX_MESSAGES_PER_RPC)) // 100
    .mesh_n_low(4)
    .mesh_n_high(10)
    .mesh_n(8)
    .build()?;
// Swarm setup
#[tracing::instrument(skip(keypair))]
pub async fn setup_swarm_network(
    keypair: Option<Keypair>,
    bootstrap_addresses: Option<Vec<(PeerId, Multiaddr)>>,
    port: String,
) -> Result<Swarm<SwarmBehaviour>, Box<dyn Error>> {
    // Set up the SwarmBuilder based on whether a keypair is provided or not.
    let builder = if let Some(keypair) = keypair.clone() {
        // Use the provided keypair for the swarm identity.
        SwarmBuilder::with_existing_identity(keypair)
    } else {
        // Generate a new identity if no keypair is provided.
        SwarmBuilder::with_new_identity()
    };

    // NOTE: this `unwrap()` panics when `keypair` is `None`, even though the
    // branch above handles that case.
    let mut config = libp2p::quic::Config::new(&keypair.unwrap().clone());
    // config.max_idle_timeout = 300;
    config.max_idle_timeout = 10 * 1000; // 10 seconds
    // config.keep_alive_interval = Duration::from_millis(100);
    config.keep_alive_interval = Duration::from_secs(5);

    // Build the libp2p swarm with the QUIC transport.
    let mut swarm = builder
        .with_tokio() // Use Tokio for asynchronous execution.
        .with_quic_config(|_| config)
        .with_behaviour(|keypair| {
            // If no bootstrap addresses are provided, log the peer ID for informational purposes.
            if bootstrap_addresses.is_none() {
                info!("Bootstrap Peer ID :{}", keypair.public().to_peer_id());
            }
            // Initialize the custom SwarmBehaviour, which includes the Gossipsub and Kademlia behaviours.
            SwarmBehaviour::new(keypair.clone()).unwrap()
        })?
        .with_swarm_config(|c| {
            // Configure idle connection timeout.
            c.with_idle_connection_timeout(Duration::from_secs(60))
        })
        .build();

    // If bootstrap nodes are provided, add them to the Kademlia behaviour.
    if let Some(ref bootstrap_addresses) = bootstrap_addresses {
        for (peer_id, multi_addr) in bootstrap_addresses {
            // Add each bootstrap node's address to the Kademlia DHT and dial it.
            swarm
                .behaviour_mut()
                .kademlia
                .add_address(peer_id, multi_addr.clone());
            swarm.dial(multi_addr.clone())?;
        }
        // Trigger the Kademlia bootstrap process to find more peers.
        swarm.behaviour_mut().kademlia.bootstrap()?;
    }

    // Subscribe to the primary Gossipsub topic for network-wide communication.
    swarm
        .behaviour_mut()
        .gossipsub
        .subscribe(&IdentTopic::new(NETWORK_TOPIC))?;

    // Define the address to listen on for incoming connections (QUIC over UDP).
    let listen_address = format!("/ip4/0.0.0.0/udp/{}/quic-v1", port);
    swarm.listen_on(listen_address.parse()?)?;

    // Return the initialized swarm.
    Ok(swarm)
}
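A side note on the timeout values: judging by the `// 10 seconds` comment, max_idle_timeout is a plain millisecond count while keep_alive_interval is a Duration, so the units are easy to mix up (the commented-out `300` would then be 300 ms, shorter than either keep-alive value tried). Below is a minimal std-only sketch of a sanity check for that relationship. The helper names `idle_timeout_ms` and `keep_alive_fits` are hypothetical and not part of libp2p.

```rust
use std::time::Duration;

/// Converts a `Duration` into a millisecond count that fits a `u32` field.
/// Returns `None` if the value is too large for a `u32`.
/// (Hypothetical helper, not a libp2p API.)
fn idle_timeout_ms(timeout: Duration) -> Option<u32> {
    let ms = timeout.as_millis();
    if ms <= u128::from(u32::MAX) {
        Some(ms as u32)
    } else {
        None
    }
}

/// A keep-alive can only hold a connection open if it fires before the
/// idle timeout expires. (Hypothetical helper, not a libp2p API.)
fn keep_alive_fits(keep_alive: Duration, idle_timeout_ms: u32) -> bool {
    keep_alive.as_millis() < u128::from(idle_timeout_ms)
}
```

With the values from the report, `keep_alive_fits(Duration::from_secs(5), 10_000)` holds, so the 5 s / 10 s combination is at least consistent. Whether the margin is large enough on a lossy link is a separate question that this check does not answer.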
Logs:
From Node 1 (working):
[TRACE] Sending message to peer 16Uiu2HAmR6ogo4eHfXuz28HNS2XJUGcB1R9Wf4UzHh7go18LQX3v
[TRACE] Sending message to peer 16Uiu2HAmGjjk8mDH5F1Y3FVW68tenMNWMNkTcZXceLMnXUEJDoSx
[TRACE] Sending message to peer 16Uiu2HAmMwshLKvkHnMsgJ5MPxcLeVkkSxRK8Rm6cFRaCCTkhhEd
From Node 2 (failing):
[ERROR] Failed to establish outgoing connection. Connection ID: ConnectionId(8),
Peer ID: Some(PeerId("16Uiu2HAmT4FjyydhhSYgLoGjNJEFGHDexiaH6UxWM1VCW1LT5o1X")),
Error: Transport([(/ip4/127.0.0.1/udp/7070/quic-v1/p2p/16Uiu2HAmT4FjyydhhSYgLoGjNJEFGHDexiaH6UxWM1VCW1LT5o1X,
Other(Custom { kind: Other, error: Other(Right(HandshakeTimedOut)) }))]).
[DEBUG] HEARTBEAT: Mesh low. Topic contains: 0 needs: 4
[DEBUG] RANDOM PEERS: Got 0 peers
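The `needs: 4` in the heartbeat line matches the configured `mesh_n_low(4)`. Assuming the network really has only about 4 nodes (as mentioned further down in the thread), each node can have at most 3 peers, so a "Mesh low" message would be expected on every heartbeat even when all connections are healthy. This is only back-of-the-envelope arithmetic, sketched below with hypothetical helper names. It does not explain `contains: 0`, which points at the failed connection itself.

```rust
/// Largest number of mesh peers a single node can have when the whole
/// network consists of `network_size` nodes and every node is connected
/// to every other node. (Hypothetical helper for illustration.)
fn max_mesh_peers(network_size: usize) -> usize {
    network_size.saturating_sub(1)
}

/// True if a node could ever reach the `mesh_n_low` threshold in a
/// network of the given size. (Hypothetical helper for illustration.)
fn mesh_low_reachable(network_size: usize, mesh_n_low: usize) -> bool {
    max_mesh_peers(network_size) >= mesh_n_low
}
```

For a 4-node network, `mesh_low_reachable(4, 4)` is false, while it would be true with `mesh_n_low` set to 3 or lower. If the testnet stays that small, lowering the mesh parameters might at least remove the noise from the logs.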
Expected behavior
- Stable connections between nodes
- Reliable gossip message propagation
- Healthy mesh network with sufficient peers
Actual behavior
- Intermittent connection failures
- Gossip messages sometimes not received
- Mesh peer count often below configured minimum
Version
0.54
Would you like to work on fixing this bug?
Yes
Sometimes I get this
peer_id: Some(PeerId("16Uiu2HAmT4FjyydhhSYgLoGjNJEFGHDexiaH6UxWM1VCW1LT5o1X"))
[2025-05-25T15:22:35.888Z] TRACE: NODE/31469 on abc.local: [RECV - EVENT] got frame ResetStream(ResetStream { id: StreamId(32), error_code: 0, final_offset: 24 }) (address=/ip4/127.0.0.1/udp/7070/quic-v1,id=0,line=2648,pn=380,space=Data,target=quinn_proto::connection)
file: cargo/registry/src/index.crates.io-1949cf8c6b5b557f/quinn-proto-0.11.9/src/connection/mod.rs
--
Hi, have you tried enabling TCP? If so, do the symptoms persist?
@jxs Yes, we have currently disabled QUIC and enabled TCP. Also, our network is very small, about 4 nodes.
We also spotted this:
May 29 07:53:04 guardian-testnet-2 sh[25552]: {"v":0,"name":"DUCAT_NODE","msg":"[SWARM::POLL - EVENT] Request to peer in query failed with Io(Custom { kind: ConnectionRefused, error: \"protocol not supported\" })","level":20,"hostname":"guardian-testnet-2","pid":25552,"time":"2025-05-29T07:53:04.078781109Z","target":"libp2p_kad::behaviour","line":2358,"file":"/home/admin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/libp2p-kad-0.46.2/src/behaviour.rs","peer":"16Uiu2HAm4QSE1Q6jvtbq4NZjmBdUb4k34KG17vS5bRRfv5ZYtZyE","query":"QueryId(0)"}
I would like to work on this!
@rose2221 @jxs What might be the reason for the message loss? Gossip works for some time, from a few minutes up to 20-30 minutes, and then suddenly stops.
Hi, I really can't tell without more info. Do you have an MRE?
Without having given it much thought, Io(Custom { kind: ConnectionRefused, error: "protocol not supported" }) seems quite odd. If you are trying to open a new Kademlia stream on a connected peer that already has streams, what would make that peer stop supporting Kademlia?
@jxs It's hard to reproduce. We are getting the logs below:
Jun 07 08:21:38 node-testnet-1 bash[92928]: {"v":0,"name":"NODE","msg":"[SWARM::POLL - EVENT] Connection attempt to peer failed with Transport([(/ip4/127.0.0.1/tcp/7070/p2p/16Uiu2HAmNEzpDgsEVjM1eitWvC6hJDmsB8zCNyWEZHxtFpuTzYou, Other(Custom { kind: Other, error: Other(Left(Right(Apply(Io(Custom { kind: InvalidData, error: Input }))))) })), (/ip4/54.144.205.142/tcp/7070/p2p/16Uiu2HAmNEzpDgsEVjM1eitWvC6hJDmsB8zCNyWEZHxtFpuTzYou, Other(Custom { kind: Other, error: Other(Left(Left(Os { code: 111, kind: ConnectionRefused, message: \"Connection refused\" }))) })), (/ip4/172.31.23.245/tcp/7070/p2p/16Uiu2HAmNEzpDgsEVjM1eitWvC6hJDmsB8zCNyWEZHxtFpuTzYou, Other(Custom { kind: Other, error: Other(Left(Left(Os { code: 111, kind: ConnectionRefused, message: \"Connection refused\" }))) }))]).","level":20,"hostname":"guardian-testnet-1","pid":92928,"time":"2025-06-07T08:21:38.133881709Z","target":"libp2p_swarm","line":849,"file":"/home/admin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/libp2p-swarm-0.45.1/src/lib.rs","peer":"16Uiu2HAmNEzpDgsEVjM1eitWvC6hJDmsB8zCNyWEZHxtFpuTzYou"}
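One thing that stands out in the log above: the same peer is dialed on three addresses, a loopback one (127.0.0.1), a public one (54.144.205.142) and a private one (172.31.23.245). From another host, the loopback address points back at the dialing machine, which may be why that attempt fails with InvalidData rather than ConnectionRefused. That is a guess, not something the logs confirm. Below is a minimal std-only sketch of classifying such addresses, for example before adding them to the routing table. `AddrKind` and `classify` are hypothetical names, and extracting the IP from a Multiaddr is left out.

```rust
use std::net::Ipv4Addr;

/// Rough reachability class of an IPv4 address. (Hypothetical type.)
#[derive(Debug, PartialEq)]
enum AddrKind {
    Loopback,
    Private,
    Public,
}

/// Classifies an address using the checks provided by the standard library.
/// Anything that is neither loopback nor RFC 1918 private is treated as
/// public here, which is a simplification.
fn classify(addr: Ipv4Addr) -> AddrKind {
    if addr.is_loopback() {
        AddrKind::Loopback
    } else if addr.is_private() {
        AddrKind::Private
    } else {
        AddrKind::Public
    }
}
```

Applied to the three addresses from the log, this yields Loopback, Public and Private respectively, so only one of the three is likely to be dialable from a different machine outside the peer's private network.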
@vinay10949 Are your nodes behind NAT? If so, is there any relaying or hole punching in place?
@Sansh2356 We have public IPs and are not behind NAT. What we observe is that gossip works for some time, then a node stops gossiping completely and has to be restarted.