Intermittent Gossip Data Propagation Issues in libp2p Network (Rust-libp2p 0.54)

Open vinay10949 opened this issue 6 months ago • 9 comments

Summary

We're experiencing intermittent issues with gossip data propagation in our libp2p network using Rust-libp2p 0.54. The problem occurs on both local development machines and test servers, where nodes sometimes fail to receive gossiped messages despite appearing to be connected.

Symptoms:

  1. Nodes sometimes fail to receive gossiped data
  2. Bootstrap connections occasionally fail with HandshakeTimedOut errors
  3. Gossipsub mesh reports needing more peers even when nodes are connected (see the mesh check after this list)
  4. Kademlia bootstrap queries complete but don't always result in stable connections
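
To illustrate symptom 3, mesh membership can be compared against connected peers with a check along these lines (simplified sketch against the 0.54 API; NETWORK_TOPIC and the gossipsub behaviour are the ones from the setup below):

// Sketch: log gossipsub mesh size vs. connected peers for the network topic.
let topic = IdentTopic::new(NETWORK_TOPIC);
let mesh_peers = swarm
    .behaviour()
    .gossipsub
    .mesh_peers(&topic.hash())
    .count();
let connected = swarm.connected_peers().count();
info!("mesh peers: {mesh_peers} / connected peers: {connected}");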

Configuration:

// Network setup
let mut config = libp2p::quic::Config::new(&keypair);
config.max_idle_timeout = 10 * 1000; // 10 seconds (value is in milliseconds)
config.keep_alive_interval = Duration::from_secs(5);

// Gossipsub config
let gossipsub_config = gossipsub::ConfigBuilder::default()
    .heartbeat_interval(Duration::from_secs(HEARTBEAT_INTERVAL)) // 5 seconds
    .validation_mode(gossipsub::ValidationMode::Strict)
    .duplicate_cache_time(Duration::from_secs(DUPLICATE_CACHE_DURATION)) // 10 seconds
    .max_transmit_size(1_000_000)
    .message_id_fn(message_id_fn)
    .max_messages_per_rpc(Some(MAX_MESSAGES_PER_RPC)) // 100
    .mesh_n_low(4)
    .mesh_n_high(10)
    .mesh_n(8)
    .build()?;
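
For context, a message_id_fn of the usual content-hash form looks like the sketch below (our actual function is only referenced by name above):

// Sketch: derive the gossipsub message ID from a hash of the message payload.
let message_id_fn = |message: &gossipsub::Message| {
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};
    let mut hasher = DefaultHasher::new();
    message.data.hash(&mut hasher);
    gossipsub::MessageId::from(hasher.finish().to_string())
};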


// Swarm setup

#[tracing::instrument(skip(keypair))]
pub async fn setup_swarm_network(
    keypair: Option<Keypair>,
    bootstrap_addresses: Option<Vec<(PeerId, Multiaddr)>>,
    port: String,
) -> Result<Swarm<SwarmBehaviour>, Box<dyn Error>> {
    // Set up the SwarmBuilder based on whether a keypair is provided or not.
    // Use the provided keypair for the swarm identity, or generate a new one
    // if none is provided (equivalent to SwarmBuilder::with_new_identity).
    let keypair = keypair.unwrap_or_else(Keypair::generate_ed25519);
    let builder = SwarmBuilder::with_existing_identity(keypair.clone());

    // QUIC transport configuration.
    let mut config = libp2p::quic::Config::new(&keypair);
    config.max_idle_timeout = 10 * 1000; // in milliseconds, i.e. 10 seconds (previously tried 300)
    config.keep_alive_interval = Duration::from_secs(5); // previously tried Duration::from_millis(100)
    // Build the libp2p swarm with a specific transport (TCP and QUIC), and relay client.
    let mut swarm = builder
        .with_tokio() // Use Tokio for asynchronous execution.
        .with_quic_config(|_| config)
        .with_behaviour(|keypair| {
            // If no bootstrap addresses are provided, print the peer ID for informational purposes.
            if bootstrap_addresses.is_none() {
                info!("Bootstrap Peer ID :{}", keypair.public().to_peer_id());
            }
            // Initialize the custom MyBehaviour which includes Gossipsub and Kademlia behaviors.
            SwarmBehaviour::new(keypair.clone()).unwrap()
        })?
        .with_swarm_config(|c| {
            // Configure idle connection timeout.
            c.with_idle_connection_timeout(Duration::from_secs(60))
        })
        .build();

    // If bootstrap nodes are provided, add them to the Kademlia behavior.
    if let Some(ref bootstrap_addresses) = bootstrap_addresses {
        for (peer_id, multi_addr) in bootstrap_addresses {
            // Add each bootstrap node's address to the Kademlia DHT.
            swarm
                .behaviour_mut()
                .kademlia
                .add_address(peer_id, multi_addr.clone());
            swarm.dial(multi_addr.clone())?;
        }
        // Trigger the Kademlia bootstrap process to find more peers.
        swarm.behaviour_mut().kademlia.bootstrap()?;
    }

    // Subscribe to the primary Gossipsub topic for network-wide communication.
    swarm
        .behaviour_mut()
        .gossipsub
        .subscribe(&IdentTopic::new(NETWORK_TOPIC))?;

    // Define the address to listen on for incoming connections (QUIC over UDP).
    let listen_address = format!("/ip4/0.0.0.0/udp/{}/quic-v1", port);
    swarm.listen_on(listen_address.parse()?)?;

    // Return the initialized swarm.
    Ok(swarm)
}
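
After setup, the swarm is driven by a poll loop; a simplified sketch is below (business logic and error handling omitted; SwarmBehaviourEvent is assumed to be the default event enum derived from SwarmBehaviour):

// Simplified poll loop driving the swarm; real handling is more involved.
use futures::StreamExt;
use libp2p::swarm::SwarmEvent;

loop {
    match swarm.select_next_some().await {
        // A gossipsub message was delivered to this node.
        SwarmEvent::Behaviour(SwarmBehaviourEvent::Gossipsub(gossipsub::Event::Message {
            propagation_source,
            message,
            ..
        })) => {
            info!("Received {} bytes from {propagation_source}", message.data.len());
        }
        // A connection dropped; the cause is logged to correlate with gossip stalls.
        SwarmEvent::ConnectionClosed { peer_id, cause, .. } => {
            info!("Connection to {peer_id} closed: {cause:?}");
        }
        event => {
            trace!("[SWARM::POLL - EVENT] {event:?}");
        }
    }
}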

Logs: From Node 1 (working):

[TRACE] Sending message to peer 16Uiu2HAmR6ogo4eHfXuz28HNS2XJUGcB1R9Wf4UzHh7go18LQX3v
[TRACE] Sending message to peer 16Uiu2HAmGjjk8mDH5F1Y3FVW68tenMNWMNkTcZXceLMnXUEJDoSx
[TRACE] Sending message to peer 16Uiu2HAmMwshLKvkHnMsgJ5MPxcLeVkkSxRK8Rm6cFRaCCTkhhEd

From Node 2 (failing):

[ERROR] Failed to establish outgoing connection. Connection ID: ConnectionId(8), 
Peer ID: Some(PeerId("16Uiu2HAmT4FjyydhhSYgLoGjNJEFGHDexiaH6UxWM1VCW1LT5o1X")), 
Error: Transport([(/ip4/127.0.0.1/udp/7070/quic-v1/p2p/16Uiu2HAmT4FjyydhhSYgLoGjNJEFGHDexiaH6UxWM1VCW1LT5o1X, 
Other(Custom { kind: Other, error: Other(Right(HandshakeTimedOut)) }))]).

[DEBUG] HEARTBEAT: Mesh low. Topic contains: 0 needs: 4
[DEBUG] RANDOM PEERS: Got 0 peers

Expected behavior

  1. Stable connections between nodes
  2. Reliable gossip message propagation
  3. Healthy mesh network with sufficient peers

Actual behavior

  1. Intermittent connection failures
  2. Gossip messages sometimes not received
  3. Mesh peer count often below configured minimum

Attached logs: 1.log and 2.log

Relevant log output


Possible Solution

No response

Version

0.54

Would you like to work on fixing this bug?

Yes

vinay10949 avatar May 25 '25 08:05 vinay10949

Sometimes I get this

peer_id: Some(PeerId("16Uiu2HAmT4FjyydhhSYgLoGjNJEFGHDexiaH6UxWM1VCW1LT5o1X"))
[2025-05-25T15:22:35.888Z] TRACE: NODE/31469 on abc.local: [RECV - EVENT] got frame ResetStream(ResetStream { id: StreamId(32), error_code: 0, final_offset: 24 }) (address=/ip4/127.0.0.1/udp/7070/quic-v1,id=0,line=2648,pn=380,space=Data,target=quinn_proto::connection)
    file: cargo/registry/src/index.crates.io-1949cf8c6b5b557f/quinn-proto-0.11.9/src/connection/mod.rs
    --

vinay10949 avatar May 25 '25 17:05 vinay10949

Hi, have you tried enabling TCP? If so, do the symptoms persist?

jxs avatar May 29 '25 09:05 jxs

@jxs Yes, we have currently disabled QUIC and enabled TCP (roughly as sketched below). Our network is also very small, around 4 nodes.
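
For reference, the TCP path just replaces with_quic_config in the builder, roughly as in this sketch (noise + yamux assumed, and the listen address becomes /ip4/0.0.0.0/tcp/{port}):

// Sketch: TCP transport with noise encryption and yamux multiplexing
// (requires the "tcp", "noise" and "yamux" features).
let mut swarm = builder
    .with_tokio()
    .with_tcp(
        tcp::Config::default(),
        noise::Config::new,
        yamux::Config::default,
    )?
    .with_behaviour(|keypair| SwarmBehaviour::new(keypair.clone()).unwrap())?
    .with_swarm_config(|c| c.with_idle_connection_timeout(Duration::from_secs(60)))
    .build();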

We also spotted this:

May 29 07:53:04 guardian-testnet-2 sh[25552]: {"v":0,"name":"DUCAT_NODE","msg":"[SWARM::POLL - EVENT] Request to peer in query failed with Io(Custom { kind: ConnectionRefused, error: \"protocol not supported\" })","level":20,"hostname":"guardian-testnet-2","pid":25552,"time":"2025-05-29T07:53:04.078781109Z","target":"libp2p_kad::behaviour","line":2358,"file":"/home/admin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/libp2p-kad-0.46.2/src/behaviour.rs","peer":"16Uiu2HAm4QSE1Q6jvtbq4NZjmBdUb4k34KG17vS5bRRfv5ZYtZyE","query":"QueryId(0)"}

vinay10949 avatar May 29 '25 11:05 vinay10949

would like to work on this!

rose2221 avatar May 29 '25 19:05 rose2221

@rose2221 @jxs What could be the reason for the message loss? Gossip succeeds for some time, possibly a few minutes or even 20-30 minutes, and then suddenly stops.

vinay10949 avatar Jun 06 '25 14:06 vinay10949

Hi, I really can't tell without more info, do you have an MRE? But without giving it much thought, Io(Custom { kind: ConnectionRefused, error: \"protocol not supported\" }) seems quite odd: if you are opening a new Kademlia stream on an already-connected peer with open streams, what would make that peer stop supporting Kademlia?

jxs avatar Jun 06 '25 15:06 jxs

@jxs It's hard to reproduce. We are getting the logs below:

Jun 07 08:21:38 node-testnet-1 bash[92928]: {"v":0,"name":"NODE","msg":"[SWARM::POLL - EVENT] Connection attempt to peer failed with Transport([(/ip4/127.0.0.1/tcp/7070/p2p/16Uiu2HAmNEzpDgsEVjM1eitWvC6hJDmsB8zCNyWEZHxtFpuTzYou, Other(Custom { kind: Other, error: Other(Left(Right(Apply(Io(Custom { kind: InvalidData, error: Input }))))) })), (/ip4/54.144.205.142/tcp/7070/p2p/16Uiu2HAmNEzpDgsEVjM1eitWvC6hJDmsB8zCNyWEZHxtFpuTzYou, Other(Custom { kind: Other, error: Other(Left(Left(Os { code: 111, kind: ConnectionRefused, message: \"Connection refused\" }))) })), (/ip4/172.31.23.245/tcp/7070/p2p/16Uiu2HAmNEzpDgsEVjM1eitWvC6hJDmsB8zCNyWEZHxtFpuTzYou, Other(Custom { kind: Other, error: Other(Left(Left(Os { code: 111, kind: ConnectionRefused, message: \"Connection refused\" }))) }))]).","level":20,"hostname":"guardian-testnet-1","pid":92928,"time":"2025-06-07T08:21:38.133881709Z","target":"libp2p_swarm","line":849,"file":"/home/admin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/libp2p-swarm-0.45.1/src/lib.rs","peer":"16Uiu2HAmNEzpDgsEVjM1eitWvC6hJDmsB8zCNyWEZHxtFpuTzYou"}

vinay10949 avatar Jun 07 '25 09:06 vinay10949

@vinay10949 Are your nodes behind NAT? If yes, is there any relaying or hole punching in between?

Sansh2356 avatar Jun 17 '25 18:06 Sansh2356

@Sansh2356 We have public IPs and are not behind NAT. The observation is that gossip works for some time, then the node completely stops gossiping and a restart is needed.

vinay10949 avatar Jun 17 '25 19:06 vinay10949