op-node v1.13.3: thousands of stuck goroutines in p2p.SyncClient / libp2p

Open artemrootman opened this issue 7 months ago • 1 comments

Bug Description

After several hours of uptime, op-node (v1.13.3) accumulates thousands of blocked goroutines, mostly in the select or IO wait state or inside yamux session handlers. Sync then stalls: the node still appears connected to its peers but stops making progress with them. We suspect an issue with how libp2p/yamux sessions are handled or cleaned up.

Steps to Reproduce

  1. Start op-node v1.13.3 using the official image us-docker.pkg.dev/oplabs-tools-artifacts/images/op-node:v1.13.3 (based on Alpine Linux 3.20).
  2. Let it run for 6–12 hours in a production setup with inbound/outbound peer traffic.
  3. Inspect goroutines using kill -ABRT $(pidof op-node) (the Go runtime exits with a stack dump on SIGABRT, so this stops the node) or pprof.
  4. Observe thousands of goroutines in a stuck state, many coming from libp2p, yamux, or p2p.(*SyncClient).peerLoop.

Expected behavior

Goroutines should terminate or be cleaned up if the stream or peer becomes inactive or broken. Instead, they accumulate indefinitely, consuming resources and blocking new peer interactions.

Environment Information:

  • Operating System: Alpine Linux 3.20 (via Docker)
  • Container image: us-docker.pkg.dev/oplabs-tools-artifacts/images/op-node:v1.13.3
  • Package Version: op-node v1.13.3
  • go-libp2p: v0.36.2
  • go-yamux: v4.0.1
  • go-libp2p-pubsub: v0.12.0
  • CPU: 32-core
  • RAM: 128 GB
  • Disk: NVMe SSD
  • Load: <30% CPU, <40% RAM, disk idle

Configurations:

Environment variables and CLI:

OP_NODE__L2_ENGINE_AUTH_FILE=/jwtsecret
OP_NODE__L2_ENGINE_RPC=http://localhost:8551
OP_NODE__L1=<REDACTED>
OP_NODE__L1_BEACON=<REDACTED>
OP_NODE__RPC_ADDR=0.0.0.0
OP_NODE__RPC_PORT=8547
OP_NODE__METRICS_ENABLED=true
OP_NODE__P2P_ENABLED=true
OP_NODE__P2P_PRIV_PATH=/p2p-node-key.txt
OP_NODE__P2P_LISTEN_IP=0.0.0.0
OP_NODE__P2P_TCP_PORT=30303
OP_NODE__P2P_UDP_PORT=30303

Logs:

Truncated example of goroutines:

goroutine 3412893 [select]:
github.com/libp2p/go-yamux/v4.(*Stream).Read(0xc012cb2380, ...)
  stream.go:111
...
github.com/libp2p/go-libp2p-pubsub.(*PubSub).handleNewStream
  comm.go:66

goroutine 4125262 [select, 2612 minutes]:
github.com/ethereum-optimism/optimism/op-node/p2p.(*SyncClient).peerLoop
  sync.go:589

goroutine 4646886 [select, 325 minutes]:
github.com/libp2p/go-yamux/v4.(*Stream).Read
  stream.go:111
...
github.com/libp2p/go-libp2p-pubsub.(*PubSub).handlePeerDead
  comm.go:150

Additional context

  • Problem persists across restarts.
  • System resources are not saturated.
  • We suspect either op-node does not cancel dead peer sessions properly, or libp2p/yamux streams are not cleaned up under some edge condition.
  • This causes sync issues and degraded networking performance over time.

artemrootman, Jun 09 '25 08:06

Same here, also on op-node v1.13.3

goroutine 1357345 [select, 3406 minutes]:
github.com/libp2p/go-yamux/v4.(*Stream).Read(0xc00f4f9a40, {0xc0254d7e74, 0x1, 0x1})
        /go/pkg/mod/github.com/libp2p/go-yamux/[email protected]/stream.go:111 +0x1a5
github.com/libp2p/go-libp2p/p2p/muxer/yamux.(*stream).Read(0x48b4ac?, {0xc0254d7e74?, 0x100c0028cae58?, 0xc0028cae60?})
        /go/pkg/mod/github.com/libp2p/[email protected]/p2p/muxer/yamux/stream.go:17 +0x18
github.com/libp2p/go-libp2p/p2p/net/swarm.(*Stream).Read(0xc02498fa80, {0xc0254d7e74?, 0x1000000001c?, 0xc0028caf10?})
        /go/pkg/mod/github.com/libp2p/[email protected]/p2p/net/swarm/swarm_stream.go:58 +0x2d
github.com/multiformats/go-multistream.(*lazyClientConn[...]).Read(0xc000100008?, {0xc0254d7e74?, 0x1?, 0x1?})
        /go/pkg/mod/github.com/multiformats/[email protected]/lazyClient.go:68 +0x98
github.com/libp2p/go-libp2p/p2p/host/basic.(*streamWrapper).Read(0x222e700?, {0xc0254d7e74?, 0x0?, 0x0?})
        /go/pkg/mod/github.com/libp2p/[email protected]/p2p/host/basic/basic_host.go:1108 +0x22
github.com/libp2p/go-libp2p-pubsub.(*PubSub).handlePeerDead(0xc00131d8c8, {0x225a8b0, 0xc02395c500})
        /go/pkg/mod/github.com/libp2p/[email protected]/comm.go:150 +0x73
created by github.com/libp2p/go-libp2p-pubsub.(*PubSub).handleNewPeer in goroutine 1357295

goroutine 943461 [IO wait]:
internal/poll.runtime_pollWait(0x7bc6cccc8ad8, 0x72)
        /usr/local/go/src/runtime/netpoll.go:351 +0x85
internal/poll.(*pollDesc).wait(0xc0183a3880?, 0xc019096000?, 0x0)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)

Creamers158, Jun 17 '25 10:06

Same here, also on op-node v1.13.3. I haven't tried v1.13.4 yet.

rrrengineer, Jun 24 '25 19:06