go-ethereum icon indicating copy to clipboard operation
go-ethereum copied to clipboard

p2p: race condition in dialer scheduler

Open fxfactorial opened this issue 4 years ago • 6 comments

WARNING: DATA RACE                                                                                                    
Read at 0x00c03c44d2a0 by goroutine 152:                                                                              
  github.com/ethereum/go-ethereum/p2p.(*dialScheduler).updateStaticPool()                                             
      github.com/ethereum/go-ethereum/p2p/dial.go:433 +0x2e4                                                          
  github.com/ethereum/go-ethereum/p2p.(*dialScheduler).expireHistory.func1()                                          
      github.com/ethereum/go-ethereum/p2p/dial.go:378 +0xcf                                                           
  github.com/ethereum/go-ethereum/p2p.(*expHeap).expire()                                                             
      github.com/ethereum/go-ethereum/p2p/util.go:59 +0x10a 
  github.com/ethereum/go-ethereum/p2p.(*dialScheduler).expireHistory()
      github.com/ethereum/go-ethereum/p2p/dial.go:375 +0x124
  github.com/ethereum/go-ethereum/p2p.(*dialScheduler).loop()
      github.com/ethereum/go-ethereum/p2p/dial.go:308 +0x239a
  github.com/ethereum/go-ethereum/p2p.newDialScheduler.func2()
      github.com/ethereum/go-ethereum/p2p/dial.go:186 +0x53 
  github.com/panjf2000/ants/v2.(*goWorker).run.func1()
      github.com/panjf2000/ants/[email protected]/worker.go:70 +0xd9

Previous write at 0x00c03c44d2a0 by goroutine 1038:
  github.com/ethereum/go-ethereum/p2p.(*dialTask).resolve() 
      github.com/ethereum/go-ethereum/p2p/dial.go:537 +0x204
  github.com/ethereum/go-ethereum/p2p.(*dialTask).run()
      github.com/ethereum/go-ethereum/p2p/dial.go:498 +0xf9 
  github.com/ethereum/go-ethereum/p2p.(*dialScheduler).startDial.func1()
      github.com/ethereum/go-ethereum/p2p/dial.go:465 +0x58 
  github.com/panjf2000/ants/v2.(*goWorker).run.func1()
      github.com/panjf2000/ants/[email protected]/worker.go:70 +0xd9


off of latest master -

fxfactorial avatar Aug 22 '21 16:08 fxfactorial

Hmm that is not latest master afaict. Could you link the commit you're using?

MariusVanDerWijden avatar Aug 23 '21 06:08 MariusVanDerWijden

I fixed some races here: https://github.com/ethereum/go-ethereum/pull/23434 Don't think it's the one you found though

MariusVanDerWijden avatar Aug 23 '21 07:08 MariusVanDerWijden

The race in this issue is on t.dest. When resolve() runs, it may update t.dest. Meanwhile, on the dialer loop, we run updateStaticPool when a peer is removed, and this reads t.dest.

fjl avatar Aug 23 '21 10:08 fjl

My bad , was latest master off bsc and should have written the description @fjl wrote , yes that’s the problem -

fxfactorial avatar Aug 23 '21 13:08 fxfactorial

I'm actually not sure how to fix this race correctly. The check on t.dest in updateStaticPool seems off to me, maybe that should just be removed.

fjl avatar Aug 24 '21 12:08 fjl

Agree - it wasn’t clear to me how to fix it immediately either , and don’t want locks for just this - if node being removed anyway then .dest being updated shouldn’t be needed then

fxfactorial avatar Aug 24 '21 13:08 fxfactorial

This issue has become quite stale now. I can't reproduce the race condition anymore. Feel free to reopen if this is still an issue with the current master.

MariusVanDerWijden avatar Mar 13 '23 12:03 MariusVanDerWijden

I'm sure this race is not fixed. It's just hard to reproduce because it requires resolving the node, which takes variable time.

fjl avatar Mar 13 '23 13:03 fjl

@fjl how about solving it by just making it an atomic.Pointer? Like this: https://github.com/holiman/go-ethereum/commit/79b76150f0bbef7faa4e45b299ada41bb792b5d8

holiman avatar Jun 14 '23 09:06 holiman

Yeah I guess this would be a possible fix. Though I think there must be a better way to fix it.

The main issue is really just that the main loop wants access to the node ID, and it reads that through .dest.ID(). So another fix would be to store a separate, immutable reference to the original non-resolved *enode.Node and use that to read the ID.

fjl avatar Jun 14 '23 09:06 fjl