ice icon indicating copy to clipboard operation
ice copied to clipboard

Close 'closed channel' cause panic

Open lamhai1401 opened this issue 3 years ago • 5 comments

Your environment.

  • Version: pion/ice/[email protected]
  • Browser: Chrome newest version
  • Other Information - using simulcast feature

What did you do?

I'm trying to re-handshake peerconnection between pion and chrome continuously

What happened?

I think the root cause is gatherCandidateDone channel was closed before but still to do it again.

Screen Shot 2022-04-22 at 16 27 26 Screen Shot 2022-04-22 at 16 21 44

lamhai1401 avatar Apr 22 '22 09:04 lamhai1401

Interesting. Race here? https://github.com/pion/ice/blob/master/gather.go#L71

jech avatar Jun 08 '22 13:06 jech

Interesting. Race here? https://github.com/pion/ice/blob/master/gather.go#L71

I think race here.

My hot fix: Screen Shot 2022-06-09 at 10 52 46

Bug happen when i try to peer with ~200 clients, i got this error 2-3 times

lamhai1401 avatar Jun 09 '22 03:06 lamhai1401

@jech May i do this, i'm testing allday and no panic happen.

Screen Shot 2022-06-10 at 16 44 50

lamhai1401 avatar Jun 10 '22 09:06 lamhai1401

I don't think that's correct. If gatherCandidates is being called twice, your code would only close the channel once, leading to a hung goroutine somewhere. We need to understand the cause, not try to paper over a symptom.

jech avatar Jun 10 '22 10:06 jech

Could you please try this? (It's not wholly correct, but it should fix your immediate issue. We really should clean up the locking in gather.go, it's way too fine-grained, and I'm fairly sure it deadlocks at times.)

DO NOT COMMIT This is not correct.

diff --git a/agent.go b/agent.go
index 54728a2..73d9f10 100644
--- a/agent.go
+++ b/agent.go
@@ -113,6 +113,7 @@ type Agent struct {
        taskLoopDone chan struct{}
        err          atomicError
 
+       muGatherCandidate sync.Mutex
        gatherCandidateCancel func()
        gatherCandidateDone   chan struct{}
 
diff --git a/gather.go b/gather.go
index 15536ce..6119160 100644
--- a/gather.go
+++ b/gather.go
@@ -60,20 +60,25 @@ func (a *Agent) GatherCandidates() error {
        var gatherErr error
 
        if runErr := a.run(a.context(), func(ctx context.Context, agent *Agent) {
+               a.muGatherCandidate.Lock()
                if a.gatheringState != GatheringStateNew {
                        gatherErr = ErrMultipleGatherAttempted
+                       a.muGatherCandidate.Unlock()
                        return
                } else if a.onCandidateHdlr.Load() == nil {
                        gatherErr = ErrNoOnCandidateHandler
+                       a.muGatherCandidate.Unlock()
                        return
                }
 
-               a.gatherCandidateCancel() // Cancel previous gathering routine
-               ctx, cancel := context.WithCancel(ctx)
-               a.gatherCandidateCancel = cancel
-               a.gatherCandidateDone = make(chan struct{})
-
-               go a.gatherCandidates(ctx)
+               go func() {
+                       defer a.muGatherCandidate.Unlock()
+                       a.gatherCandidateCancel()
+                       ctx, cancel := context.WithCancel(ctx)
+                       a.gatherCandidateCancel = cancel
+                       a.gatherCandidateDone = make(chan struct{})
+                       a.gatherCandidates(ctx)
+               }()
        }); runErr != nil {
                return runErr
        }
``

jech avatar Jun 10 '22 10:06 jech

I am facing this very rare panic. Going to put the text here so that it's indexable for search:

panic: close of closed channel
goroutine 47164 [running]:
github.com/pion/ice/v2.(*Agent).gatherCandidates(0xc00153cf00, {0x14740a8?, 0xc002e8d180})
	/go/pkg/mod/github.com/pion/ice/[email protected]/gather.go:112 +0x58e
created by github.com/pion/ice/v2.(*Agent).GatherCandidates.func1
	/go/pkg/mod/github.com/pion/ice/[email protected]/gather.go:56 +0x1db

My assumption is, that canceling previous gathering does not happen synchronously. https://github.com/pion/ice/blob/be72bc709bb290aba9bdc842988d2f893300d967/gather.go#L51-L54

If it takes a little bit longer, it can close newly created gatherCandidateDone channel. https://github.com/pion/ice/blob/be72bc709bb290aba9bdc842988d2f893300d967/gather.go#L64

We need to make sure previous gathering completed before starting new.

m1k1o avatar Mar 27 '23 18:03 m1k1o