webrtc: potential problem with the Pion state machine
We might be using the Pion state machine incorrectly:
=== RUN TestTransportWebRTC_Deadline/SetWriteDeadline
transport_test.go:401:
Error Trace: /home/runner/work/go-libp2p/go-libp2p/p2p/transport/webrtc/transport_test.go:401
Error: Received unexpected error:
failed to secure outbound connection: error reading handshake message: sending reset packet in non-established state: state=Closed
Test: TestTransportWebRTC_Deadline/SetWriteDeadline
--- FAIL: TestTransportWebRTC_Deadline (20.14s)
--- PASS: TestTransportWebRTC_Deadline/SetReadDeadline (0.13s)
--- FAIL: TestTransportWebRTC_Deadline/SetWriteDeadline (20.00s)
There's one other bug which causes failure in connection establishment:
2023-10-22T06:47:19.9957753Z === RUN TestTransportWebRTC_DialerCanCreateStreamsMultiple
2023-10-22T06:47:19.9959034Z panic: test timed out after 10m0s
2023-10-22T06:47:19.9959911Z running tests:
2023-10-22T06:47:19.9960992Z TestTransportWebRTC_DialerCanCreateStreamsMultiple (9m58s)
https://github.com/libp2p/go-libp2p/actions/runs/6602211220/job/17934010538
Both these errors haven't happened since merging #2717
The first errors was fixed by fixing the tests. When one side closes the connection the other side sees this line when it does a stream Close.
sending reset packet in non-established state: state=Closed
Both these errors haven't happened since merging #2717
The first errors was fixed by fixing the tests. When one side closes the connection the other side sees this line when it does a stream Close.
sending reset packet in non-established state: state=Closed
How was the other error fixed?
I am not convinced it is fixed completely. We just find it flaking less for some reason.
One of the reasons why that error occurs is this https://github.com/libp2p/go-libp2p/actions/runs/8125238982/job/22214622459?pr=2722
This will be fixed when pion/sctp does a new release. This was fixed by: https://github.com/pion/sctp/pull/301
The other reason why it was flaking was an error in our udpmux code. That was fixed by: https://github.com/libp2p/go-libp2p/pull/2586
Again, not convinced it is fixed completely by these two. There might be some race condition in the conn establishment step that we are missing. One of the reasons for getting https://github.com/libp2p/go-libp2p/pull/2717 was that in the logs after failure there were so many go routines that this was impossible to debug. Hopefully now we can better identify the cause.
For the failures after 10m we see. I have explained the reasoning in the PR. My current theory is. The problem is with connection state change notification being reordered. Pion just runs them in a separate goroutine without waiting for it to finish. So here the listener thinks that connection isn't established, while it actually has established.
I've opened a PR in pion/webrtc that'd close this: https://github.com/pion/webrtc/pull/2702/files
The same error in updating ice connection states: https://github.com/pion/ice/pull/656/files
fixed by #2801 and other webrtc fixes.