webrtc
webrtc copied to clipboard
Implement DTLS restart
https://github.com/pion/srtp/pull/148 is needed before being able to merge
~~The unit test needs https://github.com/pion/transport/pull/141~~
Fixes #1636
~~I'm not sure about if t.state != DTLSTransportStateNew && t.state != DTLSTransportStateClosed, one other option would be to set the state to New before restarting, but I don't want to trigger a StateChanged event – which could be wrong.~~
Turns out that's what libwebrtc does
Also the condition on which DTLS is restarted might be too permissive – but it doesn't seem to break any test so far.
Codecov Report
Merging #1846 (ce2fd9b) into master (1765e9b) will decrease coverage by
6.91%. The diff coverage isn/a.
@@ Coverage Diff @@
## master #1846 +/- ##
==========================================
- Coverage 76.89% 69.98% -6.92%
==========================================
Files 87 64 -23
Lines 8937 3258 -5679
==========================================
- Hits 6872 2280 -4592
+ Misses 1647 860 -787
+ Partials 418 118 -300
| Flag | Coverage Δ | |
|---|---|---|
| go | ? |
|
| wasm | 69.98% <ø> (ø) |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Impacted Files | Coverage Δ | |
|---|---|---|
| constants.go | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| icecandidatepair.go | 0.00% <0.00%> (-81.82%) |
:arrow_down: |
| rtpcodec.go | 13.33% <0.00%> (-80.00%) |
:arrow_down: |
| stats.go | 0.00% <0.00%> (-75.00%) |
:arrow_down: |
| track_local.go | 0.00% <0.00%> (-66.67%) |
:arrow_down: |
| internal/mux/muxfunc.go | 9.09% <0.00%> (-63.64%) |
:arrow_down: |
| icemux.go | 0.00% <0.00%> (-54.55%) |
:arrow_down: |
| atomicbool.go | 57.14% <0.00%> (-42.86%) |
:arrow_down: |
| sdpsemantics.go | 0.00% <0.00%> (-34.49%) |
:arrow_down: |
| internal/mux/endpoint.go | 24.13% <0.00%> (-34.49%) |
:arrow_down: |
| ... and 36 more |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing dataPowered by Codecov. Last update 1765e9b...ce2fd9b. Read the comment docs.
@Antonito This is absolutely amazing work, your PRs are the best I have seen in months. Gives me the energy to get back into it :)
I think we could make the test a lot simpler.
- Create two PeerConnections and connect them.
- Send one RTP packet
- Create a new PeerConnection and generate an offer, and send it to one of the PeerConnections in the first flow.
- Send one RTP packet
I think that gets us the coverage we need and keeps the LoC low still? Push back if I am wrong!
@Antonito This is absolutely amazing work, your PRs are the best I have seen in months. Gives me the energy to get back into it :)
I think we could make the test a lot simpler.
- Create two PeerConnections and connect them.
- Send one RTP packet
- Create a new PeerConnection and generate an offer, and send it to one of the PeerConnections in the first flow.
- Send one RTP packet
I think that gets us the coverage we need and keeps the LoC low still? Push back if I am wrong!
So, I tried to rewrite the unit test in a simpler way, and it uncovered several problems. I'm not sure if there are caused by the DTLS restart or were already there (I'm thinking the later, for most of them).
- Sending a single RTP packet doesn't trigger an
OnTrackevent on the other peer. Sending two in a row doesn't either, but sending 1 / waiting a few ms / sendign another one works. - When ran multiple times in a row, the test sometimes logs
pc ERROR: 2021/06/28 13:30:33 Incoming unhandled RTP ssrc(1595441956), OnTrack will not be fired. single media section has an explicit SSRC. I initially thought this was a difference in behavior with the test I already pushed (since with your method, the first peer connection doesn't go todisconnectedorfailedbefore the other one connects), but turns out it's also happening with this one – I just had lucky runs. It doesn't seem to be a race condition, tsan didn't detect anything. - I found out weird behaviors with DataChannels
- I added a call to
CreateDataChannel, behavior unchanged (as expected) - I then set an empty
OnDataChannelcallback -> this causes the test to sometimes hang (again I'm not doing anything more with DataChannels here) - I then tried to send data from
ClientBand read withClientA1andClientA2, sometime it hangs, but when it doesn't:- As expected, ClientA1 can read
- After ClientA2 connects, ClientB tries to send data – which causes a
io: read/write on closed pipeerror log - ClientA2 can't read, since ClientB can't send anything (this is most likely an issue with DTLS restart)
- I added a call to
I rewrote the unit test without pion/transport, and split it in two very similar tests, for debugging purpose:
- TestPeerConnection_DTLS_Restart_MediaOnly
- TestPeerConnection_DTLS_Restart_MediaAndDataChannel
Before merging, we should only keep TestPeerConnection_DTLS_Restart_MediaAndDataChannel.
SCTP restart is now implemented and DataChannels are re-opened, if needed.
There seems to be some kind of race condition, but I really don't think it's related to the stuff this PR modifies – I think the unit tests just show an already existing problem. It's what I was describing above, and now have a bit more validation of what's really happening.
The TestPeerConnection_DTLS_Restart_MediaOnly behaves correctly, in go test. To make it work with tsan, I had to add an extra: https://github.com/pion/webrtc/blob/0d1e50c3d6cc904f52b596f9f8cd6906eb6e10c9/peerconnection_media_test.go#L1329
I think this is a consequence of the first point I was developing in my previous message (about RTP packets).
TestPeerConnection_DTLS_Restart_MediaAndDataChannel behaves correctly with go test, but hangs with tsan. I also tried to add a few extra RTP packets, but it doesn't change any thing. Somehow, the hang seems to be caused by the simple fact of calling CreateDataChannel and OnDataChannel.
Following up discussion with @davidzhao, I updated both this PR and https://github.com/pion/srtp/pull/148.
In order to get some progress on this, I think we should first focus on merging https://github.com/pion/srtp/pull/148 – we need to figure out @Sean-Der 's concerns about rolloverCounter and lastSequenceNumber. I was careful and really don't think this one introduces any security issue, but wouldn't mind a second opinion :)
As for this PR, it requires a bit more work, there are still issues with TestPeerConnection_DTLS_Restart_MediaAndDataChannel, and I'm far less confident about security issues here. Unfortunately I don't have much time to dedicate to this for now, I'd gladly welcome any help 😁
Thanks for picking this back up @Antonito! We'd love to collaborate on this. Will have a look in a few weeks.