webrtc icon indicating copy to clipboard operation
webrtc copied to clipboard

Implement DTLS restart

Open Antonito opened this issue 4 years ago • 7 comments

https://github.com/pion/srtp/pull/148 is needed before being able to merge

~~The unit test needs https://github.com/pion/transport/pull/141~~

Fixes #1636

~~I'm not sure about if t.state != DTLSTransportStateNew && t.state != DTLSTransportStateClosed, one other option would be to set the state to New before restarting, but I don't want to trigger a StateChanged event – which could be wrong.~~ Turns out that's what libwebrtc does

Also the condition on which DTLS is restarted might be too permissive – but it doesn't seem to break any test so far.

Antonito avatar Jun 26 '21 12:06 Antonito

Codecov Report

Merging #1846 (ce2fd9b) into master (1765e9b) will decrease coverage by 6.91%. The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1846      +/-   ##
==========================================
- Coverage   76.89%   69.98%   -6.92%     
==========================================
  Files          87       64      -23     
  Lines        8937     3258    -5679     
==========================================
- Hits         6872     2280    -4592     
+ Misses       1647      860     -787     
+ Partials      418      118     -300     
Flag Coverage Δ
go ?
wasm 69.98% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
constants.go 0.00% <0.00%> (-100.00%) :arrow_down:
icecandidatepair.go 0.00% <0.00%> (-81.82%) :arrow_down:
rtpcodec.go 13.33% <0.00%> (-80.00%) :arrow_down:
stats.go 0.00% <0.00%> (-75.00%) :arrow_down:
track_local.go 0.00% <0.00%> (-66.67%) :arrow_down:
internal/mux/muxfunc.go 9.09% <0.00%> (-63.64%) :arrow_down:
icemux.go 0.00% <0.00%> (-54.55%) :arrow_down:
atomicbool.go 57.14% <0.00%> (-42.86%) :arrow_down:
sdpsemantics.go 0.00% <0.00%> (-34.49%) :arrow_down:
internal/mux/endpoint.go 24.13% <0.00%> (-34.49%) :arrow_down:
... and 36 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 1765e9b...ce2fd9b. Read the comment docs.

codecov[bot] avatar Jun 26 '21 12:06 codecov[bot]

@Antonito This is absolutely amazing work, your PRs are the best I have seen in months. Gives me the energy to get back into it :)

I think we could make the test a lot simpler.

  • Create two PeerConnections and connect them.
  • Send one RTP packet
  • Create a new PeerConnection and generate an offer, and send it to one of the PeerConnections in the first flow.
  • Send one RTP packet

I think that gets us the coverage we need and keeps the LoC low still? Push back if I am wrong!

Sean-Der avatar Jun 27 '21 00:06 Sean-Der

@Antonito This is absolutely amazing work, your PRs are the best I have seen in months. Gives me the energy to get back into it :)

I think we could make the test a lot simpler.

  • Create two PeerConnections and connect them.
  • Send one RTP packet
  • Create a new PeerConnection and generate an offer, and send it to one of the PeerConnections in the first flow.
  • Send one RTP packet

I think that gets us the coverage we need and keeps the LoC low still? Push back if I am wrong!

So, I tried to rewrite the unit test in a simpler way, and it uncovered several problems. I'm not sure if there are caused by the DTLS restart or were already there (I'm thinking the later, for most of them).

  • Sending a single RTP packet doesn't trigger an OnTrack event on the other peer. Sending two in a row doesn't either, but sending 1 / waiting a few ms / sendign another one works.
  • When ran multiple times in a row, the test sometimes logs pc ERROR: 2021/06/28 13:30:33 Incoming unhandled RTP ssrc(1595441956), OnTrack will not be fired. single media section has an explicit SSRC. I initially thought this was a difference in behavior with the test I already pushed (since with your method, the first peer connection doesn't go to disconnected or failed before the other one connects), but turns out it's also happening with this one – I just had lucky runs. It doesn't seem to be a race condition, tsan didn't detect anything.
  • I found out weird behaviors with DataChannels
    • I added a call to CreateDataChannel, behavior unchanged (as expected)
    • I then set an empty OnDataChannel callback -> this causes the test to sometimes hang (again I'm not doing anything more with DataChannels here)
    • I then tried to send data from ClientB and read with ClientA1 and ClientA2, sometime it hangs, but when it doesn't:
      • As expected, ClientA1 can read
      • After ClientA2 connects, ClientB tries to send data – which causes a io: read/write on closed pipe error log
      • ClientA2 can't read, since ClientB can't send anything (this is most likely an issue with DTLS restart)

Antonito avatar Jun 28 '21 11:06 Antonito

I rewrote the unit test without pion/transport, and split it in two very similar tests, for debugging purpose:

  • TestPeerConnection_DTLS_Restart_MediaOnly
  • TestPeerConnection_DTLS_Restart_MediaAndDataChannel

Before merging, we should only keep TestPeerConnection_DTLS_Restart_MediaAndDataChannel.

SCTP restart is now implemented and DataChannels are re-opened, if needed.

There seems to be some kind of race condition, but I really don't think it's related to the stuff this PR modifies – I think the unit tests just show an already existing problem. It's what I was describing above, and now have a bit more validation of what's really happening.

The TestPeerConnection_DTLS_Restart_MediaOnly behaves correctly, in go test. To make it work with tsan, I had to add an extra: https://github.com/pion/webrtc/blob/0d1e50c3d6cc904f52b596f9f8cd6906eb6e10c9/peerconnection_media_test.go#L1329 I think this is a consequence of the first point I was developing in my previous message (about RTP packets).

TestPeerConnection_DTLS_Restart_MediaAndDataChannel behaves correctly with go test, but hangs with tsan. I also tried to add a few extra RTP packets, but it doesn't change any thing. Somehow, the hang seems to be caused by the simple fact of calling CreateDataChannel and OnDataChannel.

Antonito avatar Jun 29 '21 15:06 Antonito

Following up discussion with @davidzhao, I updated both this PR and https://github.com/pion/srtp/pull/148.

In order to get some progress on this, I think we should first focus on merging https://github.com/pion/srtp/pull/148 – we need to figure out @Sean-Der 's concerns about rolloverCounter and lastSequenceNumber. I was careful and really don't think this one introduces any security issue, but wouldn't mind a second opinion :)

As for this PR, it requires a bit more work, there are still issues with TestPeerConnection_DTLS_Restart_MediaAndDataChannel, and I'm far less confident about security issues here. Unfortunately I don't have much time to dedicate to this for now, I'd gladly welcome any help 😁

Antonito avatar Mar 04 '22 07:03 Antonito

Thanks for picking this back up @Antonito! We'd love to collaborate on this. Will have a look in a few weeks.

davidzhao avatar Mar 04 '22 18:03 davidzhao