ai/live: Terminate stream on ICE disconnect.

Open j0sh opened this issue 8 months ago • 3 comments

While the disconnected state is not necessarily terminal (media may start flowing again before the ICE timeout [1]), this happens rarely enough [2] that we can just kill the peerconnection on disconnect and avoid other timeouts later on in the process, e.g. segment copy.

[1] Easy way to test: plug in an Ethernet cable, disable WiFi, unplug the cable, wait a little more than 5 seconds, re-plug. Things should be fine in this case.

[2] Three occurrences of the state sequence "disconnected -> connected" in prod and 1 on staging, both in the past 30 days.
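
A minimal sketch of the policy described above, using only the standard library. The `ICEState` type, its constants, and the function names are hypothetical stand-ins for illustration; they are not the go-livepeer code or any WebRTC library's API. It shows the decision to treat `disconnected` as terminal, plus a helper that counts the "disconnected -> connected" recoveries such a policy would cut short.

```go
package main

import "fmt"

// ICEState is a hypothetical stand-in for a WebRTC library's
// ICE connection state type.
type ICEState string

const (
	Connected    ICEState = "connected"
	Disconnected ICEState = "disconnected"
	Failed       ICEState = "failed"
)

// shouldTerminate reports whether the server should close the peer
// connection on entering state s. Treating "disconnected" as terminal
// is the policy proposed here; a more lenient policy would wait for
// "failed" only.
func shouldTerminate(s ICEState) bool {
	return s == Disconnected || s == Failed
}

// countRecoveries counts "disconnected -> connected" transitions in a
// sequence of observed states, i.e. the sessions that would have
// recovered had the connection not been torn down.
func countRecoveries(states []ICEState) int {
	n := 0
	for i := 1; i < len(states); i++ {
		if states[i-1] == Disconnected && states[i] == Connected {
			n++
		}
	}
	return n
}

func main() {
	// Made-up state sequence for one session.
	session := []ICEState{Connected, Disconnected, Connected, Disconnected, Failed}

	for _, s := range session {
		if shouldTerminate(s) {
			fmt.Println("terminate on:", s)
			break
		}
	}
	fmt.Println("recoveries in full sequence:", countRecoveries(session))
}
```

In a real server the check would presumably live in the library's ICE connection state callback, with the peerconnection closed when it returns true.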

j0sh avatar Jun 18 '25 00:06 j0sh

So what will happen now? The frontend broadcast component will retry it, right?

If the app is still there, yes it should, though that is rare. (See the dashboards that I shared in Discord; typically less than 1% of user connections exhibit this behavior.) More often, the app just goes away, which is why we just terminate the stream now.

Testing with the Daydream app does show that it reconnects on a disconnect so that seems okay.

On a mechanical level: the app enters the disconnected state because there is no connectivity between the app and the server. While the server will tear down the connection with this PR, the DTLS close_notify usually won't make it to the client because there is no connectivity. My tiny client app times out into a failed state a few seconds after disconnect.

I am not sure exactly what conditions make Daydream retry, but that seems to be OK.

j0sh avatar Jun 18 '25 18:06 j0sh

Ok, LGTM

leszko avatar Jun 23 '25 06:06 leszko

I have been re-reviewing the data. The rate of non-terminal disconnects is a bit higher when excluding e2e tests, so I'm now leaning more towards https://github.com/livepeer/go-livepeer/pull/3642

j0sh avatar Jun 24 '25 04:06 j0sh

https://github.com/livepeer/go-livepeer/pull/3642 has been working well, so closing this in favor of that.

j0sh avatar Jul 15 '25 23:07 j0sh