go-livepeer icon indicating copy to clipboard operation
go-livepeer copied to clipboard

Check ICE State before suspending Orchestrator

Open leszko opened this issue 8 months ago • 3 comments

Before sending a "out segment timeout" error (and suspending Orchestrator), first check if the ICE Connection is connected.

leszko avatar Jun 06 '25 07:06 leszko

Codecov Report

Attention: Patch coverage is 0% with 21 lines in your changes missing coverage. Please review.

Project coverage is 30.77426%. Comparing base (e098564) to head (bcc5662). Report is 44 commits behind head on master.

Files with missing lines Patch % Lines
server/ai_live_video.go 0.00000% 9 Missing :warning:
media/whip_connection.go 0.00000% 8 Missing :warning:
media/whip_server.go 0.00000% 2 Missing :warning:
server/ai_mediaserver.go 0.00000% 2 Missing :warning:
Additional details and impacted files

Impacted file tree graph

@@                 Coverage Diff                 @@
##              master       #3614         +/-   ##
===================================================
- Coverage   30.78230%   30.77426%   -0.00804%     
===================================================
  Files            153         153                 
  Lines          45916       45928         +12     
===================================================
  Hits           14134       14134                 
- Misses         30958       30970         +12     
  Partials         824         824                 
Files with missing lines Coverage Δ
server/ai_process.go 1.67926% <ø> (ø)
media/whip_server.go 0.00000% <0.00000%> (ø)
server/ai_mediaserver.go 4.71597% <0.00000%> (-0.00506%) :arrow_down:
media/whip_connection.go 76.70455% <0.00000%> (-3.65259%) :arrow_down:
server/ai_live_video.go 0.00000% <0.00000%> (ø)

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update e098564...bcc5662. Read the comment docs.

Files with missing lines Coverage Δ
server/ai_process.go 1.67926% <ø> (ø)
media/whip_server.go 0.00000% <0.00000%> (ø)
server/ai_mediaserver.go 4.71597% <0.00000%> (-0.00506%) :arrow_down:
media/whip_connection.go 76.70455% <0.00000%> (-3.65259%) :arrow_down:
server/ai_live_video.go 0.00000% <0.00000%> (ø)
:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Jun 06 '25 08:06 codecov[bot]

Thanks for the comments Josh!

Thanks for this PR - in addition to the comment about reordering the conn, is this PR still needed now that we have #3613 ? Specifically, I am curious about what exactly this PR fixes, eg what are the the scenarios when:

  1. we time out via the 15 second segment timeout
  2. we do not have an ICE timeout
  3. the state is not connected and we haven't cleaned up the stream

I think that we still do need it. Look at this example:

12:52:24.504 ice connection state changed state=disconnected
12:52:33.889 copy operation timed out: context deadline exceeded 
12:52:33.889 ice connection state changed state=closed

So, we may either further reduce the ICE Timeout or use the ICE state to decide what to do if the out segment does not arrive by the given time. I think that from the user perspective waiting 15s for the out segment is a lot. Wdyt?

But rather than checking for specific ICE states, which I feel are bound to change, we probably should tighten up our overall notion of when a stream is actually connected and when it isn't; I started a bit of that work in #3590 . This might also involve stricter ICE (peerconnection) checks, eg explicitly closing on the disconnected and failed states but we need to do a little more verification before going that far; I'll try to get that in shortly.

Yeah, I think it's a good idea. We need some general conditions and rules of when the stream is "connected". And the same for how long we "tolerate" out segments not coming back to the user before we stop or restart the whole piepeline.

leszko avatar Jun 10 '25 07:06 leszko

12:52:24.504 ice connection state changed state=disconnected
12:52:33.889 copy operation timed out: context deadline exceeded 
12:52:33.889 ice connection state changed state=closed

use the ICE state to decide what to do if the out segment does not arrive by the given time. I think that from the user perspective waiting 15s for the out segment is a lot

In this case, input stopped coming in (state=disconnected). So we are not going to get output segments, of course.

It feels odd to attribute stream termination to "no output segments" when the issue is we actually had no input. We really should tighten up our criteria of what's considered connected and tear down the rest of the stream sooner than that.

Consider the case when we do orchestrator swapping (instead of stream termination like we do now). The output segment times out, so we swap orchestrators ... then what? Wait some more until we swap again or have an official disconnect?

And the same for how long we "tolerate" out segments not coming back to the user before we stop or restart the whole piepeline.

Agreed, but we should also have a clear distinction when the issue is actually from the O itself, rather than some other input issue.

j0sh avatar Jun 11 '25 06:06 j0sh

Closing for now, since there are other PRs from @j0sh that covers this.

leszko avatar Jul 15 '25 13:07 leszko