go-livepeer icon indicating copy to clipboard operation
go-livepeer copied to clipboard

Remove sess from selector while removing it from the pool

Open leszko opened this issue 6 months ago • 12 comments

In the context of Realtime Video it's a refactor. In general I think we should always remove the session from the selector when we remove it from the pool.

@ad-astra-video what do you think?

leszko avatar May 21 '25 14:05 leszko

Codecov Report

:x: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 31.67732%. Comparing base (8baf330) to head (829ca0d).

Files with missing lines Patch % Lines
server/ai_session.go 0.00000% 2 Missing :warning:
Additional details and impacted files

Impacted file tree graph

@@                 Coverage Diff                 @@
##              master       #3581         +/-   ##
===================================================
+ Coverage   31.67570%   31.67732%   +0.00162%     
===================================================
  Files            159         159                 
  Lines          38951       38949          -2     
===================================================
  Hits           12338       12338                 
+ Misses         25721       25719          -2     
  Partials         892         892                 
Files with missing lines Coverage Δ
server/ai_live_video.go 0.00000% <ø> (ø)
server/ai_session.go 7.95107% <0.00000%> (-0.02439%) :arrow_down:

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 8baf330...829ca0d. Read the comment docs.

Files with missing lines Coverage Δ
server/ai_live_video.go 0.00000% <ø> (ø)
server/ai_session.go 7.95107% <0.00000%> (-0.02439%) :arrow_down:
:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar May 21 '25 14:05 codecov[bot]

This just streamlines the selection logic to avoid the for loop in AISessionPool.Select right?

The next .Select on the selector removes the session on next pass but its not entirely clear that is what is happening. When a session is selected from unknownSessions it is removed from the Selector.sessions list and never returned because Complete is never called for the session. Seems kind of silly now that I am looking at it to remove it that way :).

Curious if selecting sessions is failing because of this? A max try of 20 was added recently that could be impacting this.

If we are doing this then some other updates would make sense to include in my opinion:

I think we do not need the selection loop here if the session is removed from the selector https://github.com/livepeer/go-livepeer/blob/218fd2a0b3cdae110d5ff650f0df0b0c3faef14f/server/ai_session.go#L49-L56 This is the only thing that could require a re-selection. Which I think we do not need after this update. https://github.com/livepeer/go-livepeer/blob/218fd2a0b3cdae110d5ff650f0df0b0c3faef14f/server/ai_session.go#L67-L72 We would just need these parts with no for loop wrapping it right? https://github.com/livepeer/go-livepeer/blob/218fd2a0b3cdae110d5ff650f0df0b0c3faef14f/server/ai_session.go#L55-L66 https://github.com/livepeer/go-livepeer/blob/218fd2a0b3cdae110d5ff650f0df0b0c3faef14f/server/ai_session.go#L73-L76 I think would make sense to run this for all selectors I think to enable the Orchestrator to get back in the pool before it empties or is less than half the starting size https://github.com/livepeer/go-livepeer/blob/218fd2a0b3cdae110d5ff650f0df0b0c3faef14f/server/ai_session.go#L260-L262 If have periodic refresh for all selectors, probably do not need this check anymore which can remove the floating lock here too https://github.com/livepeer/go-livepeer/blob/218fd2a0b3cdae110d5ff650f0df0b0c3faef14f/server/ai_session.go#L341-L348

ad-astra-video avatar May 28 '25 15:05 ad-astra-video

This just streamlines the selection logic to avoid the for loop in AISessionPool.Select right?

Hmm, I think it works differently for the Realtime Video AI and for AI Jobs. For Realtime Video we do have periodic session refresh, for AI Jobs we don't.

Anyway, I think it's not related to this PR. This PR just makes sure we also remove the session from the pool while removing it from the selector. @ad-astra-video Do you think we should ever have a session in the pool while not having it in the selector?

leszko avatar Jun 02 '25 08:06 leszko

I don't see a reason to have the session in the selector but not in the pool. It is removed from the selector on the next round of selection anyways if it is selected.

My only conern was the for loop in Select is not needed anymore with this change because the continue can never be hit since the session is removed from the pool sessMap and the selector at the same time.

ad-astra-video avatar Jun 04 '25 22:06 ad-astra-video

for try := 1;

Ahh, I see it now. I think you're right. I removed it. Do you think it's ok now?

leszko avatar Jun 05 '25 06:06 leszko

Holding off with merging this PR, because I still see try=2 for our Realtime Video pipeline, which should not happen according tho the discussion above.

leszko avatar Jun 06 '25 08:06 leszko

Are you still seeing try=2? I think its a race condition because the current sel.warmPool.selector.Remove(params.liveParams.sess.BroadcastSession) is not under the pool.mu.Lock(). WDYT?

Moving the session removal to the Remove function puts it under a lock.

ad-astra-video avatar Jun 19 '25 12:06 ad-astra-video

sel.warmPool.selector.Remove(params.l

Good point. I've fired: https://github.com/livepeer/go-livepeer/pull/3640

Then, we can check if we still see try=2 in the logs. Thanks @ad-astra-video

leszko avatar Jun 23 '25 12:06 leszko

sel.warmPool.selector.Remove(params.l

Good point. I've fired: #3640

Then, we can check if we still see try=2 in the logs. Thanks @ad-astra-video

For some reason, I still see it happening. When I have some time, I'll dig deeper into this.

leszko avatar Jun 25 '25 08:06 leszko

Maybe we can just keep the loop and use a lower count. 5 maybe? Will allow for more time to see it work in the wild and what causes the 2nd try on selection.

ad-astra-video avatar Jul 17 '25 12:07 ad-astra-video

Maybe we can just keep the loop and use a lower count. 5 maybe? Will allow for more time to see it work in the wild and what causes the 2nd try on selection.

Ok, changed to 5. PTAL.

leszko avatar Jul 17 '25 12:07 leszko

@leszko I lost track of this. Sorry about that.

ad-astra-video avatar Nov 06 '25 12:11 ad-astra-video