go-livepeer
go-livepeer copied to clipboard
Disconnecting a remote T causes O to lose all transcoders
Describe the bug
When operating an orchestrator with multiple remote transcoders, disconnecting one remote T can sometimes cause orchestrator to lose all transcoders and go into a no transcoders available
state.
To Reproduce Steps to reproduce the behavior:
- Run standalone orchestrator
- Run two transcoder processes on localhost with orchestrator
- Run a third transcoder process on a separate IP address connected to the same orchestrator.
- Stop the third transcoder process while orchestrator has streams running on all three transcoders.
- Orchestrator logs are attached. orchestratorlogs.txt local-transcoder2.txt remote-transcoder3.txt
Expected behavior The stopped transcoder should disconnect without affecting workflow to the other transcoders
Desktop (please complete the following information):
- OS: Ubunutu Linux 22.04.01 on both machines
- Version 0.5.34
Additional context See attached logs. orchestratorlogs.txt shows the orchestrator logs during this time period.
- We stopped remote transcoder 70.132.135.145:34772 on purpose using
systemctl stop
Sep 21 14:39:39 whitebox systemd[1]: Stopping Livepeer Trancoder 1 Service...
Sep 21 14:39:39 whitebox livepeer[3963]: I0921 14:39:39.183235 3963 ot_rpc.go:116] Exiting Livepeer Transcoder: terminated
Sep 21 14:39:39 whitebox livepeer[3963]: I0921 14:39:39.183300 3963 ot_rpc.go:128] End of stream receive cycle because of err="execution interrupted", waiting for running transcode jobs to complete
Sep 21 14:39:39 whitebox livepeer[3963]: I0921 14:39:39.183372 3963 ot_rpc.go:60] Unregistering transcoder: execution interrupted
Sep 21 14:39:39 whitebox livepeer[3963]: I0921 14:39:39.183377 3963 ot_rpc.go:62] Terminating transcoder because of execution interrupted
Sep 21 14:41:09 whitebox systemd[1]: transcoder1.service: State 'stop-sigterm' timed out. Killing.
Sep 21 14:41:09 whitebox systemd[1]: transcoder1.service: Killing process 3963 (livepeer) with signal SIGKILL.
-
no transcoders available
begins to show atSep 21 14:39:49
Sep 21 14:39:48 blackbox livepeer[358249]: I0921 14:39:48.949666 358249 orchestrator.go:770] manifestID=e101ced7-65ea-4df6-96ae-31e2f143fcde seqNo=750 orchSessionID=da65b479 clientIP=0.0.0.130 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Successfully received results from remote transcoder=127.0.0.1:59952 segments=2 taskId=1309469 fname=https://70.132.135.146:8935/stream/da65b479/750.tempfile dur=101.090851ms
Sep 21 14:39:49 blackbox livepeer[358249]: I0921 14:39:49.329143 358249 orchestrator.go:770] manifestID=147bb333-d2e6-4f0c-b6c6-de6eaf0aa93c seqNo=8357 orchSessionID=cf465ddb clientIP=0.0.0.98 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Successfully received results from remote transcoder=127.0.0.1:59944 segments=1 taskId=1309470 fname=https://70.132.135.146:8935/stream/cf465ddb/8357.tempfile dur=53.886134ms
Sep 21 14:39:49 blackbox livepeer[358249]: E0921 14:39:49.373572 358249 orchestrator.go:559] manifestID=d9918feb-bb09-41eb-8668-c8d1c43cd214 seqNo=5 orchSessionID=2aaaf010 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E clientIP=0.0.0.193 Error transcoding segName=https://70.132.135.146:8935/stream/2aaaf010/5.tempfile err="no transcoders available"
Sep 21 14:39:49 blackbox livepeer[358249]: E0921 14:39:49.374196 358249 segment_rpc.go:230] manifestID=d9918feb-bb09-41eb-8668-c8d1c43cd214 seqNo=5 orchSessionID=2aaaf010 clientIP=0.0.0.193 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Could not transcode err="no transcoders available"
Sep 21 14:39:49 blackbox livepeer[358249]: E0921 14:39:49.745008 358249 orchestrator.go:559] manifestID=09ec9btof2jkzzin seqNo=57 orchSessionID=2a965663 clientIP=0.0.0.196 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Error transcoding segName=https://70.132.135.146:8935/stream/2a965663/57.tempfile err="no transcoders available"
Sep 21 14:39:49 blackbox livepeer[358249]: E0921 14:39:49.745111 358249 segment_rpc.go:230] manifestID=09ec9btof2jkzzin seqNo=57 orchSessionID=2a965663 clientIP=0.0.0.196 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Could not transcode err="no transcoders available"
- transcoder1.txt and transcoder2.txt show logs from two transcoders on localhost that stopped receiving work.
Thanks for reporting. Couldn't reproduce this issue on a few initial attempts. Tried with both v0.5.34
and master
.
There are only two spots in code, which could return No transcoders available
error. I suspect there may be a data race somewhere around streamSessions
and remoteTranscoders
maps, maybe it happens only when transcoders are at full capacity.
@eliteprox would appreciate some input to investigate further:
- Is this error fuzzy, or consistently reproducible?
- Is there a difference in capabilities between local and remote transcoders (different GPUs)?
- Is it reproducible with only local transcoders?
- Could you share a full command line for O and Ts?
I believe this to have the same cause as #2706, which is currently fixed.