go-livepeer icon indicating copy to clipboard operation
go-livepeer copied to clipboard

Disconnecting a remote T causes O to lose all transcoders

Open eliteprox opened this issue 2 years ago • 1 comments

Describe the bug

When operating an orchestrator with multiple remote transcoders, disconnecting one remote T can sometimes cause orchestrator to lose all transcoders and go into a no transcoders available state.

To Reproduce Steps to reproduce the behavior:

  1. Run standalone orchestrator
  2. Run two transcoder processes on localhost with orchestrator
  3. Run a third transcoder process on a separate IP address connected to the same orchestrator.
  4. Stop the third transcoder process while orchestrator has streams running on all three transcoders.
  5. Orchestrator logs are attached. orchestratorlogs.txt local-transcoder2.txt remote-transcoder3.txt

Expected behavior The stopped transcoder should disconnect without affecting workflow to the other transcoders

Desktop (please complete the following information):

  • OS: Ubunutu Linux 22.04.01 on both machines
  • Version 0.5.34

Additional context See attached logs. orchestratorlogs.txt shows the orchestrator logs during this time period.

  • We stopped remote transcoder 70.132.135.145:34772 on purpose using systemctl stop
Sep 21 14:39:39 whitebox systemd[1]: Stopping Livepeer Trancoder 1 Service...
Sep 21 14:39:39 whitebox livepeer[3963]: I0921 14:39:39.183235    3963 ot_rpc.go:116] Exiting Livepeer Transcoder: terminated
Sep 21 14:39:39 whitebox livepeer[3963]: I0921 14:39:39.183300    3963 ot_rpc.go:128] End of stream receive cycle because of err="execution interrupted", waiting for running transcode jobs to complete
Sep 21 14:39:39 whitebox livepeer[3963]: I0921 14:39:39.183372    3963 ot_rpc.go:60] Unregistering transcoder: execution interrupted
Sep 21 14:39:39 whitebox livepeer[3963]: I0921 14:39:39.183377    3963 ot_rpc.go:62] Terminating transcoder because of execution interrupted
Sep 21 14:41:09 whitebox systemd[1]: transcoder1.service: State 'stop-sigterm' timed out. Killing.
Sep 21 14:41:09 whitebox systemd[1]: transcoder1.service: Killing process 3963 (livepeer) with signal SIGKILL.
  • no transcoders available begins to show at Sep 21 14:39:49
Sep 21 14:39:48 blackbox livepeer[358249]: I0921 14:39:48.949666  358249 orchestrator.go:770] manifestID=e101ced7-65ea-4df6-96ae-31e2f143fcde seqNo=750 orchSessionID=da65b479 clientIP=0.0.0.130 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Successfully received results from remote transcoder=127.0.0.1:59952 segments=2 taskId=1309469 fname=https://70.132.135.146:8935/stream/da65b479/750.tempfile dur=101.090851ms
Sep 21 14:39:49 blackbox livepeer[358249]: I0921 14:39:49.329143  358249 orchestrator.go:770] manifestID=147bb333-d2e6-4f0c-b6c6-de6eaf0aa93c seqNo=8357 orchSessionID=cf465ddb clientIP=0.0.0.98 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Successfully received results from remote transcoder=127.0.0.1:59944 segments=1 taskId=1309470 fname=https://70.132.135.146:8935/stream/cf465ddb/8357.tempfile dur=53.886134ms
Sep 21 14:39:49 blackbox livepeer[358249]: E0921 14:39:49.373572  358249 orchestrator.go:559] manifestID=d9918feb-bb09-41eb-8668-c8d1c43cd214 seqNo=5 orchSessionID=2aaaf010 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E clientIP=0.0.0.193 Error transcoding segName=https://70.132.135.146:8935/stream/2aaaf010/5.tempfile err="no transcoders available"

Sep 21 14:39:49 blackbox livepeer[358249]: E0921 14:39:49.374196  358249 segment_rpc.go:230] manifestID=d9918feb-bb09-41eb-8668-c8d1c43cd214 seqNo=5 orchSessionID=2aaaf010 clientIP=0.0.0.193 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Could not transcode err="no transcoders available"
Sep 21 14:39:49 blackbox livepeer[358249]: E0921 14:39:49.745008  358249 orchestrator.go:559] manifestID=09ec9btof2jkzzin seqNo=57 orchSessionID=2a965663 clientIP=0.0.0.196 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Error transcoding segName=https://70.132.135.146:8935/stream/2a965663/57.tempfile err="no transcoders available"
Sep 21 14:39:49 blackbox livepeer[358249]: E0921 14:39:49.745111  358249 segment_rpc.go:230] manifestID=09ec9btof2jkzzin seqNo=57 orchSessionID=2a965663 clientIP=0.0.0.196 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Could not transcode err="no transcoders available"
  • transcoder1.txt and transcoder2.txt show logs from two transcoders on localhost that stopped receiving work.

eliteprox avatar Sep 23 '22 01:09 eliteprox

Thanks for reporting. Couldn't reproduce this issue on a few initial attempts. Tried with both v0.5.34 and master. There are only two spots in code, which could return No transcoders available error. I suspect there may be a data race somewhere around streamSessions and remoteTranscoders maps, maybe it happens only when transcoders are at full capacity.

@eliteprox would appreciate some input to investigate further:

  1. Is this error fuzzy, or consistently reproducible?
  2. Is there a difference in capabilities between local and remote transcoders (different GPUs)?
  3. Is it reproducible with only local transcoders?
  4. Could you share a full command line for O and Ts?

cyberj0g avatar Oct 03 '22 09:10 cyberj0g

I believe this to have the same cause as #2706, which is currently fixed.

cyberj0g avatar Jan 17 '23 08:01 cyberj0g