go-livepeer

Capability-based job routing on T

Open cyberj0g opened this issue 3 years ago • 4 comments

Is your feature request related to a problem? Please describe.

The T node runs a capability test on startup for each device. Currently it does so on an 'all-or-nothing' principle: for example, if four GPUs support H.265 and a fifth GPU does not, this T will not receive H.265 jobs.

Describe the solution you'd like

  • Capability test should create a list of capabilities per-device
  • Capabilities and capacities of T should be properly communicated to O (partially implemented by #2312)
  • T load balancer should be aware of each device's capabilities when assigning a job
  • Capability-related logic should support configurable load balancing modes and user session limits, as described in #2034
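The per-device idea in the list above can be sketched roughly as follows. This is only an illustration, not go-livepeer code: the `Capability`, `Device`, `PickDevice` and `NodeCapabilities` names are hypothetical, and "least-loaded capable device" is just one possible load balancing mode.

```go
package main

import (
	"errors"
	"fmt"
)

// Capability is a hypothetical identifier for a feature a device may support.
type Capability string

const (
	CapH264 Capability = "H264"
	CapH265 Capability = "H265"
)

// Device is a hypothetical record of one GPU: the capabilities found by the
// startup test, its current session count and its session limit.
type Device struct {
	ID          string
	Caps        map[Capability]bool
	Sessions    int
	MaxSessions int
}

// ErrNoCapableDevice is returned when no device can take the job.
var ErrNoCapableDevice = errors.New("no capable device with free capacity")

// supports reports whether the device has every required capability.
func (d *Device) supports(required []Capability) bool {
	for _, c := range required {
		if !d.Caps[c] {
			return false
		}
	}
	return true
}

// PickDevice returns the least-loaded device that supports all required
// capabilities and still has free sessions, and reserves one session on it.
func PickDevice(devices []*Device, required []Capability) (*Device, error) {
	var best *Device
	for _, d := range devices {
		if !d.supports(required) || d.Sessions >= d.MaxSessions {
			continue
		}
		if best == nil || d.Sessions < best.Sessions {
			best = d
		}
	}
	if best == nil {
		return nil, ErrNoCapableDevice
	}
	best.Sessions++
	return best, nil
}

// NodeCapabilities returns the union of all device capabilities, i.e. what
// a T could advertise to O once jobs are routed per device.
func NodeCapabilities(devices []*Device) map[Capability]bool {
	union := make(map[Capability]bool)
	for _, d := range devices {
		for c, ok := range d.Caps {
			if ok {
				union[c] = true
			}
		}
	}
	return union
}

func main() {
	devices := []*Device{
		{ID: "gpu0", Caps: map[Capability]bool{CapH264: true, CapH265: true}, Sessions: 2, MaxSessions: 4},
		{ID: "gpu1", Caps: map[Capability]bool{CapH264: true}, Sessions: 0, MaxSessions: 4},
	}
	// An H.265 job lands on gpu0 even though gpu1 is idle, because gpu1 lacks H.265.
	d, err := PickDevice(devices, []Capability{CapH265})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("H.265 job assigned to", d.ID)
}
```

With this shape, a T with one device lacking H.265 would still advertise H.265, and only the capable devices would receive those jobs.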

Remaining questions

  • There is no capacity test currently, does it make sense to run it on startup, similarly to livepeer_bench?
  • When estimating capacity for transcoding, which profiles should we use? Highest resolution or most common?
  • How to properly test for non-transcoding capabilities? Develop a plugin-based capability tester with a custom test for each capability?

cyberj0g avatar Jun 06 '22 11:06 cyberj0g

Added to the list of features to evaluate for Q3

thomshutt avatar Jun 07 '22 09:06 thomshutt

  • There is no capacity test currently, does it make sense to run it on startup, similarly to livepeer_bench?

If this option is used, I want to be sure it's only run on initial startup and results are kept for future starts.

Titan Node runs a capacity test using 'livepeer_bench' with his pool software, and it can take over an hour to complete on machines with multiple GPUs.

Titan's method of testing with 'livepeer_bench' could be sped up. Instead of starting with 1 session and increasing by 1 session until the desired threshold is met, the test could start with 5 sessions and, if it passes, then try 10. When the test results in less than the desired transcoding threshold, decrease by 1 session until an acceptable result is achieved.
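A rough sketch of that coarse-then-fine search, for illustration only. `CoarseThenFine` and the `passes` callback are hypothetical names; in practice `passes` would wrap an actual benchmark run (for example one 'livepeer_bench' invocation at the given concurrency) and compare the result against the desired threshold. Here it is simulated.

```go
package main

import "fmt"

// CoarseThenFine climbs in steps of `step` sessions while the benchmark
// passes, then walks back down one session at a time from the first failing
// count. It returns the highest session count found to pass, or 0 if none did.
// Assumption: if n sessions pass, every smaller count passes as well.
func CoarseThenFine(step, limit int, passes func(sessions int) bool) int {
	if step < 1 {
		step = 1
	}
	lastPass := 0
	n := step
	for n <= limit && passes(n) {
		lastPass = n
		n += step
	}
	if n > limit {
		// Never test above the limit; start the walk-down from the limit itself.
		n = limit + 1
	}
	for k := n - 1; k > lastPass; k-- {
		if passes(k) {
			return k
		}
	}
	return lastPass
}

func main() {
	runs := 0
	// Simulated benchmark: this imaginary device keeps up with at most 13 sessions.
	simulated := func(sessions int) bool {
		runs++
		return sessions <= 13
	}
	capacity := CoarseThenFine(5, 64, simulated)
	fmt.Println("capacity:", capacity, "benchmark runs:", runs)
}
```

In the simulated case above the search tries 5, 10, 15, 14 and 13 sessions, so it needs 5 benchmark runs where a one-by-one climb would need 14.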

Advanced users who already know the capacities and capabilities of their hardware (especially useful for the many Os that currently run multiple nodes) should have the option to add this info manually via the CLI or a config file.

papabear99 avatar Jun 07 '22 16:06 papabear99

Good point @papabear99. If we implement a capacity test, we'll be using shorter segments and something like a binary search over the number of sessions. I think it may take less than a minute to run.
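As a sketch of what "binary search over the number of sessions" could look like (an illustration, not the planned implementation): `MaxPassingSessions` and the `passes` callback are hypothetical, and the search assumes that if a given session count passes, every smaller count passes too.

```go
package main

import "fmt"

// MaxPassingSessions returns the largest session count in [0, limit] for
// which passes reports true, using binary search. passes stands in for one
// short benchmark run at the given concurrency.
func MaxPassingSessions(limit int, passes func(sessions int) bool) int {
	lo, hi := 0, limit // invariant: lo is known (or assumed) to pass, the answer is in [lo, hi]
	for lo < hi {
		mid := lo + (hi-lo+1)/2 // round up so the range always shrinks
		if passes(mid) {
			lo = mid
		} else {
			hi = mid - 1
		}
	}
	return lo
}

func main() {
	runs := 0
	// Simulated benchmark: this imaginary device keeps up with at most 13 sessions.
	simulated := func(sessions int) bool {
		runs++
		return sessions <= 13
	}
	capacity := MaxPassingSessions(64, simulated)
	fmt.Println("capacity:", capacity, "benchmark runs:", runs)
}
```

The number of benchmark runs grows with the logarithm of the limit, so even a generous upper bound adds only a few extra runs.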

cyberj0g avatar Jun 07 '22 17:06 cyberj0g

if the test starts with 5 sessions if it passes then try 10

This is actually a good idea. I might implement a faster benchmarking process for the pool software.

we'll be using shorter segments

And this is also a great idea. I didn't even think about using fewer segments for each session limit in the benchmark.

This is great stuff!

Titan-Node avatar Jun 08 '22 03:06 Titan-Node