go-livepeer Capability-based job routing on T

Is your feature request related to a problem? Please describe. The T node runs a capability test on startup for each device. Currently, it does so on an 'all-or-nothing' principle: e.g. if four GPUs support H.265 and the fifth GPU does not, this T will not receive H.265 jobs.

Describe the solution you'd like

Capability test should create a list of capabilities per-device
Capabilities and capacities of T should be properly communicated to O (partially implemented by #2312)
T load balancer should be aware of each device's capabilities when assigning a job
Capability-related logic should support configurable load balancing modes and user session limits, as described in #2034
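
To make the per-device idea concrete, here is a minimal sketch in Go. It is not go-livepeer code: `Capability`, `Device`, `assignJob` and `nodeCapabilities` are hypothetical names made up for illustration, and the "least-loaded capable device" policy is just one possible load balancing mode.

```go
package main

import (
	"errors"
	"fmt"
)

// Capability is a hypothetical identifier for something a device can do.
type Capability string

const (
	CapH264 Capability = "H.264"
	CapH265 Capability = "H.265"
)

// Device is a hypothetical per-GPU record filled in by the startup tests.
type Device struct {
	ID          string
	Caps        map[Capability]bool // result of the per-device capability test
	Sessions    int                 // sessions currently assigned
	MaxSessions int                 // capacity of this device
}

var ErrNoCapableDevice = errors.New("no device with free capacity supports the capability")

// assignJob picks the least-loaded device that supports c and still has
// free capacity, and records the new session on it.
func assignJob(devices []*Device, c Capability) (*Device, error) {
	var best *Device
	for _, d := range devices {
		if !d.Caps[c] || d.Sessions >= d.MaxSessions {
			continue
		}
		if best == nil || d.Sessions < best.Sessions {
			best = d
		}
	}
	if best == nil {
		return nil, ErrNoCapableDevice
	}
	best.Sessions++
	return best, nil
}

// nodeCapabilities returns the union of capabilities over all devices,
// i.e. what T could advertise to O instead of the all-or-nothing result.
func nodeCapabilities(devices []*Device) map[Capability]bool {
	out := map[Capability]bool{}
	for _, d := range devices {
		for c, ok := range d.Caps {
			if ok {
				out[c] = true
			}
		}
	}
	return out
}

func main() {
	// Four GPUs pass the H.265 test; the fifth supports H.264 only.
	var devices []*Device
	for i := 0; i < 5; i++ {
		caps := map[Capability]bool{CapH264: true, CapH265: i < 4}
		devices = append(devices, &Device{ID: fmt.Sprintf("gpu%d", i), Caps: caps, MaxSessions: 2})
	}
	fmt.Println("advertises H.265:", nodeCapabilities(devices)[CapH265])
	d, err := assignJob(devices, CapH265)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("H.265 job assigned to", d.ID)
}
```

With this shape, the fifth GPU never receives an H.265 job but still takes H.264 work, and the H.265 capacity that T reports would be the sum over the four capable devices only.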

Remaining questions

There is no capacity test currently, does it make sense to run it on startup, similarly to livepeer_bench?
When estimating capacity for transcoding, which profiles should we use? Highest resolution or most common?
How to properly test for non-transcoding capabilities? Develop a plugin-based capability tester with custom test for each capability?

Jun 06 '22 11:06 cyberj0g

Added to the list of features to evaluate for Q3

Jun 07 '22 09:06 thomshutt

There is no capacity test currently, does it make sense to run it on startup, similarly to livepeer_bench?

If this option is used, I want to be sure it's only run on initial startup and results are kept for future starts.

Titan Node's pool software runs a capacity test using 'livepeer_bench', and it can take over an hour to complete on machines with multiple GPUs.

Titan's method of testing using 'livepeer_bench' could be sped up. Instead of starting with 1 session and increasing by 1 session until the desired threshold is met, the test could start with 5 sessions and, if it passes, then try 10. When the test results in less than the desired transcoding threshold, decrease by 1 session until an acceptable result is achieved.
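
A rough sketch of that step-up-then-step-down search, in Go. The `passes` callback is a placeholder for "run the benchmark with n concurrent sessions and check the result against the desired threshold"; it is not a real livepeer_bench API. The sketch also assumes that if n sessions pass, any smaller number passes too.

```go
package main

import "fmt"

// coarseSearch climbs in increments of step while the benchmark passes,
// then walks down one session at a time from just below the first failing
// count. It returns the highest passing session count, or 0 if none pass.
// limit is an upper bound on the number of sessions to try.
func coarseSearch(passes func(n int) bool, step, limit int) int {
	if step < 1 {
		step = 1
	}
	lastGood := 0
	n := step
	for n <= limit && passes(n) {
		lastGood = n
		n += step
	}
	// n either failed or is past the limit; the answer lies in (lastGood, n).
	hi := n - 1
	if hi > limit {
		hi = limit
	}
	for c := hi; c > lastGood; c-- {
		if passes(c) {
			return c
		}
	}
	return lastGood
}

func main() {
	runs := 0
	// Stand-in for a real benchmark run: pretend the machine handles 13 sessions.
	passes := func(n int) bool {
		runs++
		return n <= 13
	}
	max := coarseSearch(passes, 5, 64)
	fmt.Printf("max sessions: %d after %d benchmark runs\n", max, runs)
}
```

For a machine that tops out at 13 sessions this tries 5, 10, 15, 14 and 13, so five benchmark runs instead of the fourteen needed when stepping up by one from 1.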

Advanced users who already know the capacities and capabilities of their hardware (especially useful for the many O that currently run multiple nodes) should have the option to add this info manually via the CLI or a config file.

Jun 07 '22 16:06 papabear99

Good point @papabear99. If we implement a capacity test, we'll be using shorter segments and something like binary search over the number of sessions, so I think it may take less than a minute to run.
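
For illustration, a binary search over the number of sessions could look roughly like the sketch below. Nothing here exists in go-livepeer yet: `passes` is a hypothetical callback that would transcode short segments with n concurrent sessions and report whether the result met the threshold, and `limit` is an assumed upper bound. It relies on the test being monotonic (if n sessions pass, fewer sessions pass too).

```go
package main

import "fmt"

// maxSessions returns the largest n in [0, limit] for which passes(n) is
// true, using about log2(limit) benchmark runs.
func maxSessions(passes func(n int) bool, limit int) int {
	lo, hi := 0, limit // invariant: lo is known good (0 sessions trivially), everything above hi fails
	for lo < hi {
		mid := lo + (hi-lo+1)/2 // round up so the range always shrinks
		if passes(mid) {
			lo = mid
		} else {
			hi = mid - 1
		}
	}
	return lo
}

func main() {
	runs := 0
	// Stand-in for a real benchmark run: pretend the device handles 13 sessions.
	passes := func(n int) bool {
		runs++
		return n <= 13
	}
	fmt.Printf("max sessions: %d after %d benchmark runs\n", maxSessions(passes, 64), runs)
}
```

With a limit of 64 sessions this needs at most 7 benchmark runs per device, so with short segments the whole test stays small even on multi-GPU machines.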

Jun 07 '22 17:06 cyberj0g

if the test starts with 5 sessions if it passes then try 10

This is actually a good idea. I might implement a faster benchmarking process for the pool software.

we'll be using shorter segments

And this is also a great idea. I didn't even think about using fewer segments for each session limit in the benchmark.

This is great stuff!

Jun 08 '22 03:06 Titan-Node