Capability-based job routing on T
Is your feature request related to a problem? Please describe.
The T node runs a capability test on startup for each device. Currently, it does so on an 'all-or-nothing' principle: e.g. if four GPUs support H.265 and a fifth GPU does not, this T will not receive H.265 jobs.
Describe the solution you'd like
- Capability test should create a list of capabilities per-device
- Capabilities and capacities of T should be properly communicated to O (partially implemented by #2312)
- T load balancer should be aware of each device's capabilities when assigning a job
- Capability-related logic should support configurable load balancing modes and user session limits, as described in #2034
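
To make the per-device idea concrete, here is a minimal Python sketch, for illustration only. `Device`, `node_capabilities` and `pick_device` are hypothetical names and not existing go-livepeer code, and the "least relative load" rule is just one possible balancing mode, assumed here for simplicity.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class Device:
    """Hypothetical record produced by a per-device capability test."""
    name: str
    capabilities: frozenset      # e.g. frozenset({"H264", "H265"})
    max_sessions: int            # capacity of this device
    active_sessions: int = 0     # sessions currently assigned


def node_capabilities(devices: Iterable[Device]) -> Set[str]:
    """Union of per-device capabilities: what T could advertise to O."""
    caps: Set[str] = set()
    for d in devices:
        caps |= d.capabilities
    return caps


def pick_device(devices: List[Device], required: Iterable[str]) -> Optional[Device]:
    """Return the least-loaded device that supports every required
    capability and still has spare capacity, or None if there is none."""
    required = set(required)
    candidates = [
        d for d in devices
        if required <= d.capabilities and d.active_sessions < d.max_sessions
    ]
    if not candidates:
        return None
    # Assumed balancing mode: lowest fraction of capacity in use.
    return min(candidates, key=lambda d: d.active_sessions / d.max_sessions)
```

With four H.265-capable GPUs and a fifth that only does H.264, the node would still advertise H.265, and `pick_device` would simply never hand an H.265 job to the fifth GPU.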
Remaining questions
- There is no capacity test currently, does it make sense to run it on startup, similarly to livepeer_bench?
- When estimating capacity for transcoding, which profiles should we use? Highest resolution or most common?
- How to properly test for non-transcoding capabilities? Develop a plugin-based capability tester with custom test for each capability?
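
One way the plugin-based tester from the last question could look, as a rough Python sketch. Everything here is hypothetical (`CAPABILITY_TESTS`, `capability_test`, `probe_device`), and the two registered tests are stand-ins that do not touch real hardware.

```python
from typing import Callable, Dict, Set

# Registry mapping a capability name to its test function (hypothetical).
CAPABILITY_TESTS: Dict[str, Callable[[str], bool]] = {}


def capability_test(name: str):
    """Decorator that registers a test function under a capability name."""
    def register(fn):
        CAPABILITY_TESTS[name] = fn
        return fn
    return register


def probe_device(device_id: str) -> Set[str]:
    """Run every registered test against one device and collect the
    names of the capabilities whose test passed."""
    supported = set()
    for name, test in CAPABILITY_TESTS.items():
        try:
            if test(device_id):
                supported.add(name)
        except Exception:
            # A test that crashes is treated as "capability not supported".
            continue
    return supported


# Stand-in tests for illustration only; real ones would exercise the device.
@capability_test("H264")
def _test_h264(device_id: str) -> bool:
    return True


@capability_test("H265")
def _test_h265(device_id: str) -> bool:
    return device_id != "gpu4"  # pretend the fifth GPU lacks H.265
```

A new capability, transcoding or not, would then only need its own registered test function; the startup probe itself would not change.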
Added to the list of features to evaluate for Q3
> There is no capacity test currently, does it make sense to run it on startup, similarly to livepeer_bench?
If this option is used, I want to be sure it's only run on initial startup and results are kept for future starts.
Titan Node runs a capacity test using 'livepeer_bench' with his pool software, and it can take over an hour to complete on machines with multiple GPUs.
Titan's method of testing using 'livepeer_bench' could be sped up if, instead of starting with 1 session and increasing by 1 session until the desired threshold is met, the test starts with 5 sessions and, if it passes, then tries 10. When the test result falls below the desired transcoding threshold, decrease by 1 session until an acceptable result is achieved.
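
A rough Python sketch of that coarse-then-fine search. `coarse_then_fine` and `passes` are hypothetical names; `passes(n)` stands for "run the benchmark with n concurrent sessions and report whether it met the threshold". The sketch assumes that if n sessions fail, any larger number fails too.

```python
from typing import Callable


def coarse_then_fine(passes: Callable[[int], bool], step: int = 5, limit: int = 100) -> int:
    """Largest session count in [0, limit] that passes, found by climbing
    in steps of `step` and then walking back down one session at a time."""
    last_pass = 0
    n = step
    # Coarse phase: 5, 10, 15, ... until a run fails or the limit is hit.
    while n <= limit and passes(n):
        last_pass = n
        n += step
    # Fine phase: start just below the failed (or out-of-range) value and
    # decrease by 1; stop early at the last value already known to pass.
    n = min(n - 1, limit)
    while n > last_pass and not passes(n):
        n -= 1
    return n
```

For a device whose real limit is 12 sessions, this runs the benchmark at 5, 10, 15, 14, 13 and 12 sessions instead of at every count from 1 to 13.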
Advanced users who already know the capacities and capabilities of their hardware (especially useful for the many O that currently run multiple nodes) should have the option to add this info manually via the CLI or a config file.
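
Roughly what I have in mind for the "run once, keep the results, allow manual override" behaviour, as a Python sketch. The function name, the JSON cache file and the dict layout are all assumptions made up for this example, not an existing format.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Optional


def resolve_capacity(
    manual: Optional[Dict[str, int]],
    cache_path: Path,
    run_benchmark: Callable[[], Dict[str, int]],
) -> Dict[str, int]:
    """Return per-device session limits, benchmarking only when needed."""
    if manual is not None:
        # Operator-supplied values win; no benchmark is run.
        return manual
    if cache_path.exists():
        # Results from an earlier start are reused as-is.
        return json.loads(cache_path.read_text())
    # First start: run the (slow) benchmark once and remember the result.
    result = run_benchmark()
    cache_path.write_text(json.dumps(result))
    return result
```

Deleting the cache file would be the way to force a re-test, for example after a hardware change.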
Good point @papabear99. If we implement a capacity test, we'll be using shorter segments and something like a binary search over the number of sessions, so I think it may take less than a minute to run.
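
A minimal sketch of the binary search part, in Python for brevity. `find_max_sessions` and `passes` are hypothetical names, `passes(n)` stands for one short benchmark run with n concurrent sessions, and the search assumes that once n sessions fail, every larger count fails as well.

```python
from typing import Callable


def find_max_sessions(passes: Callable[[int], bool], limit: int = 64) -> int:
    """Largest n in [0, limit] for which passes(n) is True."""
    lo, hi = 0, limit  # invariant: the answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the range always shrinks
        if passes(mid):
            lo = mid       # mid sessions are fine, look higher
        else:
            hi = mid - 1   # mid sessions are too many, look lower
    return lo
```

The number of benchmark runs grows with log2 of `limit` rather than with the capacity itself, which is where most of the time saving would come from.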
> if the test starts with 5 sessions if it passes then try 10
This is actually a good idea. I might implement a faster benchmarking process for the pool software.
> we'll be using shorter segments
And this is also a great idea. I didn't even think about using fewer segments for each session limit in the benchmark.
This is great stuff!