Enable setting max sessions on a per-GPU basis with the addition of a -perGpuMax parameter
Is your feature request related to a problem? Please describe.
Currently, it is not possible to specify the maximum number of sessions on a per-GPU basis. This is a problem because an Orchestrator/Transcoder running a node with multiple GPUs of differing capacities can't optimize the number of sessions the node can handle.
Describe the solution you'd like
I would like to have the option to set a -perGpuMax value in Livepeer. This would allow an Orchestrator/Transcoder to set a per-card -maxSessions value. The -perGpuMax values would be set in conjunction with the -nvidia flag:
-nvidia 0,1,2 -perGpuMax 18,15,22
In the above example, GPU 0 would be set to 18 max sessions, GPU 1 to 15, and GPU 2 to 22.
Ideally the sum of the -perGpuMax would be automatically used to set the global -maxSessions value.
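For illustration, here is a minimal sketch in Go (go-livepeer's language) of how a hypothetical -perGpuMax flag could be parsed alongside -nvidia and summed into the global -maxSessions value. The function and flag handling are assumptions for the sake of the example, not existing go-livepeer code.

```go
// Illustrative only: parse a hypothetical -perGpuMax flag that mirrors the
// comma-separated device list given to -nvidia, and derive the global
// -maxSessions value as the sum of the per-GPU limits.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePerGpuMax pairs each device ID from -nvidia with its session limit
// from -perGpuMax and returns the per-device map plus the implied global max.
func parsePerGpuMax(nvidia, perGpuMax string) (map[string]int, int, error) {
	devices := strings.Split(nvidia, ",")
	limits := strings.Split(perGpuMax, ",")
	if len(devices) != len(limits) {
		return nil, 0, fmt.Errorf("-perGpuMax needs one value per -nvidia device (%d != %d)", len(limits), len(devices))
	}
	perDevice := make(map[string]int, len(devices))
	total := 0
	for i, d := range devices {
		n, err := strconv.Atoi(strings.TrimSpace(limits[i]))
		if err != nil || n <= 0 {
			return nil, 0, fmt.Errorf("invalid session limit %q for GPU %s", limits[i], d)
		}
		perDevice[strings.TrimSpace(d)] = n
		total += n
	}
	return perDevice, total, nil
}

func main() {
	// -nvidia 0,1,2 -perGpuMax 18,15,22 from the example above
	perDevice, maxSessions, err := parsePerGpuMax("0,1,2", "18,15,22")
	if err != nil {
		panic(err)
	}
	fmt.Println(perDevice, "global maxSessions:", maxSessions) // map[0:18 1:15 2:22] global maxSessions: 55
}
```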
Describe alternatives you've considered
The current method to accomplish this is to run multiple instances of Livepeer on a single computer so values can be set on a per-card basis. Adding the -perGpuMax parameter would simplify setup for multi-GPU Orchestrators/Transcoders whose GPUs have varying capacities, and would be a more elegant solution than running multiple instances of Livepeer just to get a more granular way of optimizing hardware.
Additional context
This could be expanded to also allow other per-GPU settings, e.g. marking which GPUs are AI capable, preferring GPUs with newer NVENCs for higher-quality output, or assigning jobs to GPUs in order of preference, i.e. -gpuPreference 2,0,1
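As a rough sketch of the preference idea, assuming a hypothetical -gpuPreference flag whose value is just a reordering of the -nvidia device list (names here are illustrative only):

```go
// Illustrative only: reorder the configured devices by a hypothetical
// -gpuPreference list so new sessions are offered to preferred GPUs first.
package main

import (
	"fmt"
	"strings"
)

// orderByPreference returns the device IDs in the order given by pref;
// any configured device missing from pref is appended at the end.
func orderByPreference(nvidia, pref string) []string {
	configured := strings.Split(nvidia, ",")
	preferred := strings.Split(pref, ",")
	seen := make(map[string]bool)
	ordered := make([]string, 0, len(configured))
	for _, p := range preferred {
		for _, d := range configured {
			if d == p && !seen[d] {
				ordered = append(ordered, d)
				seen[d] = true
			}
		}
	}
	for _, d := range configured {
		if !seen[d] {
			ordered = append(ordered, d)
		}
	}
	return ordered
}

func main() {
	// -nvidia 0,1,2 -gpuPreference 2,0,1 from the example above
	fmt.Println(orderByPreference("0,1,2", "2,0,1")) // [2 0 1]
}
```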
I second this. The ability to define how many max sessions each GPU should attempt, without having to invoke separate transcode-only processes, makes management easier.
@papabear99
Seems like there are two issues here:
- set per GPU Max
- specify GPU preference
is that right?
Seems like specifying GPU preference is a rabbit hole - what if you prefer one GPU for certain job types and another for other job types?
I see your point regarding setting preferences for different task types, and I agree. However, for setups with GPUs of different speeds, I would still find it beneficial to be able to set a GPU preference for all jobs, so that work is only sent to lower-priority GPUs once the preferred ones are full.
e.g. I have 3 GPUs in my O/T: 2x GTX 1070 and 1x RTX 4000. The GTX 1070s are ~33% faster than the RTX 4000, so I would prefer, and think it would benefit the network, if the 1070s were prioritized and work were only sent to the RTX 4000 when the 1070s were at their specified capacity.
> Seems like specifying GPU preference is a rabbit hole - what if you prefer one GPU for certain job types and another for other job types?
Why so? If Livepeer will have differing payloads in the future, say Inference / Tensor / ML, besides video transcoding, O's should be able to prioritize by GPU and payload type. Enabling freedom of choice for resource allocation to O's has no downsides.
> If Livepeer will have differing payloads in the future, say Inference / Tensor / ML, besides video transcoding, O's should be able to prioritize by GPU and payload type.
Yeah, this is exactly my point. With multiple GPUs and multiple job types, it becomes pretty complex pretty rapidly. This is absolutely where we should be headed; it's just a nontrivial challenge.
It's a good idea to have per-device session limits; the -maxSessions attribute could be reused for that. It also feels like this feature would be most useful when paired with the ability to switch the load balancing strategy from 'select the least loaded device' to 'select devices in the order they are specified, unless at capacity', e.g.
livepeer ... -nvidia 0,1,2 -maxSessions 20,15,10 -lbMode load|priority
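To make the two modes concrete, here is a minimal sketch, assuming a simple device struct that tracks current sessions and a per-device cap; this is illustrative only, not the existing load balancer code.

```go
// Illustrative only: two session-placement strategies behind a hypothetical
// -lbMode flag, assuming each device tracks its current load and its cap.
package main

import "fmt"

type device struct {
	id       string
	sessions int // sessions currently running on this device
	cap      int // per-device limit, e.g. from -maxSessions 20,15,10
}

// selectLoad picks the device with the most free capacity (least loaded).
func selectLoad(devs []*device) *device {
	var best *device
	for _, d := range devs {
		if d.sessions >= d.cap {
			continue
		}
		if best == nil || (d.cap-d.sessions) > (best.cap-best.sessions) {
			best = d
		}
	}
	return best
}

// selectPriority walks devices in the order they were specified and picks
// the first one that still has headroom.
func selectPriority(devs []*device) *device {
	for _, d := range devs {
		if d.sessions < d.cap {
			return d
		}
	}
	return nil
}

func main() {
	devs := []*device{
		{id: "0", sessions: 18, cap: 20},
		{id: "1", sessions: 2, cap: 15},
		{id: "2", sessions: 9, cap: 10},
	}
	fmt.Println(selectLoad(devs).id)     // "1": most free capacity (13 slots)
	fmt.Println(selectPriority(devs).id) // "0": first listed device with headroom
}
```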
The downside is that one needs to know how many sessions each device could handle. It can be tested with livepeer_bench for standard transcoding, but would be much harder to properly estimate for custom capabilities.
There's already basic support on O for capacity- and capability-based job routing; it just requires a bit more work on the T side for per-device capability accounting and capability-based load balancing.
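For illustration, one possible shape for per-device capability accounting, purely a sketch with assumed names rather than the actual O/T implementation:

```go
// Illustrative only: a minimal shape for per-device capability accounting,
// so jobs can be routed to a device that both supports the required
// capability and still has session headroom.
package main

import "fmt"

type gpu struct {
	id           string
	capabilities map[string]bool // e.g. "H264", "AI"
	sessions     int
	cap          int
}

// pickFor returns the first device (in priority order) that advertises the
// required capability and is below its session limit.
func pickFor(capability string, devs []*gpu) *gpu {
	for _, d := range devs {
		if d.capabilities[capability] && d.sessions < d.cap {
			return d
		}
	}
	return nil
}

func main() {
	devs := []*gpu{
		{id: "0", capabilities: map[string]bool{"H264": true}, sessions: 18, cap: 18},
		{id: "1", capabilities: map[string]bool{"H264": true, "AI": true}, sessions: 3, cap: 15},
	}
	if d := pickFor("H264", devs); d != nil {
		fmt.Println("route transcode job to GPU", d.id) // GPU 1, since GPU 0 is at capacity
	}
}
```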
Sounds great!
Thanks @cyberj0g and @papabear99, added to the list of features to evaluate for Q3
> livepeer ... -nvidia 0,1,2 -maxSessions 20,15,10 -lbMode load|priority
This is a great idea! Running a bench per GPU is already what most people do to compare performance of each GPU anyway.