go-livepeer Enable setting Max Sessions on a per-GPU basis with the addition of a -perGpuMax parameter

Is your feature request related to a problem? Please describe.

Currently, it is not possible to specify the maximum sessions on a per GPU basis. This presents a problem in that an Orchestrator/Transcoder can't optimize the number of sessions their node can handle when running a node with multiple GPUs of different capacities.

Describe the solution you'd like

I would like to have the option to set a -perGpuMax value in Livepeer. This would allow an Orchestrator/Transcoder to set a per-card -maxSessions value. The -perGpuMax values would be set in conjunction with the -nvidia flag, with one value per listed GPU, in the same order.

-nvidia 0,1,2 -perGpuMax 18,15,22

In the above, GPU 0 would be set to 18 max sessions, GPU 1 to 15 max sessions, and GPU 2 to 22 max sessions.

Ideally the sum of the -perGpuMax would be automatically used to set the global -maxSessions value.
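As a rough illustration of the parsing this would involve, here is a minimal Go sketch. It is not taken from go-livepeer: the -perGpuMax flag does not exist yet, and the function name parsePerGPUMax is hypothetical. It pairs each device ID from -nvidia with its limit and sums the limits into a value that could be used as the global -maxSessions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePerGPUMax is a hypothetical helper (not part of go-livepeer).
// It pairs each device ID from -nvidia with the limit at the same position
// in the proposed -perGpuMax value and returns the per-device limits plus
// their sum, which could serve as the global session cap.
func parsePerGPUMax(nvidia, perGPUMax string) (map[string]int, int, error) {
	devices := strings.Split(nvidia, ",")
	limits := strings.Split(perGPUMax, ",")
	if len(devices) != len(limits) {
		return nil, 0, fmt.Errorf("got %d devices but %d limits", len(devices), len(limits))
	}

	perDevice := make(map[string]int, len(devices))
	total := 0
	for i, dev := range devices {
		dev = strings.TrimSpace(dev)
		n, err := strconv.Atoi(strings.TrimSpace(limits[i]))
		if err != nil {
			return nil, 0, fmt.Errorf("invalid limit for GPU %s: %w", dev, err)
		}
		if n < 0 {
			return nil, 0, fmt.Errorf("negative limit for GPU %s", dev)
		}
		if _, dup := perDevice[dev]; dup {
			return nil, 0, fmt.Errorf("GPU %s listed more than once", dev)
		}
		perDevice[dev] = n
		total += n
	}
	return perDevice, total, nil
}

func main() {
	perDevice, total, err := parsePerGPUMax("0,1,2", "18,15,22")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(perDevice, total)
}
```

For the example values above, the summed limit is 18 + 15 + 22 = 55 sessions.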

Describe alternatives you've considered

The current method to accomplish this is to run multiple instances of Livepeer on a single computer to set values on a per-card basis. Adding the -perGpuMax parameter would simplify setup for multi-GPU Orchestrators/Transcoders that have GPUs with varying capacities, and it would be a more elegant solution than running multiple instances of Livepeer just to have a more granular method of optimizing hardware.

Additional context

This could be expanded to also allow -perGpu settings for AI-capable GPUs, to prefer GPUs with newer NVENCs for higher quality output, or to assign jobs to GPUs in order of preference, e.g. -gpuPreference 2,0,1

Sep 20 '21 17:09 papabear99

I second this. The ability to define which GPU should attempt how many max sessions without having to invoke separate transcode only processes makes management easier.

Sep 27 '21 23:09 Strykar

@papabear99

Seems like there are two issues here:

set per GPU Max
specify GPU preference

is that right?

Seems like specifying GPU preference is a rabbit hole - what if you prefer one GPU for certain job types and another for other job types?

Jan 24 '22 19:01 hthillman

I see your point regarding setting preference for different tasks, and I agree. However, I would still find it beneficial to be able to set a GPU preference that applies to all jobs, so that higher priority GPUs are filled before jobs are sent to lower priority GPUs. This matters for setups that have GPUs with different speeds.

For example, I have 3 GPUs in my O/T: 2x GTX 1070s and 1x RTX 4000. The GTX 1070s are ~33% faster than the RTX 4000, so I would prefer, and think it would benefit the network, if the 1070s were prioritized and work was only sent to the RTX 4000 when the 1070s were at their specified capacity.

Jan 25 '22 00:01 papabear99

Seems like specifying GPU preference is a rabbit hole - what if you prefer one GPU for certain job types and another for other job types?

Why so? If Livepeer will have differing payloads in the future, say Inference / Tensor / ML, besides video transcoding, O's should be able to prioritize by GPU and payload type. Enabling freedom of choice for resource allocation to O's has no downsides.

Jan 25 '22 05:01 Strykar

If Livepeer will have differing payloads in the future, say Inference / Tensor / ML, besides video transcoding, O's should be able to prioritize by GPU and payload type.

Yeah, this is exactly my point. With multiple GPUs and multiple job types, it becomes pretty complex pretty rapidly. This is absolutely where we should be headed, it's just a nontrivial challenge.

Mar 14 '22 19:03 hthillman

It's a good idea to have per-device session limits, and the -maxSessions attribute could be reused for that. It also feels like this feature would be most useful when paired with the ability to switch the load balancing strategy from 'select least loaded device' to 'select devices in the order they are specified, unless at capacity', e.g.

livepeer ... -nvidia 0,1,2 -maxSessions 20,15,10 -lbMode load|priority
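The two strategies could look roughly like the Go sketch below. This is only an illustration under stated assumptions, not go-livepeer's actual load balancer: the device type, the pick function, and the -lbMode flag are all hypothetical, and "least loaded" is assumed here to mean fewest active sessions.

```go
package main

import "fmt"

// device is a hypothetical record of one GPU's session limit and current load.
type device struct {
	id     string
	max    int // per-device session limit
	active int // sessions currently running
}

// pick returns the index of the device that should take the next session,
// or -1 if every device is at capacity.
//
//	mode "priority": first device in the listed order with spare capacity.
//	mode "load":     device with the fewest active sessions (assumed meaning
//	                 of "least loaded"); ties go to the earlier device.
func pick(devs []device, mode string) int {
	best := -1
	for i, d := range devs {
		if d.active >= d.max {
			continue // at capacity, never selected in either mode
		}
		if mode == "priority" {
			return i
		}
		if best == -1 || d.active < devs[best].active {
			best = i
		}
	}
	return best
}

func main() {
	// Limits match the example flags: -nvidia 0,1,2 -maxSessions 20,15,10
	devs := []device{
		{id: "0", max: 20, active: 20},
		{id: "1", max: 15, active: 3},
		{id: "2", max: 10, active: 1},
	}
	fmt.Println(pick(devs, "priority")) // 1: GPU 0 is full, GPU 1 is next in order
	fmt.Println(pick(devs, "load"))     // 2: GPU 2 has the fewest active sessions
}
```

In priority mode the order given to -nvidia doubles as the preference order, so no separate preference flag would be needed.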

The downside is that one needs to know how many sessions each device could handle. It can be tested with livepeer_bench for standard transcoding, but would be much harder to properly estimate for custom capabilities.

There's already basic support on O for capacity and capability-based job routing, it just requires a bit more work on T side for per-device capability accounting and capability-based load balancing.

Jun 06 '22 10:06 cyberj0g

Sounds great!

Jun 06 '22 14:06 papabear99

Thanks @cyberj0g and @papabear99, added to the list of features to evaluate for Q3

Jun 07 '22 09:06 thomshutt

livepeer ... -nvidia 0,1,2 -maxSessions 20,15,10 -lbMode load|priority

This is a great idea! Running a bench per GPU is already what most people do to compare performance of each GPU anyway.

Jun 08 '22 03:06 Titan-Node