server Raise exception when falling back to pinned memory

Open david-macleod opened this issue 1 year ago • 2 comments

Is your feature request related to a problem? Please describe.
Triton has a fallback mechanism for writing intermediates to pinned CPU memory when the CUDA memory pool is full. https://github.com/triton-inference-server/core/blob/main/src/memory.cc#L177

When using an ensemble model with large "intermediate" inputs/outputs, triggering this fallback can be catastrophic for performance, so we ensure enough CUDA memory is reserved upfront. For safety, we also monitor the logs for the warning that is emitted when the fallback is triggered.

Describe the solution you'd like
We would like an option for Triton server to raise an exception rather than automatically falling back to the next level of the memory hierarchy, so that we do not always have to wrap Triton server with log monitoring. This could potentially be a server CLI arg or an environment variable.

Describe alternatives you've considered
Continue to monitor the logs for the warning.
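
For context, the log-monitoring workaround could look roughly like the sketch below. It is only an illustration: the warning pattern is an assumption (the exact message text is not quoted in this issue), and `watch_for_fallback`, `run_and_watch`, and `PinnedMemoryFallbackError` are hypothetical helper names, not part of Triton.

```python
import re
import subprocess
import sys
from typing import Iterable, Iterator, Sequence

# ASSUMPTION: the exact warning text is not quoted in this issue.
# Adjust the pattern to whatever your Triton version actually logs.
FALLBACK_PATTERN = re.compile(r"falling back to pinned", re.IGNORECASE)


class PinnedMemoryFallbackError(RuntimeError):
    """Raised when a log line suggests the pinned-memory fallback was hit."""


def watch_for_fallback(
    lines: Iterable[str], pattern: re.Pattern = FALLBACK_PATTERN
) -> Iterator[str]:
    """Yield log lines unchanged; raise as soon as one matches the pattern."""
    for line in lines:
        if pattern.search(line):
            raise PinnedMemoryFallbackError(line.rstrip())
        yield line


def run_and_watch(cmd: Sequence[str]) -> int:
    """Run a server command, echo its logs, and stop it if the fallback is seen."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge both streams so no log line is missed
        text=True,
    )
    try:
        for line in watch_for_fallback(proc.stdout):
            sys.stdout.write(line)
    except PinnedMemoryFallbackError:
        proc.terminate()  # stop the server, then surface the error to the caller
        proc.wait()
        raise
    return proc.wait()
```

A wrapper like this would be started with the usual server command line, for example `run_and_watch(["tritonserver", "--model-repository=/models"])`, where the repository path is a placeholder. This is exactly the extra layer the requested option would make unnecessary.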

Jun 21 '23 13:06 david-macleod

Thanks for the enhancement suggestion. I have filed a ticket for us to investigate further. DLIS-5052

Jun 21 '23 18:06 kthui

Are there any developments here?

If I were to contribute this change, would it be considered? Would an environment variable or a CLI arg be more appropriate for disabling the pinned memory fallback?

Apr 19 '24 07:04 david-macleod
