llama-stack
fix: Run prompt_guard model in a separate thread
The GPU model invocation blocks the CPU event loop. Move it to its own thread, and wrap the call in a lock to prevent multiple simultaneous runs from exhausting the GPU.
Closes: #1746
What does this PR do?
Runs the prompt_guard model in its own thread, protected by a lock.
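A minimal sketch of the approach, assuming an asyncio-based service; the helper names (`_run_prompt_guard_sync`, `run_prompt_guard`) and the model/tokenizer call shape are hypothetical and will differ from the actual shield code in this PR:

```python
import asyncio
import threading

# Hypothetical module-level lock; serializes GPU access across requests.
_model_lock = threading.Lock()

def _run_prompt_guard_sync(model, tokenizer, text: str):
    # Hold the lock so concurrent requests cannot exhaust GPU memory.
    with _model_lock:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        return model(**inputs)

async def run_prompt_guard(model, tokenizer, text: str):
    # Off-load the blocking GPU call to a worker thread so the asyncio
    # event loop (and other inference requests) keeps making progress.
    return await asyncio.to_thread(_run_prompt_guard_sync, model, tokenizer, text)
```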
Test Plan
Tested locally by sending multiple simultaneous requests. Without the patch, inference was delayed while the safety shields were running; with the patch, inference is no longer delayed.