llama-stack

fix: Run prompt_guard model in a separate thread

Open · derekhiggins opened this issue 7 months ago · 1 comment

Running the model on the GPU blocks the CPU event loop, so move the call into its own thread. Also wrap it in a lock to prevent multiple simultaneous runs from exhausting GPU memory.

Closes: #1746

What does this PR do?

Runs the prompt_guard model in its own thread, protected by a lock.

Test Plan

Tested locally with multiple simultaneous requests. Without the patch, inference was delayed while the safety shields were running; with the patch, inference is no longer delayed.

derekhiggins · Mar 21 '25 14:03