llama-stack
fix: Run prompt_guard model in a separate thread
The GPU model invocation blocks the CPU event loop. Move it to its own thread, and wrap the call in a lock to prevent multiple simultaneous runs from exhausting the GPU.
Closes: #1746
What does this PR do?
Runs the prompt_guard model in its own thread, protected by a lock.
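A minimal sketch of the approach, assuming an asyncio-based service; the helper names (`_run_prompt_guard_sync`, `run_prompt_guard`) and the model/tokenizer call shape are hypothetical and will differ from the actual shield code in this PR:

```python
import asyncio
import threading

# Hypothetical module-level lock; serializes GPU access across requests.
_model_lock = threading.Lock()

def _run_prompt_guard_sync(model, tokenizer, text: str):
    # Hold the lock so concurrent requests cannot exhaust GPU memory.
    with _model_lock:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        return model(**inputs)

async def run_prompt_guard(model, tokenizer, text: str):
    # Off-load the blocking GPU call to a worker thread so the asyncio
    # event loop (and other inference requests) keeps making progress.
    return await asyncio.to_thread(_run_prompt_guard_sync, model, tokenizer, text)
```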
Test Plan
Tested locally by sending multiple simultaneous requests. Without the patch, inference was delayed while the safety shields were running; with the patch, inference is no longer delayed.