Low GPU utilization due to CPU-bound preprocessing
I am running TorchServe with batch size = 32 and max batch delay = 30 ms.
My preprocessing is CPU-bound and my inference is GPU-bound: the GPU cannot start until the batch has been prepared on the CPU.
Currently, this leads to a serialized workflow where each stage blocks on the previous one:
- Wait for the batch to accumulate on the frontend
- Preprocessing (CPU-bound)
- Inference (GPU-bound)
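A minimal sketch of this serialized flow, with sleeps standing in for the CPU- and GPU-bound work (`preprocess`, `infer`, and `handle` are hypothetical stand-ins here, not TorchServe APIs):

```python
import time

def preprocess(item):
    # Hypothetical CPU-bound step, simulated with a sleep.
    time.sleep(0.01)
    return item * 2

def infer(batch):
    # Hypothetical GPU-bound step, simulated with a sleep.
    time.sleep(0.05)
    return [x + 1 for x in batch]

def handle(batch):
    # Serialized: inference cannot start until every item is preprocessed,
    # so the "GPU" sits idle for the entire preprocessing loop.
    tensors = [preprocess(item) for item in batch]
    return infer(tensors)

results = handle(list(range(8)))
# Total latency is roughly 8 * 10 ms (CPU) + 50 ms (GPU); the two stages never overlap.
```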
Problem
I am getting rather low GPU utilization, because the GPU is idle while the batch is being prepared on the CPU.
What I tried
- Running multiple workers: helps, but is limited by the number of cores and by GPU memory.
- Using a thread pool for preprocessing: helps, but requires having at least 2-3x as many cores as workers to avoid contention.
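For reference, the thread-pool variant I mean looks roughly like this (`preprocess` and `preprocess_batch` are hypothetical stand-ins; note that for pure-Python preprocessing the GIL limits the gain, so the threads mostly help when the work releases the GIL, e.g. NumPy, Pillow, or torchvision ops):

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(item):
    # Hypothetical CPU-bound transform (e.g. image decode/resize).
    return item * 2

# Rule of thumb from above: keep max_workers well below the core count
# when several model workers share the machine, to avoid contention.
pool = ThreadPoolExecutor(max_workers=4)

def preprocess_batch(batch):
    # map() fans the items out across the pool and preserves input order,
    # so the result can be stacked into a batch tensor afterwards.
    return list(pool.map(preprocess, batch))
```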
Question
How can I increase GPU utilization given that I need to wait for the preprocessing on the CPU? Are there any best practices or rules of thumb for this case?
Idea
Start processing the batch as it is being built up on the frontend, instead of sitting idle until the entire batch is ready:
- Start accumulating a new batch
- Immediately call handle() with a generator rather than waiting for the batch to accumulate
- Start preprocessing on the CPU from the generator (blocking while payloads are not yet available)
- When the generator is exhausted, pass the entire batch of tensors to the GPU and run inference.
I don't know whether this idea is possible without major changes to the core, but I'm putting it out there.