
Low GPU utilization due to CPU-bound preprocessing

assapin opened this issue · 2 comments

I am running TorchServe with batch size = 32 and max batch delay = 30 ms.

My preprocessing is CPU bound and my inference is GPU bound. The GPU cannot start until the batch is ready on the CPU.

Currently, this leads to a serialized workflow where each stage blocks on the previous one:

  • Wait for batch to accumulate in the "front end"
  • preprocessing - CPU bound
  • inference - GPU bound
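
To make the idle time concrete, here is a minimal stand-in sketch of that serialized workflow (simulated stages, no real GPU; the function names and string payloads are placeholders, not TorchServe APIs):

```python
def accumulate_batch(n=32):
    # frontend: wait for up to n requests (or the batch delay) -- simulated
    return [f"req-{i}" for i in range(n)]

def preprocess(batch):
    # CPU-bound stage; the GPU sits idle for its entire duration
    return [item.upper() for item in batch]

def infer(tensors):
    # GPU-bound stage; the CPU workers sit idle here
    return [f"pred:{t}" for t in tensors]

batch = accumulate_batch()      # stage 1: accumulate
tensors = preprocess(batch)     # stage 2: GPU idle
preds = infer(tensors)          # stage 3: CPU idle
```

Each stage blocks on the previous one, so neither device is busy while the other works.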

Problem

I am getting rather low GPU utilization, because the GPU is idle while the batch is being prepared on the CPU.

What I tried

  • Running multiple workers: helps, but is limited by the number of CPU cores and by GPU memory.
  • Using a thread pool for preprocessing: helps, but requires roughly 2-3x more cores than workers to avoid contention.
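
The thread-pool variant can be sketched like this (a hedged illustration with placeholder work, not the actual handler code; `max_workers=8` is an arbitrary choice):

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess_one(payload):
    # stand-in for CPU-bound per-item work (decode, resize, tokenize, ...)
    return payload.upper()

batch = [f"req-{i}" for i in range(32)]

# Fan the per-item preprocessing out across threads. Contention appears
# once the busy threads across all workers approach the physical core count.
with ThreadPoolExecutor(max_workers=8) as pool:
    tensors = list(pool.map(preprocess_one, batch))
```

Note that pure-Python work holds the GIL, so threads mostly help when the preprocessing releases it (NumPy, Pillow, OpenCV, etc.); otherwise a process pool is the usual fallback.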

Question

How can I increase GPU utilization given that I need to wait for the pre-processing on the CPU? Any best practice or rules of thumb for this case?

Idea

Start processing the batch as it is being built up on the frontend, instead of idling until the entire batch is ready:

  • Start accumulating a new batch
  • Immediately call handle() with a generator rather than wait for the batch to accumulate
  • Start preprocessing on the CPU from the generator (block as long as payloads are not yet available)
  • When generator is exhausted, pass the entire batch of tensors to the GPU and infer.
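
The steps above could be sketched roughly like this (a toy model of the idea using a queue as the frontend; everything here is hypothetical, not existing TorchServe internals):

```python
import queue
import threading

def request_stream(q):
    # handle() receives this generator; it blocks whenever no payload is ready
    while True:
        item = q.get()
        if item is None:        # sentinel: the frontend closed the batch
            return
        yield item

def preprocess_streaming(gen):
    # preprocess each payload as soon as it arrives, overlapping with accumulation
    return [payload.upper() for payload in gen]

q = queue.Queue()

def frontend():
    for i in range(32):
        q.put(f"req-{i}")       # requests trickle in over the batch delay
    q.put(None)                 # batch of 32 is complete

threading.Thread(target=frontend).start()
tensors = preprocess_streaming(request_stream(q))
# only once the generator is exhausted does the full batch go to the GPU:
preds = [f"pred:{t}" for t in tensors]
```

By the time the last request lands, most of the CPU work is already done, so the GPU's idle window shrinks to roughly one item's preprocessing time.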

I don't know if this idea is possible without major changes in the core, but I'm putting it out there.

assapin · Jan 20 '24 14:01

@assapin could you try 2 things?

  1. export OMP_NUM_THREADS=1
  2. apply microbatching
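
The core effect of micro-batching can be sketched as follows: split the batch of 32 into smaller micro-batches and overlap the CPU preprocessing of the next micro-batch with the (simulated) GPU inference of the current one. This is an illustrative toy, not TorchServe's actual micro-batching handler, and the micro-batch size of 8 is an arbitrary choice:

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(micro):           # CPU-bound stage (placeholder work)
    return [x.upper() for x in micro]

def infer(micro):                # GPU-bound stage (placeholder work)
    return [f"pred:{x}" for x in micro]

batch = [f"req-{i}" for i in range(32)]
micro_size = 8
micros = [batch[i:i + micro_size] for i in range(0, len(batch), micro_size)]

preds = []
with ThreadPoolExecutor(max_workers=1) as cpu:
    pending = cpu.submit(preprocess, micros[0])
    for nxt in micros[1:]:
        ready = pending.result()
        # kick off preprocessing of the next micro-batch, then "infer"
        # the current one; the two now run concurrently
        pending = cpu.submit(preprocess, nxt)
        preds += infer(ready)
    preds += infer(pending.result())
```

With k micro-batches, the GPU only waits for the first one instead of the whole batch; after that the CPU and GPU stages run in parallel.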

lxning · Jan 22 '24 21:01

Use GPU preprocess and pipeline parallelism if you can.

for instance, DALI, CVCUDA, tp
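
Independent of which library does the GPU-side preprocessing, the pipeline-parallel shape looks roughly like this queue-connected sketch (stages and payloads are simulated stand-ins; in practice the preprocess stage would be DALI/CVCUDA kernels and the bounded queue provides backpressure):

```python
import queue
import threading

def preprocess_stage(inq, outq):
    while True:
        item = inq.get()
        if item is None:
            outq.put(None)            # forward the shutdown sentinel
            return
        outq.put(item.upper())        # stand-in for GPU decode/resize

def infer_stage(outq, results):
    while True:
        item = outq.get()
        if item is None:
            return
        results.append(f"pred:{item}")  # stand-in for model inference

inq = queue.Queue()
midq = queue.Queue(maxsize=4)         # bounded: backpressure between stages
results = []

t1 = threading.Thread(target=preprocess_stage, args=(inq, midq))
t2 = threading.Thread(target=infer_stage, args=(midq, results))
t1.start(); t2.start()

for i in range(32):
    inq.put(f"req-{i}")
inq.put(None)
t1.join(); t2.join()
```

While one item is in inference, the next is already being preprocessed, so neither stage waits for the other to drain.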

tp-nan · Jan 24 '24 05:01