
Half of CPU threads not utilized when running GPU model

wilsoncai1992 opened this issue on Jan 26, 2022 · 0 comments

Description

I noticed a pattern in CPU utilization when I ran the same GPU model on two VMs:

  • both with 1 T4 GPU; one with 16 cores and one with 8 cores
  • Standard_NC16as_T4_v3 and Standard_NC8as_T4_v3
  • https://docs.microsoft.com/en-us/azure/virtual-machines/nct4-v3-series

When I run the same model, the 16-core machine shows 9/16 cores used by the Triton server (10 Triton processes), while the 8-core machine shows 5/8 cores used. At the same time, the throughput I get from the 8-core machine is half that of the 16-core machine.

It looks as if Triton server is programmatically limited to using only about half of the available CPU threads. This points to a CPU bottleneck, since both setups have exactly the same GPU.
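A quick way to confirm this beyond the screenshots below is to watch per-core utilization and count the Triton server's OS threads while the benchmark runs. A minimal sketch (assumes the sysstat package for mpstat, and that the server process inside the container is named tritonserver):

# Per-core utilization, sampled once per second for 10 seconds
mpstat -P ALL 1 10

# Number of OS threads (nlwp) owned by each tritonserver process
pgrep -x tritonserver | xargs -I{} ps -o pid=,nlwp= -p {}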

[screenshot: CPU utilization on the 16-core machine]

[screenshot: CPU utilization on the 8-core machine]

Throughput on the 16-core machine:

Running 1m test @ http://127.0.0.1:5001/score
  16 threads and 16 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    95.24ms   38.37ms 675.90ms   93.78%
    Req/Sec    11.16      3.76    20.00     81.36%
  10343 requests in 1.00m, 4.93MB read
Requests/sec:    172.12
Transfer/sec:     84.03KB

Throughput on the 8-core machine (note that the RPS roughly halves):

Running 1m test @ http://127.0.0.1:5001/score
  16 threads and 16 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   212.91ms   69.24ms 918.24ms   78.50%
    Req/Sec     5.35      2.35    20.00     79.50%
  4534 requests in 1.00m, 2.16MB read
Requests/sec:     75.45
Transfer/sec:     36.84KB

Triton Information

  1. Triton version: 21.12
  2. Using the Triton container directly, not building it ourselves

To Reproduce

The pattern reproduces deterministically when sending the same queries to the GPU Triton model with the ORT backend. When running the model directly in ORT (outside Triton), there is no such pattern.

worker_count=4
docker run -d -v $(pwd):/var/azureml-app --gpus=all --shm-size=1g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -p5001:5001 -p8002:8002 \
  -e "OMP_WAIT_POLICY=PASSIVE" \
  -e "AZUREML_MODEL_DIR=model_dir" \
  -e "WORKER_COUNT=${worker_count}" \
  -e "WORKER_PRELOAD=false" \
  -e 'AZUREML_ENTRY_SCRIPT=fluency_score.py' \
  -e "AZUREML_EXTRA_REQUIREMENTS_TXT=requirements.txt" \
  shmaheacr.azurecr.io/tritonserver-inference:21.12-triton-flag

wrk -s ~/post_prepost_multi_input.lua -c 8 -t 8 -d 1m http://127.0.0.1:5001/score
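One knob worth ruling out is the ONNX Runtime thread-pool size: the ORT backend lets you pin the intra-op and inter-op thread counts via parameters in the model's config.pbtxt, which would show whether the half-utilization tracks ORT's default pool sizing. A minimal sketch, where the model directory my_model and the counts are illustrative assumptions, not from the original report:

# Append explicit ORT thread-pool settings to the model config (hypothetical path)
cat >> model_dir/my_model/config.pbtxt <<'EOF'
parameters { key: "intra_op_thread_count" value: { string_value: "8" } }
parameters { key: "inter_op_thread_count" value: { string_value: "1" } }
EOF

If the benchmark numbers change when intra_op_thread_count is set to the full logical core count, that would point at the default thread-pool sizing rather than Triton itself.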

Expected behavior

Regardless of how many hardware CPUs the machine has, the proportion of cores Triton can utilize should not differ.
