onnxruntime_backend
Half of CPU threads not utilized when running GPU model
Description
I noticed a pattern in CPU utilization when I ran the same GPU model on two VMs:
- both with one T4 GPU; one with 16 CPU cores and one with 8 cores.
- Standard_NC16as_T4_v3 and Standard_NC8as_T4_v3
- https://docs.microsoft.com/en-us/azure/virtual-machines/nct4-v3-series
When I run the same model, the 16-core machine shows 9/16 cores used by Triton server (the number of Triton processes is 10), while the 8-core machine shows 5/8 cores used. At the same time, the throughput I get from the 8-core machine is half that of the 16-core machine.
It looks like Triton server is programmatically limited to utilizing only about half of the CPU threads. This points to a CPU bottleneck, because both setups have exactly the same GPU.
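One thing worth ruling out is an explicit thread-count setting in the model configuration. The Triton ONNX Runtime backend lets you pin the ORT intra/inter-op thread pools via `parameters` in `config.pbtxt`; a sketch (the value `0` here means "let ORT decide", and is just an example, not my actual config):

```proto
# config.pbtxt for the ONNX model (fragment)
parameters { key: "intra_op_thread_count" value: { string_value: "0" } }
parameters { key: "inter_op_thread_count" value: { string_value: "0" } }
```

If these were set to a fixed value somewhere, the same cap would apply on both VMs, which could produce the pattern above.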
The throughput on the 16-core machine:
Running 1m test @ http://127.0.0.1:5001/score
16 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 95.24ms 38.37ms 675.90ms 93.78%
Req/Sec 11.16 3.76 20.00 81.36%
10343 requests in 1.00m, 4.93MB read
Requests/sec: 172.12
Transfer/sec: 84.03KB
The throughput on the 8-core machine (note the RPS is halved):
Running 1m test @ http://127.0.0.1:5001/score
16 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 212.91ms 69.24ms 918.24ms 78.50%
Req/Sec 5.35 2.35 20.00 79.50%
4534 requests in 1.00m, 2.16MB read
Requests/sec: 75.45
Transfer/sec: 36.84KB
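A quick sanity check on the wrk numbers above: the throughput ratio between the two machines roughly tracks the core-count ratio, which is consistent with a CPU-side bottleneck rather than the (identical) GPU.

```python
# Throughput numbers taken from the two wrk runs above.
rps_16_core = 172.12
rps_8_core = 75.45

throughput_ratio = rps_16_core / rps_8_core  # observed scaling
core_ratio = 16 / 8                          # hardware scaling

print(f"throughput ratio: {throughput_ratio:.2f}, core ratio: {core_ratio:.1f}")
```

The observed ratio (~2.28) is close to the 2x core ratio, so performance scales with CPU count even though the GPU is the same.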
Triton Information
- Triton version: 21.12
- using the Triton container directly, not building it ourselves
To Reproduce
The pattern is deterministically reproducible when sending the same queries to the GPU model served by Triton with the ORT backend. When running the model directly in ORT, there is no such pattern.
worker_count=4
docker run -d -v $(pwd):/var/azureml-app --gpus=all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p5001:5001 -p8002:8002 -e "OMP_WAIT_POLICY=PASSIVE" -e "AZUREML_MODEL_DIR=model_dir" -e "WORKER_COUNT=${worker_count}" -e "WORKER_PRELOAD=false" -e 'AZUREML_ENTRY_SCRIPT=fluency_score.py' -e "AZUREML_EXTRA_REQUIREMENTS_TXT=requirements.txt" shmaheacr.azurecr.io/tritonserver-inference:21.12-triton-flag
wrk -s ~/post_prepost_multi_input.lua -c 8 -t 8 -d 1m http://127.0.0.1:5001/score
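To see how many OS threads the server actually spawns relative to the available cores, something like the following can be run on the host while the benchmark is going (a rough diagnostic; it falls back to the current shell's PID so the snippet still runs when tritonserver is not found):

```shell
# Count OS threads (NLWP) for the tritonserver process and compare to core count.
# Assumption: tritonserver is running on this host; otherwise fall back to the
# current shell's PID for demonstration purposes.
pid=$(pgrep -o tritonserver || echo $$)
nlwp=$(ps -o nlwp= -p "$pid" | tr -d ' ')
echo "process $pid has $nlwp threads on $(nproc) cores"
```

Combining this with `mpstat -P ALL 1` shows whether the idle cores are truly unused or just lightly loaded.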
Expected behavior
Regardless of how many CPU cores the hardware has, Triton should be able to utilize all of them; the fraction of utilized cores should not be capped at roughly half.