Huge performance gap on Intel Mac: bare metal vs Docker
🐛 Describe the bug
The results vary greatly between the two environments:
- running directly on a MacBook Pro 2019 (bare metal)
- running on the same machine, but inside Docker 20.10.17
The run command and the .mar file are exactly the same, yet the serving TPS differs dramatically.
run on bare metal
torchserve --start --model-store model-store --models all
wrk result
wrk -c 100 -t 6 -s content-1.lua --latency http://127.0.0.1:8080/predictions/bert -d 10
Running 10s test @ http://127.0.0.1:8080/predictions/bert
6 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 204.10ms 166.13ms 1.21s 93.10%
Req/Sec 98.43 30.03 151.00 72.46%
Latency Distribution
50% 154.06ms
75% 168.14ms
90% 298.35ms
99% 1.07s
5415 requests in 10.10s, 3.56MB read
Requests/sec: 536.39
Transfer/sec: 361.44KB
run in Docker
docker run --rm -it -p 8080:8080 -p 8081:8081 \
-p 8082:8082 -p 7070:7070 -p 7071:7071 \
--name bert \
--entrypoint=bash \
-v $(pwd)/model-store:/home/model-server/model-store \
pytorch/torchserve:0.6.0-cpu
and, inside the container, run the same command:
torchserve --start --model-store model-store --models all
wrk result
wrk -c 100 -t 6 -s content-1.lua --latency http://127.0.0.1:8080/predictions/bert -d 10
Running 10s test @ http://127.0.0.1:8080/predictions/bert
6 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.42s 496.67ms 1.97s 75.00%
Req/Sec 2.62 4.42 20.00 86.00%
Latency Distribution
50% 1.84s
75% 1.89s
90% 1.96s
99% 1.97s
62 requests in 10.06s, 41.73KB read
Socket errors: connect 0, read 0, write 0, timeout 50
Requests/sec: 6.16
Transfer/sec: 4.15KB
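To rule out a configuration difference between the two runs (rather than pure environment overhead), the registered model and its workers can be compared in both environments through TorchServe's management API. This is an extra check, not part of the original report:
# Describe the registered model; the response lists minWorkers, maxWorkers
# and the running workers, so both environments can be compared directly.
curl http://127.0.0.1:8081/models/bert
# Or query the specific version packaged above.
curl http://127.0.0.1:8081/models/bert/1.0.0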
Error logs
No errors, but TPS is roughly 500/s on bare metal versus about 7/s in Docker.
Installation instructions
- bare metal: installed from source, v0.6.0
- Docker: pytorch/torchserve:0.6.0-cpu
Model Packaging
torch-model-archiver --force --model-name bert \
--version 1.0.0 \
--serialized-file models/bert_model.pt \
--extra-files ./bert_record.py,./models/content_id_map.json \
--handler bert_handler.py
config.properties
#Saving snapshot
#Wed Aug 10 10:05:54 CST 2022
inference_address=http://0.0.0.0:8080
load_models=all
model_store=model-store
async_logging=true
number_of_gpu=0
job_queue_size=1000
python=/Users/eric/code/re/re-serving/venv/bin/python
model_snapshot={\n "name": "20220810100554930-startup.cfg",\n "modelCount": 1,\n "created": 1660097154931,\n "models": {\n "bert": {\n "1.0.0": {\n "defaultVersion": true,\n "marName": "bert.mar",\n "minWorkers": 12,\n "maxWorkers": 12,\n "batchSize": 1,\n "maxBatchDelay": 100,\n "responseTimeout": 120\n }\n }\n }\n}
tsConfigFile=logs/config/20220810100550643-shutdown.cfg
version=0.6.0
workflow_store=model-store
number_of_netty_threads=32
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
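The snapshot pins minWorkers and maxWorkers to 12. On macOS, Docker Desktop runs containers inside a VM that is often allocated fewer CPUs than the host has, so it may be worth confirming how many CPUs the container actually sees; this is an extra check based on that assumption, not something reported in the issue:
# CPUs available to the Docker daemon (i.e. the Docker Desktop VM)
docker info --format '{{.NCPU}}'
# CPUs visible inside the running container (container name from the docker run above)
docker exec bert nproc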
Versions
torch==1.12.0
torch-model-archiver==0.6.0
torch-workflow-archiver==0.2.4
torchserve==0.6.0
Repro instructions
Run the same .mar model in the two environments and compare wrk performance:
wrk -c 100 -t 6 -s content-1.lua --latency http://127.0.0.1:8080/predictions/bert -d 10
Possible Solution
No response
@lxning help
@yayuntian Can you provide the reproduction steps, particularly regarding the model (--serialized-file models/bert_model.pt --extra-files ./bert_record.py,./models/content_id_map.json)? I want to get a breakdown of where the latency is added: is it just the model, or also the server frontend?
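One way to get that breakdown (a suggestion, not something done in this thread) is to scrape TorchServe's metrics endpoint in both environments and compare the frontend queue latency with the backend inference latency, e.g. the ts_queue_latency_microseconds and ts_inference_latency_microseconds series:
# Prometheus-format metrics on the metrics_address port from config.properties
curl http://127.0.0.1:8082/metrics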
@yayuntian This is not a PyTorch issue. On macOS, Docker uses QEMU under the hood to emulate linux/aarch64, which is why performance drops. The only way around that is to not use Docker.
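Whether the container is actually being emulated can be checked directly; the commands below are a verification sketch using the image and container name from the report:
# Architecture the pulled image was built for
docker image inspect pytorch/torchserve:0.6.0-cpu --format '{{.Os}}/{{.Architecture}}'
# Architecture the kernel reports inside the running container
docker exec bert uname -m
# Platform of the Docker daemon (the Docker Desktop VM)
docker version --format '{{.Server.Os}}/{{.Server.Arch}}'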