Drogon-torch-serve
Serve PyTorch / Torch models using Drogon
C++ Torch Server
Serve Torch models as a REST API using Drogon; an example is included for a ResNet-18 ImageNet model. Benchmarks against a FastAPI + PyTorch baseline (see below) show a ~6-10x improvement in throughput and latency for ResNet-18 at peak load.
Build & Run Instructions
# Create Optimized models for your machine.
$ python3 optimize_model_for_inference.py
# Build and Run Server
$ docker compose run --service-ports blaze
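For reference, a minimal sketch of what an export step like `optimize_model_for_inference.py` typically does: trace the torchvision ResNet-18 in eval mode and save it as TorchScript so the server can load it through libtorch. The weights tag, output filename, and the absence of FP16/CUDA handling here are assumptions; the repository's script may differ.

```python
# Hypothetical sketch only; the real optimize_model_for_inference.py may differ
# (e.g. FP16 weights, CUDA placement, channels-last, or a different output path).
import torch
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
example = torch.randn(1, 3, 224, 224)

traced = torch.jit.trace(model, example)   # TorchScript via tracing
frozen = torch.jit.freeze(traced.eval())   # inline weights & drop training-only paths
frozen.save("resnet18_optimized.pt")       # loadable from C++ via torch::jit::load
```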
Development
- Add Docker to the CLion toolchain; this will set up all necessary dependencies.
Client Instructions
curl "localhost:8088/classify" -F "image=@images/cat.jpg"
Benchmarking Instructions
# Drogon + libtorch
for i in {0..8}; do curl "localhost:8088/classify" -F "image=@images/cat.jpg"; done # Run once to warm up.
wrk -t8 -c100 -d60 -s benchmark/upload.lua "http://localhost:8088/classify" --latency
# FastAPI + pytorch
cd benchmark/python_fastapi
python3 -m venv env
source env/bin/activate
python3 -m pip install -r requirements.txt # Run just once to install dependencies into the venv.
gunicorn main:app -w 2 -k uvicorn.workers.UvicornWorker --bind 127.0.0.1:8088 # 2 workers gave the best performance on my machine; 3 and 4 were also tried.
deactivate # Run after benchmarking is done and gunicorn has been stopped.
cd ../.. # back to root folder
for i in {0..8}; do curl "localhost:8088/classify" -F "image=@images/cat.jpg"; done
wrk -t8 -c100 -d60 -s benchmark/fastapi_upload.lua "http://localhost:8088/classify" --latency
Benchmarking Results
Drogon + libtorch
# OS: Ubuntu 21.10 x86_64
# Kernel: 5.15.14-xanmod1
# CPU: AMD Ryzen 9 5900X (24) @ 3.700GHz
# GPU: NVIDIA GeForce RTX 3070
Running 1m test @ http://localhost:8088/classify
  8 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    39.30ms   10.96ms  95.51ms   70.50%
    Req/Sec    306.58     28.78   390.00     70.92%
  Latency Distribution
     50%   37.40ms
     75%   45.69ms
     90%   54.57ms
     99%   69.34ms
  146612 requests in 1.00m, 30.34MB read
Requests/sec:   2441.60
Transfer/sec:    517.41KB
FastAPI + pytorch
# OS: Ubuntu 21.10 x86_64
# Kernel: 5.15.14-xanmod1
# CPU: AMD Ryzen 9 5900X (24) @ 3.700GHz
# GPU: NVIDIA GeForce RTX 3070
Running 1m test @ http://localhost:8088/classify
  8 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   449.50ms  239.30ms    1.64s   70.39%
    Req/Sec     33.97     26.41   121.00     83.46%
  Latency Distribution
     50%  454.64ms
     75%  570.73ms
     90%  743.54ms
     99%    1.16s
  12981 requests in 1.00m, 2.64MB read
Requests/sec:    216.13
Transfer/sec:     44.96KB
Architecture
- API request handling and model pre-processing in the Drogon controller controllers/ImageClass.cc
- Batched model inference logic & post-processing in lib/ModelBatchInference.cpp (the batching pattern is sketched conceptually below)
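For orientation, here is a purely illustrative Python sketch of the batching pattern that lib/ModelBatchInference.cpp implements in C++: request handlers enqueue tensors, a worker thread gathers them into a batch, runs one forward pass, and scatters results back. Class and parameter names are made up for the sketch and are not the repository's actual interface.

```python
# Illustrative sketch of batched inference; not the repository's actual C++ code.
import queue
import threading
from concurrent.futures import Future

import torch


class BatchedInference:
    def __init__(self, model, max_batch=8, wait_s=0.005):
        self.model = model.eval()
        self.pending = queue.Queue()   # (input tensor, Future) pairs from request handlers
        self.max_batch = max_batch
        self.wait_s = wait_s
        threading.Thread(target=self._worker, daemon=True).start()

    def infer(self, image_tensor):
        """Called per request; blocks until the batched result is ready."""
        fut = Future()
        self.pending.put((image_tensor, fut))
        return fut.result()

    def _worker(self):
        while True:
            # Take one request, then wait briefly for more to fill the batch.
            items = [self.pending.get()]
            try:
                while len(items) < self.max_batch:
                    items.append(self.pending.get(timeout=self.wait_s))
            except queue.Empty:
                pass
            batch = torch.stack([t for t, _ in items])   # N x 3 x 224 x 224
            with torch.inference_mode():
                logits = self.model(batch)
            for (_, fut), row in zip(items, logits):
                fut.set_result(int(row.argmax()))        # post-processing: top-1 class id
```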
TODOs
- [x] Multithreaded batched inference
- [x] FP16 Inference
- [x] Use C++20 coroutines for wait-free event loop tasks
- [x] Add compiler optimizations to CMake.
- [x] Benchmark optimizations like channels-last, ONNX, and TensorRT, and report what's faster.
- [x] ~~Pin the batched tensor used for inference to memory and re-use it at every inference.~~ No improvement.
- [ ] Use Torch-TensorRT for inference, fastest on CUDA devices. Cuts inference down from 5ms to 1-2ms.
- [ ] Use Torch nvJPEG for faster image decoding; ~2ms is currently spent on this call with libjpeg-turbo.
- [ ] Int8 inference using FX Graph post-training quantization, ResNet Int8 quantization example1, example2 (a hedged sketch follows this list)
- [ ] Benchmark framework against mosec
- [ ] Use lock-free queues
- [ ] Separate pre-processing, inference, and post-processing.
- [x] Added address & memory leak sanitizers to CMake.
- [x] Dockerize for easy usage.
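The Int8 item above is still open; purely as a hedged sketch (not part of this repository), FX Graph post-training static quantization in PyTorch >= 1.13 looks roughly like this, with random tensors standing in for a real calibration set:

```python
# Hypothetical sketch for the Int8 FX Graph post-training quantization TODO;
# not implemented in this repository. Requires torch >= 1.13 for torch.ao.quantization.
import torch
import torchvision
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

qconfig_mapping = get_default_qconfig_mapping("fbgemm")         # x86 server backend
prepared = prepare_fx(model, qconfig_mapping, example_inputs)   # insert observers

# Calibrate on representative batches (random tensors used here only as placeholders).
with torch.no_grad():
    for _ in range(10):
        prepared(torch.randn(8, 3, 224, 224))

quantized = convert_fx(prepared)                                # int8 model
torch.jit.script(quantized).save("resnet18_int8.pt")
```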
Notes
- WIP: just gets the job done for now; not production-ready, though tested regularly.