# triton-grpc-proxy-rs
A proxy server for the Triton gRPC server that runs embedding-model inference, written in Rust.
- It refines the request and response formats of the Triton server.
- No `tritonclient` dependency.
- Fast & easy to use.
## Build
### 1. Convert the embedding model to ONNX

- `BAAI/bge-large-en-v1.5` is used as an example.
- It converts the PyTorch model to ONNX and saves it to `./model_repository/embedding/1/v1.onnx`.
- Currently, `max_batch_size` is limited to `256` due to OOM. You can change this value to fit your environment.

`python3 convert.py`
### 2. Run docker-compose

- It runs both the Triton inference server and the proxy server.
- You need to edit the absolute path of the volume (which points to `./model_repository`) in `docker-compose.yml`.

`make run-docker-compose`
Build & run a proxy server only
- You can also build and run a triton proxy server with the below command.
 
export RUSTFLAGS="-C target-feature=native"
make server
make build-docker
Build & run triton inference server only
docker run --gpus all --rm --ipc=host --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 -v$(pwd)triton-grpc-proxy-rs/model_repository:/models nvcr.io/nvidia/tritonserver:23.09-py3 bash -c "LD_PRELOAD=/usr/lib/$(uname -m)-linux-gnu/libtcmalloc.so.4:${LD_PRELOAD} && pip install transformers tokenizers && tritonserver --model-repository=/models"
## Architecture

- Receive request(s) from the user.
  - a list of `text (String)` in this case.
- Request the Triton gRPC server to get the embeddings.
- Post-process (cast and reshape) the embeddings and return them to the users.
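Below is a minimal sketch of that flow, assuming hypothetical type, trait, and function names (the actual handler, gRPC client wrapper, and web-framework glue live in the project's source):

```rust
use std::error::Error;

// Illustrative request/response shapes; names are assumptions, not the real types.
struct QueryRequest {
    query: String,
}

struct EmbeddingResponse {
    embedding: Vec<f32>,
}

// Stand-in for the generated Triton gRPC client: sends a batch of texts and
// returns the raw embedding tensor as little-endian f32 bytes.
trait TritonClient {
    fn infer(&self, texts: &[String]) -> Result<Vec<u8>, Box<dyn Error>>;
}

fn handle_embedding(
    client: &dyn TritonClient,
    requests: Vec<QueryRequest>,
    embedding_size: usize,
) -> Result<Vec<EmbeddingResponse>, Box<dyn Error>> {
    // 1. Collect the input texts from the request payload.
    let texts: Vec<String> = requests.into_iter().map(|r| r.query).collect();

    // 2. Ask the Triton gRPC server for the raw embedding tensor.
    let raw = client.infer(&texts)?;

    // 3. Post-process: cast the raw bytes to f32 ...
    let floats: Vec<f32> = raw
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect();

    // ... and reshape the flat vector into one row per input text.
    Ok(floats
        .chunks(embedding_size)
        .map(|row| EmbeddingResponse { embedding: row.to_vec() })
        .collect())
}
```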
 
## API Specs
### Configs

- Configuration is parsed from the environment variables.
  - `SERVER_PORT`: proxy server port. default `8080`.
  - `TRITON_SERVER_URL`: Triton inference gRPC server URL. default `http://triton-server`.
  - `TRITON_SERVER_GRPC_PORT`: Triton inference gRPC server port. default `8001`.
  - `MODEL_VERSION`: model version. default `1`.
  - `MODEL_NAME`: model name. default `model`.
  - `INPUT_NAME`: input name. default `text`.
  - `OUTPUT_NAME`: output name. default `embedding`.
  - `EMBEDDING_SIZE`: size of the embedding. default `2048`.
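As a rough illustration of how such env-based configuration can be read with defaults in Rust (the helper below is an assumption for demonstration, not the project's actual config code):

```rust
use std::env;

/// Read an environment variable, falling back to a default value.
fn env_or(key: &str, default: &str) -> String {
    env::var(key).unwrap_or_else(|_| default.to_string())
}

fn main() {
    let server_port: u16 = env_or("SERVER_PORT", "8080").parse().expect("invalid SERVER_PORT");
    let triton_url = env_or("TRITON_SERVER_URL", "http://triton-server");
    let triton_grpc_port: u16 = env_or("TRITON_SERVER_GRPC_PORT", "8001").parse().expect("invalid port");
    let model_name = env_or("MODEL_NAME", "model");
    let embedding_size: usize = env_or("EMBEDDING_SIZE", "2048").parse().expect("invalid EMBEDDING_SIZE");

    println!(
        "proxy :{server_port} -> {triton_url}:{triton_grpc_port}, model={model_name}, dim={embedding_size}"
    );
}
```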
### health

- GET `/health`

curl -i http://127.0.0.1:8080/health
HTTP/1.1 200 OK
content-length: 2
date: Sun, 08 Oct 2023 06:33:53 GMT
ok
### embedding

- POST `/v1/embedding`
- Request Body: `[{'query': 'input'}, ...]`

curl -H "Content-type:application/json" -X POST http://127.0.0.1:8080/v1/embedding -d "[{\"query\": \"asdf\"}, {\"query\": \"asdf asdf\"}, {\"query\": \"asdf asdf asdf\"}, {\"query\": \"asdf asdf asdf asdf\"}]"
- Response Body: `[{'embedding': '1024 f32 vector'}, ...]`

[{"embedding": [-0.8067292,-0.004603,-0.24123234,0.59398544,-0.5583446,...]}, ...]
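For reference, a small client-side call to this endpoint might look like the following sketch (it assumes the `reqwest` and `serde_json` crates with the `blocking` and `json` features enabled; these are not dependencies of this repository):

```rust
use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Same payload shape as the curl example above.
    let body = json!([
        { "query": "asdf" },
        { "query": "asdf asdf" }
    ]);

    // POST to the proxy and parse the JSON response.
    let resp: Value = reqwest::blocking::Client::new()
        .post("http://127.0.0.1:8080/v1/embedding")
        .json(&body)
        .send()?
        .json()?;

    // Each element carries an f32 embedding vector under the "embedding" key.
    println!("{} embeddings returned", resp.as_array().map_or(0, |a| a.len()));
    Ok(())
}
```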
## Benchmark

- Environment
  - CPU : i7-7700K (not overclocked)
  - GPU : GTX 1060 6 GB
  - Rust : v1.73.0 stable
  - Triton Server : `23-09-py3`
    - backend : onnxruntime-gpu
    - allocator : tcmalloc
- payload : `[{'query': 'asdf' * 125}] * batch_size`
- stages
  - request : end-to-end latency (client-side)
  - model : Triton gRPC server latency only (preprocess + tokenize + model)
  - processing : request latency minus model latency
    - JSON de/serialization
    - serialization (byte string, float vector)
    - cast & reshape of 2-D vectors

| batch size | request | model | processing | 
|---|---|---|---|
| 8 | 27.2 ms | 25.4 ms | 1.8 ms | 
| 16 | 36.0 ms | 33.7 ms | 2.3 ms | 
| 32 | 50.6 ms | 47.3 ms | 3.3 ms | 
| 64 | 90.9 ms | 85.5 ms | 5.4 ms | 
| 128 | 139.2 ms | 129.9 ms | 9.3 ms | 
| 256 | 307.4 ms | 287.1 ms | 20.3 ms | 
## To-Do

- [x] add `Dockerfile` and `docker-compose` to easily deploy the servers
  - [x] triton inference server
- [x] add model converter script
- [x] configurations
  - [x] move hard-coded configs to `env`
- [x] optimize the proxy server performance
- [x] README
- [ ] move `tokenizer` part from triton server into `proxy-server`