transformer-deploy
Two GPUs are slower than one
Hi, I run the Triton web server on two NVIDIA RTX 3090 Ti GPUs with --shm-size 20g. When I do inference, I get a time of about 1.56 s. But if I run the web server with only one GPU (--gpus '"device=0"'), I get a time of about 860 ms.
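For reference, the two setups I am comparing can be sketched as below. The Triton image tag and model repository path are placeholders I made up for illustration; only --shm-size 20g and --gpus '"device=0"' are the actual options from my runs:

```shell
# Two GPUs (both 3090 Ti visible): ~1.56 s per request in my tests.
# Image tag and /models path are assumptions, not from my exact setup.
docker run --rm --gpus all --shm-size 20g \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$PWD/triton_models:/models" \
  nvcr.io/nvidia/tritonserver:22.07-py3 \
  tritonserver --model-repository=/models

# One GPU only (device 0): ~860 ms per request.
docker run --rm --gpus '"device=0"' --shm-size 20g \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$PWD/triton_models:/models" \
  nvcr.io/nvidia/tritonserver:22.07-py3 \
  tritonserver --model-repository=/models
```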
The input sequence length was 256 tokens. I optimized GPT2-medium with your script:
convert_model -m gpt2-medium \
--backend tensorrt onnx \
--seq-len 32 512 512 \
--task text-generation --atol 2