CTranslate2
Ideas for better performance
Hello. I want to run the NLLB-200 (3.3B) model on a server with 4x RTX 3090 GPUs and a 16-core AMD EPYC CPU. I wrapped CTranslate2 in FastAPI, served with uvicorn, inside a Docker container with GPU support.
All code is here, feel free to do whatever with it: https://github.com/hobodrifterdavid/nllb-docker-rest
I want to handle requests with between 1 and 1000 sentences, with a reasonable balance between latency and throughput.
Here are a few things I did after reading the documentation:
for ctranslate2.Translator:
device='auto', # May use CPU for very small translations?
compute_type='float16',
device_index=[0, 1, 2, 3]
for translator.translate_batch:
max_batch_size=256 # Bigger than this I get Cuda OOM errors.
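Putting those options together, a minimal sketch of the setup might look like the following. The model path, the helper names, and the eng_Latn target token are my own assumptions; NLLB models do need the target language token passed as a target prefix when used with CTranslate2.

```python
def make_target_prefixes(lang_token: str, n: int) -> list:
    # NLLB expects the target language token (e.g. "eng_Latn")
    # as a target prefix for every sentence in the batch.
    return [[lang_token]] * n

def load_translator(model_dir: str):
    import ctranslate2  # imported lazily so the sketch stays importable
    return ctranslate2.Translator(
        model_dir,                  # path to the converted NLLB model
        device="auto",              # picks CUDA when a GPU is visible
        compute_type="float16",     # roughly halves GPU memory vs. float32
        device_index=[0, 1, 2, 3],  # one model replica per 3090
    )

def translate(translator, tokens, lang_token="eng_Latn"):
    # `tokens` is a list of already-tokenized source sentences.
    return translator.translate_batch(
        tokens,
        max_batch_size=256,  # larger batches hit CUDA OOM here
        target_prefix=make_target_prefixes(lang_token, len(tokens)),
    )
```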
I tried translate_batch with asynchronous=True, ~~but couldn't easily figure out how to await the results~~ (EDIT: figured it out, added results below)
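For reference, one way to await those results from async code: with asynchronous=True, translate_batch returns handles whose result() method blocks until the translation finishes, so the blocking calls can be pushed onto a thread pool to keep the event loop free. The helper name is mine; this is a sketch, not the repo's actual code.

```python
import asyncio

async def translate_async(translator, batch, max_batch_size=256):
    # asynchronous=True makes translate_batch return immediately with
    # async result handles; each handle's .result() blocks until ready.
    handles = translator.translate_batch(
        batch, max_batch_size=max_batch_size, asynchronous=True
    )
    loop = asyncio.get_running_loop()
    # Off-load the blocking .result() calls to the default thread pool
    # so the event loop keeps serving other requests meanwhile.
    return [await loop.run_in_executor(None, h.result) for h in handles]
```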
uvicorn is run without the --workers flag, so it defaults to a single Python process, with one model loaded into GPU RAM. FastAPI accepts up to 40 concurrent requests.
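One way to enforce a cap like that inside a single uvicorn worker is an asyncio.Semaphore around the translation call; requests beyond the limit queue instead of piling more work onto the GPUs. This is only a sketch of the idea (the names and the exact limit are illustrative), not how FastAPI itself throttles requests.

```python
import asyncio

MAX_CONCURRENT_REQUESTS = 40  # cap mentioned above; tune for your hardware

_semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

async def handle_request(translate, batch):
    # `translate` is any awaitable translation call; requests past the
    # cap wait here until a slot frees up.
    async with _semaphore:
        return await translate(batch)
```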
Anyway, I'll carry on trying to improve this setup and will post further results. Suggestions for anything I missed would be appreciated. Python is not my first language, so please excuse any naive errors.