fastertransformer_backend
Can't re-load any T5 model after a first load/unload iteration
Description
Branch: main
GPU: NVIDIA V100S
Docker version: 20.10.16
When re-loading a model that has already been loaded and unloaded once, the backend crashes with the following error:
I1019 08:53:38.952721 301 model_repository_manager.cc:997] loading: t5-small:1
*** The MPI_Init_thread() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[cf752fab27ae:00301] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
Do you have any idea where that comes from?
Reproduced Steps
1. Start a triton server with `--model-control-mode=explicit`
2. Load a T5 model on the fastertransformer backend (e.g. t5-small) `curl -vX POST localhost:8000/v2/repository/models/t5-small/load`
3. Unload the T5 model `curl -vX POST localhost:8000/v2/repository/models/t5-small/unload`
4. Re-load the T5 model `curl -vX POST localhost:8000/v2/repository/models/t5-small/load`
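The same load, unload, load sequence can be scripted. This is only a minimal sketch: it assumes the server from step 1 is listening on `localhost:8000` and that the model is named `t5-small`, as in the curl commands above. The helper names (`control_url`, `post`, `reproduce`) are made up for illustration.

```python
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000"  # same endpoint as the curl commands above
MODEL = "t5-small"                  # model name taken from the steps above


def control_url(action, model=MODEL, base_url=BASE_URL):
    """Build the model-control URL for 'load' or 'unload'."""
    if action not in ("load", "unload"):
        raise ValueError(f"unsupported action: {action}")
    return f"{base_url}/v2/repository/models/{model}/{action}"


def post(url, timeout=120):
    """Send an empty POST (like `curl -X POST`) and return the HTTP status code."""
    request = urllib.request.Request(url, data=b"", method="POST")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return err.code


def reproduce(send=post):
    """Run load -> unload -> load and return a list of (action, status) pairs."""
    results = []
    for action in ("load", "unload", "load"):
        results.append((action, send(control_url(action))))
    return results
```

Calling `reproduce()` against a running server performs steps 2 to 4. If the server process aborts during the second load, the last request will most likely raise a connection error instead of returning a status code.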
Because we need to run the program on multiple nodes, we integrated MPI into this backend. When you unload the model, `MPI_Finalize` is called, and MPI cannot be initialized again in the same process. So, for now, a model cannot be reloaded explicitly after it has been unloaded.
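To make the lifecycle problem concrete, here is a toy model of it in plain Python. It does not call MPI at all: `ToyMPI` only mimics the rule that a process may not initialize MPI again after finalizing it, and both backend classes are hypothetical stand-ins, not the actual backend code. The second class shows one conceivable direction (finalize only at process shutdown), which is an assumption for illustration and not something proposed in this thread.

```python
class ToyMPI:
    """Toy stand-in for the per-process MPI lifecycle (not real MPI)."""

    def __init__(self):
        self.initialized = False
        self.finalized = False

    def init(self):
        # Mirrors the rule from the error message: no init after finalize.
        if self.finalized:
            raise RuntimeError("init called after finalize")
        self.initialized = True

    def finalize(self):
        self.finalized = True


class PerModelBackend:
    """Hypothetical backend that ties MPI setup/teardown to each model."""

    def __init__(self, mpi):
        self.mpi = mpi

    def load(self):
        self.mpi.init()

    def unload(self):
        self.mpi.finalize()  # a second load() will now fail


class ProcessWideBackend:
    """Hypothetical variant: init once, finalize only at process shutdown."""

    def __init__(self, mpi):
        self.mpi = mpi

    def load(self):
        if not self.mpi.initialized:
            self.mpi.init()

    def unload(self):
        pass  # keep MPI alive so the model can be loaded again

    def shutdown(self):
        self.mpi.finalize()
```

With `PerModelBackend`, the sequence load, unload, load raises on the second load, which matches the abort in the log above. With `ProcessWideBackend`, the same sequence succeeds because finalization is deferred to `shutdown()`.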
Thanks for your answer!
Two questions:
- Do you have a timeline for when you will support loading/unloading? (We're heavily dependent on this feature.)
- Would I be able to fix this (with a bit of your help), or does it require too much background knowledge?
- A workaround may be to remove MPI from the backend if you don't need it. It requires some changes to the build, but you don't need to modify the source code of the kernels.
- We are not MPI experts, so we don't have many ideas for it.
Yeah, I was thinking of removing MPI from the backend. I will give it a try. (If it works, it might be good to add an option to build with or without it, and to explain the pros and cons.)
Keeping you up to date
As the amount of code changes necessary to make it work is significant, we decided not to handle it. Thanks again for your answers and keep me up to date if you find any fix/work-around 🙂