fastertransformer_backend Can't re-load any T5 model after a first load/unload iteration

Description

Branch: main
GPU: NVIDIA V100S
Docker version: 20.10.16

When re-loading a model that has previously been loaded and unloaded once, the backend crashes with the following error:

I1019 08:53:38.952721 301 model_repository_manager.cc:997] loading: t5-small:1
*** The MPI_Init_thread() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[cf752fab27ae:00301] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

Do you have any idea where that comes from?

Reproduced Steps

1. Start a triton server with `--model-control-mode=explicit`
2. Load a T5 model on the fastertransformer backend (e.g. t5-small) `curl -vX POST localhost:8000/v2/repository/models/t5-small/load`
3. Unload the T5 model `curl -vX POST localhost:8000/v2/repository/models/t5-small/unload`
4. Re-load the T5 model `curl -vX POST localhost:8000/v2/repository/models/t5-small/load`
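
For convenience, steps 2 to 4 can be wrapped in a small script. This is only a sketch: the `TRITON_URL`, `MODEL` and `DRY_RUN` variables and the `model_control` / `repro` function names are my own illustrative choices, not part of Triton or the backend. It assumes the server from step 1 is already running.

```shell
#!/usr/bin/env bash
# Hypothetical repro helper. TRITON_URL, MODEL and DRY_RUN are assumptions
# made for this sketch; only the endpoint paths come from the steps above.
TRITON_URL="${TRITON_URL:-localhost:8000}"
MODEL="${MODEL:-t5-small}"

# Send one load/unload request to the model repository endpoint.
# With DRY_RUN set, print the curl command instead of running it.
model_control() {
  local action="$1"
  local url="${TRITON_URL}/v2/repository/models/${MODEL}/${action}"
  if [ -n "${DRY_RUN:-}" ]; then
    echo "curl -vX POST ${url}"
  else
    curl -vX POST "${url}"
  fi
}

# Load, unload, then load again; the last request is the one that
# triggers the MPI abort described above.
repro() {
  model_control load
  model_control unload
  model_control load
}
```

After sourcing the file, call `repro` to send the three requests, or `DRY_RUN=1 repro` to only print the commands.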

Oct 19 '22 09:10 Thytu

Because we need to run the program on multiple nodes, we integrated MPI into this backend. When you unload the model, `MPI_Finalize` is called, and MPI cannot be initialized again in the same process. So explicitly reloading the model is not supported for now.

Oct 19 '22 09:10 byshiue

Thanks for your answer!

Two questions :

Do you have any timeline for when you will handle loading/unloading? (we're heavily dependent on this feature)
Would I be able to fix this (with a bit of your help), or does it require too much background knowledge?

Oct 19 '22 09:10 Thytu

A workaround may be to remove MPI from the backend if you don't need it. It requires some changes to the build, but you don't need to modify the source code of the kernels.
We are not MPI experts, so we don't have many ideas for it.

Oct 19 '22 09:10 byshiue

Yea, was thinking of removing MPI from the backend. I will give it a try. (If it works it might be good to add the option to either build with it or not, and explain the pros/cons of it)

Keeping you up to date

Oct 19 '22 09:10 Thytu

As the amount of code changes necessary to make it work is significant, we decided not to handle it. Thanks again for your answers and keep me up to date if you find any fix/work-around 🙂

Oct 19 '22 12:10 Thytu
