
Questions about different intra-node settings for fastertransformer_backend and FasterTransformer

Open YJHMITWEB opened this issue 1 year ago • 4 comments

Hi, I am wondering why in FasterTransformer, intra-node GPUs are bound at the process level, while in fastertransformer_backend they are bound at the thread level? Since the two share the same source code, why does the intra-node binding differ?

YJHMITWEB avatar Mar 31 '23 19:03 YJHMITWEB

Multi-process is more flexible and stable because we can use it for both multi-GPU and multi-node. But in the Triton server, we want multiple model instances to share the same model, and hence we need to use multi-threading.
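
For illustration, here is a minimal sketch of the two binding styles, assuming the CUDA runtime, MPI, and `std::thread` (this is a simplification, not the actual FasterTransformer source):

```cpp
// Process-level binding (FasterTransformer's multi-GPU examples):
// launch one process per GPU, e.g. `mpirun -n 2 ./app`, and inside
// each process do roughly:
//     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
//     cudaSetDevice(rank);             // one process <-> one GPU
//
// Thread-level binding (the Triton backend style): one process,
// one thread per GPU, so all threads share the process address space.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

int main() {
    const int tensor_para_size = 2;      // GPUs in this node
    std::vector<std::thread> workers;
    for (int dev = 0; dev < tensor_para_size; ++dev) {
        workers.emplace_back([dev] {
            cudaSetDevice(dev);          // device binding is per thread
            // ... create this rank's model shard and serve requests;
            // host-side objects can be shared directly between threads.
        });
    }
    for (auto& t : workers) t.join();
    return 0;
}
```

Because all shards live in one process, several Triton model instances can be handed the same host-side model object, which is what motivates the thread-level design.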

byshiue avatar Apr 03 '23 00:04 byshiue

Hi @byshiue ,

Thanks for the reply. I am a little confused here. When tensor parallelism is enabled, FasterTransformer expects it to happen intra-node. For example, if each node has 2 GPUs and we set tensor_parallel=2, then when the model is loaded, the weights are sliced into two parts and each GPU loads one part. In this case, what do you mean by "we want multiple model instances to share the same model, and hence we need to use multi-threading"? After all, in this setup each thread is responsible for a different slice of the weights (see the sketch below).

Is my understanding correct?
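
To make the slicing concrete, here is a rough sketch of what I mean, assuming a column-wise split (the helper name and the exact split direction are hypothetical; this is not the actual FasterTransformer loader):

```cpp
#include <cstddef>
#include <vector>

// Slice a row-major [rows x cols] weight matrix into `tp` column shards;
// rank r keeps columns [r * cols/tp, (r+1) * cols/tp).
std::vector<float> column_shard(const std::vector<float>& full,
                                std::size_t rows, std::size_t cols,
                                int tp, int rank) {
    const std::size_t shard_cols = cols / tp;   // assumes cols % tp == 0
    std::vector<float> shard(rows * shard_cols);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < shard_cols; ++j)
            shard[i * shard_cols + j] = full[i * cols + rank * shard_cols + j];
    return shard;  // each GPU later copies only its own shard to device
}
```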

YJHMITWEB avatar Apr 04 '23 13:04 YJHMITWEB

Multi-instance is independent of TP. It is simpler to demonstrate on a single GPU: assume we have one GPU, we create a GPT model on it, and then create 2 model instances based on that GPT model. These two instances can then handle different requests while sharing the same weights.
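
A rough sketch of that layout, with hypothetical type names (not the actual FasterTransformer API):

```cpp
#include <memory>
#include <utility>
#include <vector>

// Weights: loaded once per GPU, read-only at serving time.
struct GptWeights {
    std::vector<float> params;
};

// Instance: cheap per-request state that points at the shared weights.
struct GptInstance {
    std::shared_ptr<const GptWeights> weights;   // shared, never copied
    std::vector<float> activations;              // private scratch buffers
    explicit GptInstance(std::shared_ptr<const GptWeights> w)
        : weights(std::move(w)) {}
};

int main() {
    auto weights = std::make_shared<const GptWeights>();
    GptInstance a(weights), b(weights);  // 2 instances, 1 copy of the weights
    // a and b can now serve different requests concurrently
    // (e.g. one per thread) while reading the same parameters.
    return 0;
}
```

The point is that the expensive part (the weights) is allocated once, while each instance only carries its own per-request state.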

byshiue avatar Apr 06 '23 06:04 byshiue

Oh, I see. I get it now: it is basically for handling different requests concurrently. Thanks for the explanation!

YJHMITWEB avatar Apr 06 '23 15:04 YJHMITWEB