
Questions about different intra-node settings for fastertransformer_backend and FasterTransformer

Open YJHMITWEB opened this issue 1 year ago • 4 comments

Hi, I am wondering why in FasterTransformer, intra-node GPUs are bound at the process level, while in fastertransformer_backend they are bound at the thread level? Since the two share the same source code, why does the intra-node binding differ?

YJHMITWEB avatar Mar 31 '23 19:03 YJHMITWEB

Multi-process is more flexible and stable because we can use it for both multi-GPU and multi-node. But in the Triton server, we want multiple model instances to share the same model, and hence we need to use multi-threading.
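
For illustration, here is a minimal sketch of the two binding styles, assuming the CUDA runtime, MPI, and `std::thread` (this is a simplification, not the actual FasterTransformer source):

```cpp
// Process-level binding (FasterTransformer's multi-GPU examples):
// launch one process per GPU, e.g. `mpirun -n 2 ./app`, and inside
// each process do roughly:
//     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
//     cudaSetDevice(rank);             // one process <-> one GPU
//
// Thread-level binding (the Triton backend style): one process,
// one thread per GPU, so all threads share the process address space.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

int main() {
    const int tensor_para_size = 2;      // GPUs in this node
    std::vector<std::thread> workers;
    for (int dev = 0; dev < tensor_para_size; ++dev) {
        workers.emplace_back([dev] {
            cudaSetDevice(dev);          // device binding is per thread
            // ... create this rank's model shard and serve requests;
            // host-side objects can be shared directly between threads.
        });
    }
    for (auto& t : workers) t.join();
    return 0;
}
```

Because all shards live in one process, several Triton model instances can be handed the same host-side model object, which is what motivates the thread-level design.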

byshiue avatar Apr 03 '23 00:04 byshiue

Hi @byshiue ,

Thanks for the reply. I am a little confused here. When tensor parallelism is enabled, FasterTransformer expects it to happen intra-node. For example, if each node has 2 GPUs and we set tensor_parallel=2, then when the model is loaded, the weights are sliced into two parts and each GPU loads one part. In this case, what do you mean by "we want multiple model instances to share the same model, and hence we need to use multi-threading"? After all, in this setup each thread is responsible for a different slice of the weights (see the sketch below).

Is my understanding correct?
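
To make the slicing concrete, here is a rough sketch of what I mean, assuming a column-wise split (the helper name and the exact split direction are hypothetical; this is not the actual FasterTransformer loader):

```cpp
#include <cstddef>
#include <vector>

// Slice a row-major [rows x cols] weight matrix into `tp` column shards;
// rank r keeps columns [r * cols/tp, (r+1) * cols/tp).
std::vector<float> column_shard(const std::vector<float>& full,
                                std::size_t rows, std::size_t cols,
                                int tp, int rank) {
    const std::size_t shard_cols = cols / tp;   // assumes cols % tp == 0
    std::vector<float> shard(rows * shard_cols);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < shard_cols; ++j)
            shard[i * shard_cols + j] = full[i * cols + rank * shard_cols + j];
    return shard;  // each GPU later copies only its own shard to device
}
```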

YJHMITWEB avatar Apr 04 '23 13:04 YJHMITWEB

Multi-instance is independent of TP. It is simpler to demonstrate on a single GPU: assume we have one GPU, we create a GPT model on it, and then create 2 model instances based on that GPT model. These two instances can then handle different requests while sharing the same weights.
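
A rough sketch of that layout, with hypothetical type names (not the actual FasterTransformer API):

```cpp
#include <memory>
#include <utility>
#include <vector>

// Weights: loaded once per GPU, read-only at serving time.
struct GptWeights {
    std::vector<float> params;
};

// Instance: cheap per-request state that points at the shared weights.
struct GptInstance {
    std::shared_ptr<const GptWeights> weights;   // shared, never copied
    std::vector<float> activations;              // private scratch buffers
    explicit GptInstance(std::shared_ptr<const GptWeights> w)
        : weights(std::move(w)) {}
};

int main() {
    auto weights = std::make_shared<const GptWeights>();
    GptInstance a(weights), b(weights);  // 2 instances, 1 copy of the weights
    // a and b can now serve different requests concurrently
    // (e.g. one per thread) while reading the same parameters.
    return 0;
}
```

The point is that the expensive part (the weights) is allocated once, while each instance only carries its own per-request state.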

byshiue avatar Apr 06 '23 06:04 byshiue

Oh, I see. I get it now: it is basically for handling different requests concurrently. Thanks for the explanation!

YJHMITWEB avatar Apr 06 '23 15:04 YJHMITWEB