Nanoflow
Does this method have the same benefit when tp=1 or tp=2?
When TP is small, each GPU has to hold a larger share of the model weights, so almost all of the available GPU memory is occupied by weights. That leaves little room for requests, so the batch size has to be reduced and the batching effect becomes less significant. Reducing TP would therefore greatly harm the system's performance.
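To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It is not taken from NanoFlow; the function name `max_batch_size` and all of the numbers (80 GB GPUs, 140 GB of weights, 0.5 GB of KV cache per request) are assumptions chosen only for illustration. It also assumes that weights and KV cache are split evenly across the TP group and ignores activations and other overhead.

```python
def max_batch_size(tp, gpu_mem_gb, weight_gb, kv_gb_per_request):
    """Rough upper bound on concurrent requests for a given TP degree.

    Hypothetical model: weights and KV cache are sharded evenly across
    the TP group; activations and fragmentation are ignored.
    """
    weights_per_gpu = weight_gb / tp
    free_per_gpu = gpu_mem_gb - weights_per_gpu
    if free_per_gpu <= 0:
        return 0  # the weights alone do not fit at this TP degree
    total_free = free_per_gpu * tp  # memory left for KV cache across the group
    return int(total_free // kv_gb_per_request)


# Illustrative (assumed) numbers, not measurements.
for tp in (1, 2, 4, 8):
    print(tp, max_batch_size(tp, gpu_mem_gb=80, weight_gb=140, kv_gb_per_request=0.5))
```

Under these assumed numbers, the model does not fit at all at TP=1, TP=2 leaves room for about 40 requests, and TP=8 leaves room for about 1000, which is the gap in batch size described above.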