ColossalAI
[WIP][Infer] Inference Distributed RPC Framework Optimization
- Optimize the data path: from List -> CPU Tensor -> List -> rpc_param -> GPU Tensor to List -> rpc_param -> GPU Tensor
- Wrap the async forward only once
- Only the rank 0 worker runs the sampler and returns the result
- Pass the rpc param to worker 0 only instead of to all workers; worker 0 then broadcasts the param to the other workers using NCCL.
Performance is still not good enough and needs further optimization.
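
The control flow described in the list above can be sketched roughly as follows. This is a single-process, standard-library-only simulation, not the code in this PR: `Worker`, `broadcast_from_rank0`, and `step` are hypothetical names, the broadcast is a plain Python loop standing in for an NCCL broadcast, and the "model" is a toy function. It only illustrates that the driver sends one payload per step (to worker 0), that every rank runs the forward pass, and that only rank 0 samples and returns a value.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Worker:
    """Hypothetical stand-in for one inference worker process."""
    rank: int
    received: Optional[List[int]] = None  # params this worker has seen

    def forward(self) -> List[float]:
        # Toy "model": every rank derives the same logits from the same input.
        assert self.received is not None, "params must be broadcast first"
        return [float(t % 7) for t in self.received]

    def sample(self, logits: List[float]) -> int:
        # Greedy sampling: index of the first largest logit.
        return max(range(len(logits)), key=logits.__getitem__)


def broadcast_from_rank0(workers: List[Worker], params: List[int]) -> None:
    # Stand-in for an NCCL broadcast rooted at rank 0 (assumption:
    # the real path would move a GPU tensor, not a Python list).
    for w in workers:
        w.received = list(params)


def step(workers: List[Worker], token_ids: List[int], stats: Dict[str, int]) -> int:
    # The driver sends the rpc param to worker 0 only: one payload per step.
    stats["rpc_payloads"] += 1
    broadcast_from_rank0(workers, token_ids)

    result = -1
    for w in workers:
        logits = w.forward()           # every rank runs the forward pass
        if w.rank == 0:
            result = w.sample(logits)  # only rank 0 samples and returns
    return result
```

With four simulated workers, `step(workers, [3, 12, 5], stats)` increments `stats["rpc_payloads"]` by one regardless of the number of workers, which is the point of routing the param through worker 0.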