DeepSpeed
fix uneven issue & add balance autotp
This PR aims to balance the shard sizes across workers as evenly as possible.
- We refactor the tp_shard logic so that AutoTP works when split_shape % num_kv_heads != 0.
- When num_kv_heads is defined, the attention module relies on it for sharding, while the mlp and lm_head modules can use near-even division to get more balanced shards, which yields better performance (see the sketch below).
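For reviewers, here is a minimal sketch of the intended behavior; the function names and example sizes are illustrative only, not the actual tp_shard API. Attention shards are built from whole KV heads, while mlp/lm_head shards use near-even division so sizes differ by at most one element.

```python
def balanced_shard_sizes(total_size: int, num_ranks: int) -> list:
    """Split total_size into num_ranks parts whose sizes differ by at most 1."""
    base, remainder = divmod(total_size, num_ranks)
    # The first `remainder` ranks receive one extra element.
    return [base + 1 if rank < remainder else base for rank in range(num_ranks)]


def attention_shard_sizes(hidden_size: int, num_kv_heads: int, num_ranks: int) -> list:
    """Shard attention by whole KV heads: distribute heads near-evenly,
    then convert head counts back to element counts."""
    head_dim = hidden_size // num_kv_heads
    heads_per_rank = balanced_shard_sizes(num_kv_heads, num_ranks)
    return [h * head_dim for h in heads_per_rank]


if __name__ == "__main__":
    # MLP intermediate size 11008 split over 3 ranks: sizes differ by at most 1.
    print(balanced_shard_sizes(11008, 3))          # [3670, 3669, 3669]
    # 32 KV heads of dim 128 split over 3 ranks: whole heads per rank.
    print(attention_shard_sizes(32 * 128, 32, 3))  # [1408, 1408, 1280]
```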
Hi @RezaYazdaniAminabadi @delock. Could you please help review this PR? Thanks~
Hi @Yejing-Lai, can you give some explanation of the need to have a granularity of 64 elements? https://github.com/microsoft/DeepSpeed/pull/4697/files#diff-214e32993d5440123080193836e988f024771aa4f6931c614ef9ad42a493f398R31
DNN libraries favor tensor sizes with a power-of-2 granularity; we pick 64 as a common granularity size.
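As an illustration only (the helper below is not the PR's code), a near-even split can be aligned to a 64-element grain by distributing whole grains across ranks and letting the last rank absorb any remainder:

```python
GRAIN = 64  # illustrative default; matches the 64-element granularity discussed above

def aligned_shard_sizes(total_size: int, num_ranks: int, grain: int = GRAIN) -> list:
    """Split total_size into num_ranks shards, each a multiple of `grain`,
    except the last shard, which absorbs any sub-grain remainder."""
    whole_grains, leftover = divmod(total_size, grain)
    base, extra = divmod(whole_grains, num_ranks)
    # Distribute whole 64-element grains near-evenly across ranks.
    sizes = [(base + 1 if rank < extra else base) * grain for rank in range(num_ranks)]
    # Give the sub-grain leftover (< 64 elements) to the last rank.
    sizes[-1] += leftover
    return sizes

if __name__ == "__main__":
    print(aligned_shard_sizes(11008, 3))  # [3712, 3648, 3648]
```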
Hi @RezaYazdaniAminabadi, FYI this PR improves AutoTP sharding when the number of heads is not divisible by the number of ranks. MLP layers will have better load balance when running AutoTP on 3 devices or 3 CPU sub-NUMA clusters.
@Yejing-Lai, please help resolve the conflict.
Hi @tjruwase. I resolved the conflict. Can you approve the workflows? Thanks~
Hi @tjruwase. The conflict has been resolved. Could you please help review this PR? Thanks~
Hi @tjruwase, is this PR still under review or ready to merge? We are working on an Intel Extension for PyTorch release and want to know whether this PR will be included in the next DeepSpeed release. Thanks!
@delock, sorry for the delay. This should be reviewed soon and will be included in the next release.