Ma, Guokai
The failures seem unrelated to this PR; I'll merge again to see if they persist.
```
File "C:\actions-runner\_work\pytorch\pytorch\test\test_cpp_extensions_open_device_registration.py", line 99, in test_base_device_registration
    x = torch.empty(4, 4, device=device)
ModuleNotFoundError:...
```
@pytorchbot merge
2.0 is about to launch officially. Has the status changed? What is the main obstacle to moving to TF2.0? I see an absl version conflict might be one reason,...
I did a search for `current_device()` in the DeepSpeed repo, and it looks like most occurrences of `current_device()` should be `current_device_name()` in order to be compatible with non-CUDA devices. Maybe increase...
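A minimal sketch of the distinction, using DeepSpeed's accelerator abstraction: `current_device()` returns a bare integer index, which PyTorch interprets as a CUDA ordinal, while `current_device_name()` returns a fully qualified device string that carries the backend.

```python
import torch
from deepspeed.accelerator import get_accelerator

idx = get_accelerator().current_device()        # e.g. 0 (an int)
name = get_accelerator().current_device_name()  # e.g. "cuda:0" or "xpu:0"

# Portable across accelerators: the device string names the backend.
x = torch.empty(4, 4, device=name)

# Not portable: an int device is treated as cuda:<idx> by PyTorch,
# so this breaks on non-CUDA backends.
# y = torch.empty(4, 4, device=idx)
```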
Hi @Yejing-Lai, can you give some explanation of the need for a granularity of 64 elements? https://github.com/microsoft/DeepSpeed/pull/4697/files#diff-214e32993d5440123080193836e988f024771aa4f6931c614ef9ad42a493f398R31
Hi @RezaYazdaniAminabadi, FYI: this PR improves AutoTP sharding when the number of heads is not divisible by the number of ranks. MLP layers will have better load balance when running AutoTP...
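A minimal sketch of the load-balancing idea (not DeepSpeed's actual AutoTP code): when heads don't divide evenly, spread the remainder over the first ranks so shard sizes differ by at most one head.

```python
def shard_sizes(num_heads: int, num_ranks: int) -> list[int]:
    """Balanced uneven split: remainder heads go to the first ranks."""
    base, rem = divmod(num_heads, num_ranks)
    return [base + (1 if r < rem else 0) for r in range(num_ranks)]

print(shard_sizes(25, 4))  # [7, 6, 6, 6] instead of e.g. [7, 7, 7, 4]
```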
Hi @tjruwase, is this PR still in review, or is it ready to merge? We are working on an Intel Extension for PyTorch release and want to know whether this PR will be included...
@ftian1 if an accelerator other than CUDA wants to support AutoTP WoQ, which set of OpBuilders/kernels needs to be implemented? Can you provide a link to the kernel usage in the code?
It would be better to detect custom kernel existence by checking attributes of the loaded ops and calling the custom kernel accordingly, so any accelerator that implements these kernels can be plugged in....
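A hedged sketch of that detection pattern, assuming DeepSpeed's `get_accelerator().create_op_builder(...)` entry point; the builder name `QuantizerBuilder` and the kernel attribute `dequantize_int4_to_half` are illustrative assumptions, not the actual symbols.

```python
from deepspeed.accelerator import get_accelerator

def get_dequantize_kernel():
    """Probe the loaded ops for a custom kernel instead of hard-coding CUDA."""
    try:
        # Builder name is an assumption, for illustration only.
        ops = get_accelerator().create_op_builder("QuantizerBuilder").load()
    except Exception:
        return None  # this accelerator did not build the op set
    # Detect by attribute: any accelerator exposing this symbol plugs in.
    return getattr(ops, "dequantize_int4_to_half", None)

kernel = get_dequantize_kernel()
if kernel is None:
    pass  # fall back to a pure-PyTorch dequantize path here
```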
@ftian1 Is the usage of WoQ with AutoTP similar to its usage with kernel injection? Can you post sample code showing what WoQ in DeepSpeed looks like with kernel injection?
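For reference, a hedged sketch of the kernel-injection entry point being asked about; `replace_with_kernel_inject` and `tensor_parallel` are real `deepspeed.init_inference` arguments, but how WoQ hooks into this call is exactly the open question, so no quantization knob is shown.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Kernel-injection inference; the question above is whether the WoQ
# path is configured the same way on top of this entry point.
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 1},   # AutoTP-style sharding
    dtype=torch.half,
    replace_with_kernel_inject=True,  # swap in DeepSpeed's fused kernels
)
```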