Ma, Guokai
The failures seem unrelated to this PR; I'll merge again to see if they persist.
```
File "C:\actions-runner\_work\pytorch\pytorch\test\test_cpp_extensions_open_device_registration.py", line 99, in test_base_device_registration
    x = torch.empty(4, 4, device=device)
ModuleNotFoundError:...
```
@pytorchbot merge
2.0 is about to launch officially. Has the status changed? What is the main obstacle to moving to TF2.0? I see an absl version conflict might be one reason,...
I did a search for `current_device()` in the DeepSpeed repo, and it looks like most occurrences of `current_device()` should be `current_device_name()` in order to be compatible with non-CUDA devices. Maybe increase...
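A minimal sketch of the distinction, using DeepSpeed's accelerator abstraction: `current_device()` returns a bare integer index, which PyTorch interprets as a CUDA ordinal, while `current_device_name()` returns a fully qualified device string that carries the backend.

```python
import torch
from deepspeed.accelerator import get_accelerator

idx = get_accelerator().current_device()        # e.g. 0 (an int)
name = get_accelerator().current_device_name()  # e.g. "cuda:0" or "xpu:0"

# Portable across accelerators: the device string names the backend.
x = torch.empty(4, 4, device=name)

# Not portable: an int device is treated as cuda:<idx> by PyTorch,
# so this breaks on non-CUDA backends.
# y = torch.empty(4, 4, device=idx)
```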
Hi @Yejing-Lai, can you give some explanation of the need for a granularity of 64 elements? https://github.com/microsoft/DeepSpeed/pull/4697/files#diff-214e32993d5440123080193836e988f024771aa4f6931c614ef9ad42a493f398R31
Hi @RezaYazdaniAminabadi, FYI: this PR improves AutoTP sharding when the number of heads is not divisible by the number of ranks. MLP layers will have better load balance when running AutoTP...
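A minimal sketch of the load-balancing idea (not DeepSpeed's actual AutoTP code): when heads don't divide evenly, spread the remainder over the first ranks so shard sizes differ by at most one head.

```python
def shard_sizes(num_heads: int, num_ranks: int) -> list[int]:
    """Balanced uneven split: remainder heads go to the first ranks."""
    base, rem = divmod(num_heads, num_ranks)
    return [base + (1 if r < rem else 0) for r in range(num_ranks)]

print(shard_sizes(25, 4))  # [7, 6, 6, 6] instead of e.g. [7, 7, 7, 4]
```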
Hi @tjruwase, is this PR still in review, or is it ready to merge? We are working on an Intel Extension for PyTorch release and want to know whether this PR will be included...
@ftian1 if an accelerator other than CUDA wants to support AutoTP WoQ, which set of OpBuilders/kernels needs to be implemented? Can you provide a link to the kernel usage in the code?
It would be better to detect custom kernel existence by checking attributes of the loaded ops and calling the custom kernel accordingly, so any accelerator that implements these kernels can be plugged in....
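A hedged sketch of that detection pattern, assuming DeepSpeed's `get_accelerator().create_op_builder(...)` entry point; the builder name `QuantizerBuilder` and the kernel attribute `dequantize_int4_to_half` are illustrative assumptions, not the actual symbols.

```python
from deepspeed.accelerator import get_accelerator

def get_dequantize_kernel():
    """Probe the loaded ops for a custom kernel instead of hard-coding CUDA."""
    try:
        # Builder name is an assumption, for illustration only.
        ops = get_accelerator().create_op_builder("QuantizerBuilder").load()
    except Exception:
        return None  # this accelerator did not build the op set
    # Detect by attribute: any accelerator exposing this symbol plugs in.
    return getattr(ops, "dequantize_int4_to_half", None)

kernel = get_dequantize_kernel()
if kernel is None:
    pass  # fall back to a pure-PyTorch dequantize path here
```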
@ftian1 Is the usage of WoQ with AutoTP similar to its usage with kernel injection? Can you post sample code showing what WoQ in DeepSpeed looks like with kernel injection?
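For reference, a hedged sketch of the kernel-injection entry point being asked about; `replace_with_kernel_inject` and `tensor_parallel` are real `deepspeed.init_inference` arguments, but how WoQ hooks into this call is exactly the open question, so no quantization knob is shown.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Kernel-injection inference; the question above is whether the WoQ
# path is configured the same way on top of this entry point.
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 1},   # AutoTP-style sharding
    dtype=torch.half,
    replace_with_kernel_inject=True,  # swap in DeepSpeed's fused kernels
)
```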