♻️ replace deprecated functions for communication

Open mayank31398 opened this issue 3 years ago • 5 comments

This PR switches to the new communication functions: all_gather_into_tensor instead of _all_gather_base, and reduce_scatter_tensor instead of _reduce_scatter_base.

If these functions are not found (for older torch versions), we default to the deprecated functions.
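The fallback described above can be sketched as a small attribute-lookup helper. This is an illustrative sketch, not the PR's actual code: `resolve_all_gather` and the `SimpleNamespace` stand-in for an older `torch.distributed` module are hypothetical names invented here; only the function names `all_gather_into_tensor` and `_all_gather_base` come from the PR.

```python
import types

# Hypothetical stand-in for torch.distributed on an older torch build,
# which exposes only the deprecated private function.
old_dist = types.SimpleNamespace(
    _all_gather_base=lambda output, input: "deprecated path"
)

def resolve_all_gather(dist_module):
    """Prefer the new public API; fall back to the deprecated one if absent."""
    return getattr(
        dist_module,
        "all_gather_into_tensor",
        getattr(dist_module, "_all_gather_base", None),
    )

# On an old build the deprecated function is selected.
fn = resolve_all_gather(old_dist)
```

The same pattern applies to reduce_scatter_tensor / _reduce_scatter_base: one getattr check at import time, so call sites never branch on the torch version.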

mayank31398 avatar Mar 10 '23 15:03 mayank31398

@microsoft-github-policy-service agree

mayank31398 avatar Mar 10 '23 15:03 mayank31398

@jeffra @RezaYazdaniAminabadi @mrwyattii , I am unsure why the test is failing here. There is a CUDA OOM in nv-torch18-v100 / unit-tests; it doesn't seem related to the changes in this PR.

The test that is failing is test_moe_checkpoint

mayank31398 avatar Mar 10 '23 16:03 mayank31398

Thanks @tjruwase :) Want to get this in and start exploring the possibility of integrating torch.compile into deepspeed for accelerating training :)

mayank31398 avatar Mar 13 '23 14:03 mayank31398

If this one is a go, please force a CI restart - the HF side has been fixed.

stas00 avatar Mar 15 '23 18:03 stas00

Not sure why the CI is crashing @tjruwase . Doesn't seem to be a bug on my end. I will restart the CI build once more.

mayank31398 avatar Mar 16 '23 16:03 mayank31398

CI still seems to be broken :(

mayank31398 avatar Mar 17 '23 21:03 mayank31398

CI still seems to be broken :(

Sorry for delays, we're actively working on trying to unblock several CI related issues. Once this PR is merged we should be good to go: https://github.com/microsoft/DeepSpeed/pull/3047

jeffra avatar Mar 17 '23 21:03 jeffra

@jeffra @tjruwase the tests are still crashing, seems unrelated to this PR. Is the transformers version correct?

E           Traceback (most recent call last):
E             File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/transformers/examples/pytorch/language-modeling/run_clm.py", line 635, in <module>
E               main()
E             File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/transformers/examples/pytorch/language-modeling/run_clm.py", line 412, in main
E               model = AutoModelForCausalLM.from_pretrained(
E             File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/transformers/src/transformers/models/auto/auto_factory.py", line 470, in from_pretrained
E               model_class = _get_model_class(config, cls._model_mapping)
E             File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/transformers/src/transformers/models/auto/auto_factory.py", line 360, in _get_model_class
E               supported_models = model_mapping[type(config)]
E             File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/transformers/src/transformers/models/auto/auto_factory.py", line 602, in __getitem__
E               return self._load_attr_from_module(model_type, model_name)
E             File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/transformers/src/transformers/models/auto/auto_factory.py", line 616, in _load_attr_from_module
E               return getattribute_from_module(self._modules[module_name], attr)
E             File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/transformers/src/transformers/models/auto/auto_factory.py", line 561, in getattribute_from_module
E               if hasattr(module, attr):
E             File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/transformers/src/transformers/utils/import_utils.py", line 1109, in __getattr__
E               module = self._get_module(self._class_to_module[name])
E             File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/transformers/src/transformers/utils/import_utils.py", line 1121, in _get_module
E               raise RuntimeError(
E           RuntimeError: Failed to import transformers.models.gptj.modeling_gptj because of the following error (look up to see its traceback):
E           module 'torch' has no attribute 'fx'

mayank31398 avatar Mar 22 '23 13:03 mayank31398

Successful checks @jeffra @tjruwase

mayank31398 avatar Mar 24 '23 12:03 mayank31398

It's broken again :|

mayank31398 avatar Mar 24 '23 18:03 mayank31398

yeah, looks like something broke in transformers again - looking

stas00 avatar Mar 24 '23 18:03 stas00

I'm able to reproduce once I install safetensors - it works without it - investigating. OK, it has to do with the hub creating safetensors weights - not code related. Reported - will resolve soon.

stas00 avatar Mar 24 '23 18:03 stas00

@jeffra, @tjruwase - I validated the transformers tests failure is unrelated to this feature. It should be safe to merge. @mayank31398 has been waiting for eons for this to be merged. Thank you!

stas00 avatar Mar 24 '23 19:03 stas00

Thanks Stas :)

mayank31398 avatar Mar 24 '23 19:03 mayank31398

and the transformers job should be fine now if someone can trigger the rebuild.

stas00 avatar Mar 24 '23 20:03 stas00

Can we prioritize this one? @tjruwase :)

mayank31398 avatar Mar 29 '23 19:03 mayank31398

Can we prioritize this one? @tjruwase :)

Yes, will merge once this CI run completes.

tjruwase avatar Mar 29 '23 19:03 tjruwase

@mayank31398, I think the formatting issues can be fixed by upgrading pre-commit and clang-format

tjruwase avatar Mar 29 '23 19:03 tjruwase

@mayank31398, I think the formatting issues can be fixed by upgrading pre-commit and clang-format

I am not seeing any issues with the formatting in the CI. Are you suggesting updating pre-commit nonetheless, @tjruwase ?

mayank31398 avatar Mar 29 '23 20:03 mayank31398