♻️ Replace deprecated communication functions

This PR switches to the new communication functions:
- all_gather_into_tensor instead of _all_gather_base
- reduce_scatter_tensor instead of _reduce_scatter_base

If these functions are not found (on older torch versions), we fall back to the deprecated ones.
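The fallback described above can be sketched as a small dispatch helper. This is a hypothetical illustration, not the actual DeepSpeed code: `resolve_collective` and the stub namespaces are made up here to show the pattern of preferring the new torch.distributed name and falling back to the deprecated one when it is missing.

```python
from types import SimpleNamespace

def resolve_collective(dist, new_name, old_name):
    """Return the new collective if the namespace exposes it, else the legacy one.

    In real use, `dist` would be torch.distributed and the names would be
    e.g. "all_gather_into_tensor" / "_all_gather_base".
    """
    fn = getattr(dist, new_name, None)
    return fn if fn is not None else getattr(dist, old_name)

# Simulate an older torch that only has the deprecated function:
old_dist = SimpleNamespace(_all_gather_base=lambda out, inp: "legacy")
fn = resolve_collective(old_dist, "all_gather_into_tensor", "_all_gather_base")
```

Resolving the function once (rather than checking `hasattr` on every call) keeps the hot collective path free of per-call attribute lookups.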
@microsoft-github-policy-service agree
@jeffra @RezaYazdaniAminabadi @mrwyattii , I am unsure why the test is failing here.
There is a CUDA OOM in nv-torch18-v100 / unit-tests.
Doesn't seem related to the changes in this PR.
The test that is failing is test_moe_checkpoint
Thanks @tjruwase :) Want to get this in and start exploring the possibility of integrating torch.compile into deepspeed for accelerating training :)
if this one is a go please force a CI restart - HF side has been fixed.
Not sure why the CI is crashing @tjruwase. It doesn't seem to be a bug on my end. I will restart the CI build once more.
CI still seems to be broken :(
Sorry for delays, we're actively working on trying to unblock several CI related issues. Once this PR is merged we should be good to go: https://github.com/microsoft/DeepSpeed/pull/3047
@jeffra @tjruwase the tests are still crashing, and the failure seems unrelated to this PR. Is the transformers version correct?
E Traceback (most recent call last):
E File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/transformers/examples/pytorch/language-modeling/run_clm.py", line 635, in <module>
E main()
E File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/transformers/examples/pytorch/language-modeling/run_clm.py", line 412, in main
E model = AutoModelForCausalLM.from_pretrained(
E File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/transformers/src/transformers/models/auto/auto_factory.py", line 470, in from_pretrained
E model_class = _get_model_class(config, cls._model_mapping)
E File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/transformers/src/transformers/models/auto/auto_factory.py", line 360, in _get_model_class
E supported_models = model_mapping[type(config)]
E File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/transformers/src/transformers/models/auto/auto_factory.py", line 602, in __getitem__
E return self._load_attr_from_module(model_type, model_name)
E File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/transformers/src/transformers/models/auto/auto_factory.py", line 616, in _load_attr_from_module
E return getattribute_from_module(self._modules[module_name], attr)
E File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/transformers/src/transformers/models/auto/auto_factory.py", line 561, in getattribute_from_module
E if hasattr(module, attr):
E File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/transformers/src/transformers/utils/import_utils.py", line 1109, in __getattr__
E module = self._get_module(self._class_to_module[name])
E File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/transformers/src/transformers/utils/import_utils.py", line 1121, in _get_module
E raise RuntimeError(
E RuntimeError: Failed to import transformers.models.gptj.modeling_gptj because of the following error (look up to see its traceback):
E module 'torch' has no attribute 'fx'
Successful checks @jeffra @tjruwase
it's broken again :|
yeah, looks like something broke in transformers again - looking
I'm able to reproduce once I install safetensors - it works without it - investigating. OK, it has to do with the hub creating safetensors weights - not code related. Reported - will resolve soon.
@jeffra, @tjruwase - I validated the transformers tests failure is unrelated to this feature. It should be safe to merge. @mayank31398 has been waiting for eons for this to be merged. Thank you!
Thanks Stas :)
and the transformers job should be fine now if someone can trigger the rebuild.
Can we prioritize this one? @tjruwase :)
Yes, will merge once this CI run completes.
@mayank31398, I think the formatting issues can be fixed by upgrading pre-commit and clang-format.
I am not seeing any issues with the formatting in the CI. Are you suggesting we update pre-commit anyway @tjruwase?