MS-AMP
AttributeError: 'ScalingTensor' object has no attribute 'view'
What's the issue, and what's expected?: An error occurs when using MS-AMP for LLM supervised fine-tuning (SFT). MS-AMP DeepSpeed config: "msamp": { "enabled": true, "opt_level": "O1|O2|O3", # "O1", "O2", and "O3" were all tried "use_te": false }
How to reproduce it?: Follow the DeepSpeed-Chat setup, then make two small code modifications to enable MS-AMP in DeepSpeed-Chat/training/step1_supervised_finetuning/main.py:
line 20 modify: import deepspeed -> from msamp import deepspeed
line 230 add: ds_config["msamp"] = { "enabled": True, "opt_level": "O1|O2|O3", "use_te": False }
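The two modifications above boil down to adding an "msamp" entry to the DeepSpeed config dict before engine initialization. A minimal sketch of that entry (here ds_config is an empty placeholder; in main.py it already holds the full DeepSpeed config, and the import swap from step one is shown only as a comment since it requires msamp to be installed):

```python
# from msamp import deepspeed  # line 20 change: replaces `import deepspeed`

ds_config = {}  # placeholder; in main.py this dict already holds the DeepSpeed config

# line 230 change: enable MS-AMP in the DeepSpeed config
ds_config["msamp"] = {
    "enabled": True,
    "opt_level": "O2",  # "O1", "O2", and "O3" were all tried, per this report
    "use_te": False,
}
```

The config dict is then passed to deepspeed.initialize as usual; only the "msamp" key differs from a stock DeepSpeed-Chat run.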
Log message or snapshot?:
Traceback (most recent call last):
File "/home/work/DeepSpeedExamples-master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 400, in <module>
main()
File "/home/work/DeepSpeedExamples-master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 369, in main
model.backward(loss)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/msamp/deepspeed/runtime/engine.py", line 405, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/usr/local/lib/python3.10/dist-packages/msamp/deepspeed/runtime/zero/fp8_stage_1_and_2.py", line 951, in backward
super().backward(loss.float(), retain_graph=retain_graph)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2040, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 491, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 288, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/usr/local/lib/python3.10/dist-packages/msamp/nn/functional.py", line 123, in backward
ctx.weight.backward_grad_update(wgrad)
File "/usr/local/lib/python3.10/dist-packages/msamp/common/tensor/tensor.py", line 130, in backward_grad_update
self._backward_post_hooks(grad)
File "/usr/local/lib/python3.10/dist-packages/msamp/common/tensor/hook.py", line 47, in __call__
hook(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1581, in _call_impl
hook_result = hook(self, args, result)
File "/usr/local/lib/python3.10/dist-packages/msamp/deepspeed/runtime/zero/fp8_stage_1_and_2.py", line 386, in reduce_partition_and_remove_grads
self.fp8_reduce_ready_partitions_and_remove_grads(param, i)
File "/usr/local/lib/python3.10/dist-packages/msamp/deepspeed/runtime/zero/fp8_stage_1_and_2.py", line 595, in fp8_reduce_ready_partitions_and_remove_grads
self.fp8_reduce_independent_p_g_buckets_and_remove_grads(param, i)
File "/usr/local/lib/python3.10/dist-packages/msamp/deepspeed/runtime/zero/fp8_stage_1_and_2.py", line 412, in fp8_reduce_independent_p_g_buckets_and_remove_grads
self.fp8_reduce_ipg_grads()
File "/usr/local/lib/python3.10/dist-packages/msamp/deepspeed/runtime/zero/fp8_stage_1_and_2.py", line 541, in fp8_reduce_ipg_grads
self.fp8_average_tensor(self.fp8_extra_large_param_to_reduce.grad.view(-1))
AttributeError: 'ScalingTensor' object has no attribute 'view'
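The failure mode at the bottom of the traceback can be shown in isolation: fp8_reduce_ipg_grads calls .view(-1) on a gradient that is an MS-AMP ScalingTensor rather than a plain torch.Tensor, and any object lacking a view method raises exactly this AttributeError. Below, ScalingTensorStub is a hypothetical stand-in, not MS-AMP's actual class:

```python
class ScalingTensorStub:
    """Hypothetical stand-in for msamp's ScalingTensor: wraps raw data plus a
    scale factor, but does not expose the torch.Tensor API (no .view method)."""
    def __init__(self, data, scale):
        self.data = data
        self.scale = scale

grad = ScalingTensorStub([1.0, 2.0], scale=0.5)
try:
    grad.view(-1)  # mirrors fp8_average_tensor(...grad.view(-1)) in the traceback
except AttributeError as e:
    print(e)  # 'ScalingTensorStub' object has no attribute 'view'
```

This suggests the extra-large-parameter reduce path in fp8_stage_1_and_2.py assumes a plain tensor gradient and does not account for ScalingTensor, which is what the grad is under the MS-AMP optimizer levels tried here.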
Additional information: environment: ghcr.io/azure/msamp:v0.4.0-cuda12.2; GPU: 8 × H100