MS-AMP
MNIST single GPU example: GradScaler AssertionError
What's the issue, what's expected?:

`python mnist.py --enable-msamp --opt-level=O2` should work with the versions pinned in `pyproject.toml`. Specifically, it should work with `torch==2.2.1`, given that torch is unpinned.
How to reproduce it?:

Build MS-AMP with `torch==2.2.1`.

Log message or snapshot?:
```
$ python mnist.py --enable-msamp --opt-level=O2
[2024-03-05 14:56:15,819] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
msamp is enabled, opt_level: O2
Traceback (most recent call last):
  File "/home/a/MS-AMP/examples/mnist.py", line 185, in <module>
    main()
  File "/home/a/MS-AMP/examples/mnist.py", line 176, in main
    train(args, model, device, train_loader, optimizer, epoch)
  File "/home/a/MS-AMP/examples/mnist.py", line 73, in train
    scaler.step(optimizer)
  File "/home/a/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 447, in step
    self.unscale_(optimizer)
  File "/home/a/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 337, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/home/a/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 255, in _unscale_grads_
    assert isinstance(param, torch.Tensor)
AssertionError
```
Additional information: This occurs because

- the `isinstance` check was introduced in this torch commit,
- each group's `'params'` list in `optimizer.param_groups` contains `ScalingParameter`s (see the sketch after this list),
- `ScalingParameter` subclasses `ScalingTensor`, which subclasses nothing, so the `isinstance` check fails.
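A quick way to confirm the second bullet is to walk the optimizer's param groups and report anything that is not a `torch.Tensor`. This is a hedged diagnostic sketch: `find_non_tensor_params` is a made-up helper name, and only the standard `param_groups` structure is assumed.

```python
import torch

def find_non_tensor_params(optimizer):
    """Return (group_index, param_index, type_name) for every optimizer
    parameter that would fail GradScaler's new isinstance check."""
    offenders = []
    for gi, group in enumerate(optimizer.param_groups):
        for pi, param in enumerate(group["params"]):
            if not isinstance(param, torch.Tensor):
                offenders.append((gi, pi, type(param).__name__))
    return offenders

# With a plain PyTorch optimizer the list is empty; with the optimizer
# returned by msamp.initialize(...) at O2 (as in mnist.py), it would list
# the ScalingParameter entries that trip the assertion (not run here).
model = torch.nn.Linear(4, 4)
opt = torch.optim.Adam(model.parameters())
print(find_non_tensor_params(opt))  # -> []
```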
Manually commenting out the assertion line fixes the issue, but I do not know how to fix this reasonably without resorting to that.
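An untested alternative that avoids editing site-packages: CPython's `-O` flag strips `assert` statements at compile time, so running `python -O mnist.py --enable-msamp --opt-level=O2` should skip this check, at the cost of disabling every assert in the process.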
Can you share the detailed reproduction steps? It seems PyTorch 2.2 needs a higher version of NCCL, and currently we only support PyTorch 2.1 and 1.4.
This one works: https://github.com/Azure/MS-AMP/issues/178#issuecomment-2240362717
I ran into the same problem. My torch version is 2.4.0 with CUDA 12.1:
File "/home/yatorho/doc/projs/MS-AMP/examples/mnist.py", line 182, in <module>
main()
File "/home/yatorho/doc/projs/MS-AMP/examples/mnist.py", line 173, in main
train(args, model, device, train_loader, optimizer, epoch)
File "/home/yatorho/doc/projs/MS-AMP/examples/mnist.py", line 73, in train
scaler.step(optimizer)
File "/home/yatorho/anaconda3/envs/t24/lib/python3.12/site-packages/torch/amp/grad_scaler.py", line 448, in step
self.unscale_(optimizer)
File "/home/yatorho/anaconda3/envs/t24/lib/python3.12/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
optimizer_state["found_inf_per_device"] = self._unscale_grads_(
^^^^^^^^^^^^^^^^^^^^^
File "/home/yatorho/anaconda3/envs/t24/lib/python3.12/site-packages/torch/amp/grad_scaler.py", line 256, in _unscale_grads_
assert isinstance(param, torch.Tensor), f"param is not a Tensor: {type(param)}"
AssertionError: param is not a Tensor: <class 'msamp.nn.parameter.ScalingParameter'>
The `param`'s type is `ScalingParameter`.
Hi @yatorho, PyTorch added a new assertion to check whether `param` is a `torch.Tensor`, but `ScalingTensor` in MS-AMP is not a `torch.Tensor`. A temporary solution is to comment out line 256 in `torch/amp/grad_scaler.py`: `assert isinstance(param, torch.Tensor), f"param is not a Tensor: {type(param)}"`.
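To find the right file to edit rather than guessing a path, you can ask the interpreter where the module lives. A minimal sketch, assuming torch >= 2.4, where `GradScaler` lives in `torch.amp.grad_scaler` (on 2.2 the traceback above points at `torch/cuda/amp/grad_scaler.py` instead):

```python
import inspect
import torch.amp.grad_scaler as gs

# Print the path of the grad_scaler.py this environment actually imports,
# i.e. the file containing the assertion to comment out.
print(inspect.getsourcefile(gs))
```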
Thanks! It works for me.