
MNIST single GPU example: GradScaler AssertionError

152334H opened this issue 11 months ago · 5 comments

What's the issue, what's expected?: python mnist.py --enable-msamp --opt-level=O2 should work with the versions pinned in pyproject.toml. Specifically, it should work with torch==2.2.1, given that torch is unpinned.

How to reproduce it?: Build MS-AMP with torch==2.2.1, then run the MNIST example as shown below.

Log message or snapshot?:

$ python mnist.py --enable-msamp --opt-level=O2
[2024-03-05 14:56:15,819] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
msamp is enabled, opt_level: O2
Traceback (most recent call last):
  File "/home/a/MS-AMP/examples/mnist.py", line 185, in <module>
    main()
  File "/home/a/MS-AMP/examples/mnist.py", line 176, in main
    train(args, model, device, train_loader, optimizer, epoch)
  File "/home/a/MS-AMP/examples/mnist.py", line 73, in train
    scaler.step(optimizer)
  File "/home/a/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 447, in step
    self.unscale_(optimizer)
  File "/home/a/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 337, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/home/a/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 255, in _unscale_grads_
    assert isinstance(param, torch.Tensor)
AssertionError

Additional information: This occurs because

  1. the isinstance check was introduced in this torch commit,
  2. the 'params' list of each group in optimizer.param_groups contains ScalingParameter objects, and
  3. ScalingParameter subclasses ScalingTensor, which does not subclass torch.Tensor (or anything else), so the isinstance check fails; see the snippet after this list.
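
For illustration, a minimal check that demonstrates point 3. This is a sketch that assumes MS-AMP is built and importable; the import path for ScalingParameter is an assumption about MS-AMP's module layout:

import torch
from msamp.nn.parameter import ScalingParameter  # import path is an assumption

# ScalingParameter derives from ScalingTensor, which is not a torch.Tensor
# subclass, so the assertion in GradScaler._unscale_grads_ rejects it.
print(issubclass(ScalingParameter, torch.Tensor))  # expected output: False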

Commenting out the assertion line manually fixes the issue. I do not know how to reasonably fix this without resorting to that.
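
For anyone who wants to avoid editing site-packages, here is a hedged sketch of the same workaround applied at runtime: it shadows the builtin isinstance inside torch's grad_scaler module so that ScalingParameter passes the check. The module paths and the ScalingParameter import are assumptions based on the tracebacks in this thread; this is a sketch, not an official MS-AMP fix:

import builtins
import torch

# torch 2.2 keeps the implementation in torch.cuda.amp.grad_scaler; newer
# releases moved it to torch.amp.grad_scaler, so try the newer path first.
try:
    import torch.amp.grad_scaler as grad_scaler_mod
except ImportError:
    import torch.cuda.amp.grad_scaler as grad_scaler_mod

from msamp.nn.parameter import ScalingParameter  # import path is an assumption

def _tolerant_isinstance(obj, classinfo):
    # Treat ScalingParameter as a Tensor for GradScaler's internal checks,
    # and delegate everything else to the real builtin.
    if classinfo is torch.Tensor and builtins.isinstance(obj, ScalingParameter):
        return True
    return builtins.isinstance(obj, classinfo)

# A module-level name shadows the builtin for code defined in that module,
# which neutralizes `assert isinstance(param, torch.Tensor)` there.
grad_scaler_mod.isinstance = _tolerant_isinstance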

152334H · Mar 05 '24 15:03

Can you share the detailed steps to reproduce this? It seems PyTorch 2.2 needs a higher version of NCCL, and currently we only support PyTorch 2.1 and 1.4.

tocean · Aug 14 '24 04:08

This one works: https://github.com/Azure/MS-AMP/issues/178#issuecomment-2240362717

xrsrke · Aug 14 '24 11:08

I ran into the same problem. My torch version is 2.4.0 with CUDA 12.1:

  File "/home/yatorho/doc/projs/MS-AMP/examples/mnist.py", line 182, in <module>
    main()
  File "/home/yatorho/doc/projs/MS-AMP/examples/mnist.py", line 173, in main
    train(args, model, device, train_loader, optimizer, epoch)
  File "/home/yatorho/doc/projs/MS-AMP/examples/mnist.py", line 73, in train
    scaler.step(optimizer)
  File "/home/yatorho/anaconda3/envs/t24/lib/python3.12/site-packages/torch/amp/grad_scaler.py", line 448, in step
    self.unscale_(optimizer)
  File "/home/yatorho/anaconda3/envs/t24/lib/python3.12/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
                                              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/yatorho/anaconda3/envs/t24/lib/python3.12/site-packages/torch/amp/grad_scaler.py", line 256, in _unscale_grads_
    assert isinstance(param, torch.Tensor), f"param is not a Tensor: {type(param)}"
AssertionError: param is not a Tensor: <class 'msamp.nn.parameter.ScalingParameter'>

The param's type is ScalingParameter.

yatorho · Aug 15 '24 03:08

Hi @yatorho, PyTorch added a new assertion that checks whether param is a torch.Tensor, but ScalingTensor in MS-AMP is not a torch.Tensor.

A temporary workaround is to comment out line 256 in torch/amp/grad_scaler.py: assert isinstance(param, torch.Tensor), f"param is not a Tensor: {type(param)}".
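
Since the exact file and line differ across torch versions (torch/cuda/amp/grad_scaler.py line 255 in 2.2.1, torch/amp/grad_scaler.py line 256 in 2.4.0, per the tracebacks above), a small sketch like the following prints the file you actually need to edit; inspect.getsourcefile is standard library, and the import paths follow the tracebacks in this thread:

import inspect

# GradScaler moved between modules across torch releases, so try the
# newer location first and fall back to the older one.
try:
    from torch.amp.grad_scaler import GradScaler
except ImportError:
    from torch.cuda.amp.grad_scaler import GradScaler

# Print the file containing the assertion so it can be commented out there.
print(inspect.getsourcefile(GradScaler))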

wkcn · Aug 15 '24 03:08

Thanks! It works for me.

yatorho · Aug 15 '24 03:08