
[BUG] CUDA `autocast` bug


Describe the bug

When using `autocast` in the dreamer example, there is a `RuntimeError`:

RuntimeError: masked_scatter_: expected self and source to have same dtypes but got Half and Float

Unfortunately, it seems to be a bug in PyTorch itself (https://github.com/pytorch/pytorch/issues/81876).

To Reproduce

Run the dreamer example.

Full output:

```
collector: MultiaSyncDataCollector()
init seed: 42, final seed: 971637020
  7%|████████▌                                                                                                            | 36800/500000 [20:23<4:48:42, 26.74it/s]
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/frank/Projects/rl_dev/examples/dreamer/dreamer.py", line 359, in main
    scaler2.scale(actor_loss_td["loss_actor"]).backward()
  File "/home/frank/anaconda3/envs/rl_dev/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/frank/anaconda3/envs/rl_dev/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: masked_scatter_: expected self and source to have same dtypes but got Half and Float
```

System info

Describe the characteristic of your environment:

  • Describe how the library was installed (pip, source, ...)
  • Python version
  • Versions of any other relevant libraries

```python
import torchrl, numpy, sys
print(torchrl.__version__, numpy.__version__, sys.version, sys.platform)
```

```
0.2.1 1.26.1 3.9.18 (main, Sep 11 2023, 13:41:44)
[GCC 11.2.0] linux
```

Reason and Possible fixes

Maybe we should disable `autocast` in the dreamer example until this bug is fixed upstream in PyTorch?
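As a minimal sketch of the failure mode (my own toy example, not taken from the dreamer code): `masked_scatter_` requires `self` and `source` to share a dtype, but under `autocast` one tensor can end up in Half while the other stays Float. Until the upstream fix lands, explicitly casting the source to the destination's dtype avoids the mismatch:

```python
import torch

dest = torch.zeros(4, dtype=torch.half)
mask = torch.tensor([True, False, True, False])
src = torch.ones(4, dtype=torch.float)

try:
    # Half `dest` vs Float `src` — recent PyTorch raises the
    # "expected self and source to have same dtypes" RuntimeError here.
    dest.masked_scatter_(mask, src)
except RuntimeError as e:
    print(e)

# Possible workaround: cast the source to the destination's dtype first.
dest.masked_scatter_(mask, src.to(dest.dtype))
print(dest)
```

This only sidesteps the forward mismatch; the backward-pass error in the traceback still needs the upstream fix (or disabling `autocast` around the affected loss).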

Checklist

  • [x] I have checked that there is no similar issue in the repo (required)
  • [x] I have read the documentation (required)
  • [x] I have provided a minimal working example to reproduce the bug (required)

FrankTianTT avatar Nov 19 '23 03:11 FrankTianTT