
Enabling Apex causes Pytorch autograd to crash

Open michaelklachko opened this issue 4 years ago • 6 comments

Installed apex with pip, tried running main_amp.py example. Getting this error: SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f29b2420bd0> returned NULL without setting an error

CUDA_VISIBLE_DEVICES=7,1 python -m torch.distributed.launch --nproc_per_node=2 --master_port=$RANDOM main_amp.py --data_dir /mnt/ssd2tb/imagenet -b 128


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


opt_level = None keep_batchnorm_fp32 = None <class 'NoneType'> loss_scale = None <class 'NoneType'>

CUDNN VERSION: 7603

opt_level = None keep_batchnorm_fp32 = None <class 'NoneType'> loss_scale = None <class 'NoneType'>

CUDNN VERSION: 7603

=> creating model 'resnet18'
=> creating model 'resnet18'

main_amp.py:42: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /opt/conda/conda-bld/pytorch_1607370141920/work/torch/csrc/utils/tensor_numpy.cpp:141.)
  tensor[i] += torch.from_numpy(nump_array)

(the identical UserWarning is printed several more times, once per data-loader worker)

Traceback (most recent call last):
  File "main_amp.py", line 547, in <module>
    main()
  File "main_amp.py", line 247, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "main_amp.py", line 357, in train
    loss.backward()
  File "/home/mklachko/miniconda3/envs/pt/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/mklachko/miniconda3/envs/pt/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7fbf43f9cbd0> returned NULL without setting an error

(the second worker process raises the identical SystemError, after which the launcher aborts:)

Traceback (most recent call last):
  File "/home/mklachko/miniconda3/envs/pt/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/mklachko/miniconda3/envs/pt/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/mklachko/miniconda3/envs/pt/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/mklachko/miniconda3/envs/pt/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/mklachko/miniconda3/envs/pt/bin/python', '-u', 'main_amp.py', '--local_rank=1', '--data_dir', '/mnt/ssd2tb/imagenet', '-b', '128']' returned non-zero exit status 1.

(pt) mklachko@server:~/$ conda list apex
# packages in environment at /home/mklachko/miniconda3/envs/pt:
#
# Name    Version    Build    Channel
apex      0.1        pypi_0   pypi

@mcarilli this happens regardless of the opt_level (the trace above is from not passing any opt_level at all, just from using the Apex version of DDP). The single-GPU (non-DDP) version of this script works fine, and the original PyTorch DDP also works fine (this version: https://github.com/pytorch/examples/blob/master/imagenet/main.py). I only started getting this error after updating Apex to the latest version; an older version (~6 months old) worked fine, but I don't know which one it was or how to downgrade. I think it may have been a version installed without the C++ extensions.

Any ideas on what might be wrong here, and what I can do to get Apex working on my machine? (I have already tried reinstalling it a couple of times.)
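For anyone debugging a similar crash, a useful first step (not from this thread, just a hypothetical sanity check) is to confirm that a plain backward pass works in the same environment with no Apex and no DDP involved. If this minimal script also dies with the SystemError, the PyTorch install itself is broken; if it passes, the problem is isolated to Apex's DDP/amp path, as the reporter observed.

```python
# Minimal sanity check: one backward pass with stock PyTorch,
# no Apex and no distributed launcher involved.
import torch
import torch.nn as nn

model = nn.Linear(8, 2)
x = torch.randn(4, 8)
loss = model(x).sum()
loss.backward()  # the SystemError in the thread is raised inside this call

# After backward, every parameter should carry a gradient.
assert all(p.grad is not None for p in model.parameters())
print("plain autograd OK")
```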

michaelklachko avatar Feb 09 '21 14:02 michaelklachko

I am facing the same error while doing something similar.

mishra011 avatar May 27 '21 09:05 mishra011

I am facing the same error while doing something similar.

Have you solved this problem?

Yep, I fixed this by building and installing PyTorch from source.

v-nhandt21 avatar Jun 03 '21 09:06 v-nhandt21

Is there any way other than building PyTorch from source?

ZhiyuanChen avatar Jul 18 '21 16:07 ZhiyuanChen

I'm sure that the non-DDP version of Apex works fine and that PyTorch's own DDP works fine, but the DDP version of Apex doesn't. So the problem most likely lies in Apex's DDP.

LZleejean avatar Aug 23 '21 11:08 LZleejean

Has anyone solved this issue? I also ran into it using the SageMaker container 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker

ZhangzihanGit avatar Mar 07 '22 23:03 ZhangzihanGit

Has anyone solved this issue? I also ran into it using the SageMaker container 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker

Apex is no longer maintained for newer CUDA versions. I think we should move to PyTorch's native DDP, which is much easier to use.

https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
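The migration suggested above can be sketched roughly as follows: replace `apex.parallel.DistributedDataParallel` with `torch.nn.parallel.DistributedDataParallel` and the `amp.initialize`/opt_level machinery with `torch.cuda.amp.autocast` plus `GradScaler`. This is a hypothetical skeleton (the model, loss, and `train_step` names are stand-ins, not the thread's actual main_amp.py):

```python
# Sketch: native torch DDP + torch.cuda.amp in place of Apex.
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(model, criterion, optimizer, scaler, inputs, targets,
               use_amp=True):
    """One optimizer step with optional mixed precision."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()  # replaces the amp.scale_loss(...) context
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()

def main(local_rank):
    torch.distributed.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    model = nn.Linear(8, 2).cuda(local_rank)      # stand-in for resnet18
    model = DDP(model, device_ids=[local_rank])   # native DDP, not apex.parallel
    criterion = nn.MSELoss().cuda(local_rank)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.cuda.amp.GradScaler()          # replaces opt_level flags
    x = torch.randn(4, 8, device=local_rank)
    y = torch.randn(4, 2, device=local_rank)
    train_step(model, criterion, optimizer, scaler, x, y)
```

The script would be launched the same way as the Apex example, e.g. with `torchrun --nproc_per_node=2` (or the older `python -m torch.distributed.launch`). Note that `GradScaler(enabled=False)` makes `train_step` a plain full-precision step, so the same code also runs single-device on CPU.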

v-nhandt21 avatar Mar 08 '22 02:03 v-nhandt21