Enabling Apex causes Pytorch autograd to crash
I installed apex with pip and tried running the main_amp.py example. I'm getting this error:

```
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f29b2420bd0> returned NULL without setting an error
```

Full command and output:
```
CUDA_VISIBLE_DEVICES=7,1 python -m torch.distributed.launch --nproc_per_node=2 --master_port=$RANDOM main_amp.py --data_dir /mnt/ssd2tb/imagenet -b 128

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
opt_level = None keep_batchnorm_fp32 = None <class 'NoneType'> loss_scale = None <class 'NoneType'>
CUDNN VERSION: 7603
opt_level = None keep_batchnorm_fp32 = None <class 'NoneType'> loss_scale = None <class 'NoneType'>
CUDNN VERSION: 7603
=> creating model 'resnet18'
=> creating model 'resnet18'
main_amp.py:42: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /opt/conda/conda-bld/pytorch_1607370141920/work/torch/csrc/utils/tensor_numpy.cpp:141.)
  tensor[i] += torch.from_numpy(nump_array)
[the same warning is printed by each remaining data-loader worker]
Traceback (most recent call last):
  File "main_amp.py", line 547, in <module>
    main()
  File "main_amp.py", line 247, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "main_amp.py", line 357, in train
    loss.backward()
  File "/home/mklachko/miniconda3/envs/pt/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/mklachko/miniconda3/envs/pt/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7fbf43f9cbd0> returned NULL without setting an error
[the second process prints an identical traceback, at object address 0x7f29b2420bd0]
Traceback (most recent call last):
  File "/home/mklachko/miniconda3/envs/pt/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/mklachko/miniconda3/envs/pt/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/mklachko/miniconda3/envs/pt/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/mklachko/miniconda3/envs/pt/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/mklachko/miniconda3/envs/pt/bin/python', '-u', 'main_amp.py', '--local_rank=1', '--data_dir', '/mnt/ssd2tb/imagenet', '-b', '128']' returned non-zero exit status 1.

(pt) mklachko@server:~/$ conda list apex
# packages in environment at /home/mklachko/miniconda3/envs/pt:
#
# Name                    Version                   Build  Channel
apex                      0.1                      pypi_0    pypi
```
@mcarilli this happens regardless of the opt_level (the trace above is from not using any opt_level, just from using the Apex version of DDP). The single-GPU (non-DDP) version of this script works fine, and the original PyTorch DDP works fine (this version: https://github.com/pytorch/examples/blob/master/imagenet/main.py). I only started getting this error after updating Apex to the latest version; an older version (~6 months old) worked fine, but I don't know which one it was or how to downgrade. I think it might have been a version installed without the C++ extensions.
Any ideas on what might be wrong here, and what I can do to get Apex to work on my machine? I have already tried reinstalling it a couple of times.
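For context, the failing path is the Apex amp + Apex DDP combination. Below is a condensed, self-contained sketch of the pattern main_amp.py follows, not the script itself: dummy data stands in for the ImageNet loader, and opt_level 'O0' (pure FP32) stands in for the reporter's no-mixed-precision run. It assumes the same two-process torch.distributed.launch command as above.

```python
import argparse
import torch
import torchvision.models as models
import apex
from apex import amp

# torch.distributed.launch passes --local_rank to each process.
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')

model = models.resnet18().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss().cuda()

# 'O0' keeps everything FP32; the reporter saw the crash even without mixed precision.
model, optimizer = amp.initialize(model, optimizer, opt_level='O0')

# Apex's flavor of DDP -- the piece the crash seems tied to.
model = apex.parallel.DistributedDataParallel(model)

# One dummy training step in place of the real ImageNet loader.
images = torch.randn(8, 3, 224, 224, device='cuda')
target = torch.randint(0, 1000, (8,), device='cuda')

optimizer.zero_grad()
loss = criterion(model(images), target)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()   # <- this is where run_backward returns NULL
optimizer.step()
```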
I am facing the same error doing something similar.
Have you solved this problem?
Yep, I fixed this by installing PyTorch from source.
Is there any other way besides building PyTorch from source?
I'm sure the non-DDP version of apex works fine and the PyTorch DDP version works fine, but the DDP version of apex doesn't. So the problem probably lies in apex's DDP.
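One way to narrow that down is to keep apex.amp but swap Apex's DDP wrapper for the native torch.nn.parallel.DistributedDataParallel, which the apex docs support. A minimal sketch of that variant, assuming the same two-process launch command as above (hyperparameters are placeholders):

```python
import argparse
import torch
import torchvision.models as models
from apex import amp

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')

model = models.resnet18().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Per the apex docs, amp.initialize must run before wrapping the model in DDP.
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

# Native DDP instead of apex.parallel.DistributedDataParallel.
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[args.local_rank], output_device=args.local_rank)
```

If this runs while the apex.parallel.DistributedDataParallel version crashes, that points squarely at Apex's DDP wrapper rather than at amp.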
Has anyone solved this issue? I also ran into it using the SageMaker container 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker.
Apex is not maintained for new versions of CUDA; I think we should move to PyTorch's native DDP, which is much easier to use:
https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
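For anyone migrating, here is a minimal single-step sketch of the native replacement: torch.nn.parallel.DistributedDataParallel plus torch.cuda.amp (available since PyTorch 1.6) covers what apex.parallel.DistributedDataParallel and apex.amp were doing. Model, data, and hyperparameters are placeholders.

```python
import argparse
import torch
import torchvision.models as models
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.cuda.amp import GradScaler, autocast

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')

# Native DDP replaces apex.parallel.DistributedDataParallel.
model = DDP(models.resnet18().cuda(),
            device_ids=[args.local_rank], output_device=args.local_rank)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss().cuda()
scaler = GradScaler()  # replaces apex's loss scaling

# One dummy training step in place of a real data loader.
images = torch.randn(8, 3, 224, 224, device='cuda')
target = torch.randint(0, 1000, (8,), device='cuda')

optimizer.zero_grad()
with autocast():                       # replaces amp.initialize(opt_level='O1')
    loss = criterion(model(images), target)
scaler.scale(loss).backward()          # replaces amp.scale_loss(...)
scaler.step(optimizer)
scaler.update()
```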