RuntimeError When Enabling Accuracy Checks in DALLE2_pytorch Training on GPU.
Issue Description
I encounter a RuntimeError related to gradient computation when enabling accuracy checks while training DALLE2_pytorch in a GPU Docker environment. The training runs without issues when the --accuracy flag is not used.
Steps to Reproduce
python install.py DALLE2_pytorch
python run.py DALLE2_pytorch -d cuda -t train --accuracy
Expected Behavior
The training should run and perform the accuracy check without raising a runtime error.
Actual Behavior
The script executes successfully without the --accuracy flag. However, when the accuracy check is enabled, it fails with the following error message:
fp64 golden ref were not generated for DALLE2_pytorch. Setting accuracy check to cosine
element 0 of tensors does not require grad and does not have a grad_fn
Traceback (most recent call last):
File "/benchmark/torchbenchmark/util/env_check.py", line 635, in check_accuracy
correct_result = run_n_iterations(
File "/benchmark/torchbenchmark/util/env_check.py", line 504, in run_n_iterations
_model_iter_fn(mod, inputs, contexts, optimizer, collect_outputs=False)
File "/benchmark/torchbenchmark/util/env_check.py", line 497, in _model_iter_fn
return forward_and_backward_pass(
File "/benchmark/torchbenchmark/util/env_check.py", line 480, in forward_and_backward_pass
DummyGradScaler().scale(loss).backward(retain_graph=True)
File "/venv_cuda/pytorch/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
torch.autograd.backward(
File "/venv_cuda/pytorch/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
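For context, autograd raises this error whenever .backward() is called on a tensor that has no grad_fn, e.g. a loss computed under torch.no_grad() or otherwise detached from the graph. The following minimal sketch (standalone PyTorch, not the TorchBench harness) reproduces the same message:

```python
import torch

x = torch.randn(4, requires_grad=True)

with torch.no_grad():
    # Computed without autograd tracking, so the result has
    # requires_grad=False and no grad_fn.
    loss = (x * 2).sum()

# Raises: RuntimeError: element 0 of tensors does not require grad
# and does not have a grad_fn
loss.backward()
```

This suggests the loss returned by DALLE2_pytorch's training iteration has no grad_fn when it reaches the accuracy-check harness.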
Additional Context
PyTorch version: 2.2.2
CUDA version: 12.4.0.041
I can confirm that this can be reproduced in the Docker environment. @FindHao Can you help take a look at this issue?
@xuzhao9 The problem also occurs on the previous version of TorchBench (ghcr.io/pytorch/torchbench:dev20230619), so it appears to have existed since DALLE2 was first included in TorchBench. I'm not sure whether we can fix it on our side or whether it needs a change in the upstream repo, since we have limited control over the model's __init__.py. I'll give it a try.
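One possible TorchBench-side workaround would be to guard the backward call shown in the traceback so the failure is reported more clearly. This is only a sketch under the assumption that the surrounding code matches the traceback; DummyGradScaler, loss, and retain_graph=True come from the traceback above, everything else is illustrative:

```python
import warnings

# Hypothetical guard around the call in forward_and_backward_pass
# (torchbenchmark/util/env_check.py). Not the actual repository code.
scaled_loss = DummyGradScaler().scale(loss)
if scaled_loss.requires_grad:
    scaled_loss.backward(retain_graph=True)
else:
    # The DALLE2_pytorch loss appears to have no grad_fn here, so a backward
    # pass cannot run; warn and skip instead of letting autograd raise the
    # generic RuntimeError.
    warnings.warn(
        "Loss does not require grad; skipping backward pass in accuracy check."
    )
```

Whether skipping the backward pass is acceptable for the accuracy comparison would still need to be decided; fixing the root cause in the model's training loop would be preferable.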
We are dropping DALLE2_pytorch because it does not support numpy 2.0: https://github.com/pytorch/benchmark/pull/2311