Recent PyTorch nightly builds break the DALLE2_pytorch model in Torchbench
🐛 Describe the bug
Example run: https://github.com/pytorch/benchmark/actions/runs/3896093654/jobs/6671547864
This is reproducible on the PyTorch 20230112 nightly with CUDA 11.6 (fbbb19599a1d162e5927542ed251fd2ba63d5163):
$ python run.py DALLE2_pytorch -d cuda -t train
Running train method from DALLE2_pytorch on cuda in eager mode with input batch size 4.
Traceback (most recent call last):
File "/fsx/users/xzhao9/benchmark/run.py", line 338, in <module>
run_one_step(test, model=m, export_metrics_file=export_metrics_file,
File "/fsx/users/xzhao9/benchmark/run.py", line 96, in run_one_step
func()
File "/fsx/users/xzhao9/benchmark/torchbenchmark/util/model.py", line 277, in invoke
self.train()
File "/fsx/users/xzhao9/benchmark/torchbenchmark/models/DALLE2_pytorch/__init__.py", line 104, in train
loss.backward()
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
This problem exists in 20230101, but doesn't exist in 20221222 (https://github.com/pytorch/benchmark/actions/runs/3760224486/jobs/6390700236).
I suspect this is the root cause: https://github.com/pytorch/pytorch/pull/91029, since it touches the same file that shows up in the error, landed between these two dates, and is flagged as BC-breaking.
Versions
Between the 20221222 and 20230101 nightly releases.
@albanD sounds familiar?
So this is the sort of classic problem where we changed the striding behavior of an op, and now a view() that used to work no longer does. The one thing that makes me and @albanD wonder a little is that it happened in backwards, which kind of implies it's our fault (and not yours; in particular, there's nowhere in your code where you could update a view into a reshape). But I guess we could find the spot in PyTorch itself and make the update.
That being said, I am unable to repro this problem on torchbench. PyTorch commit hash: fa3841ffd49ac3613bc5d1889ca2546004da425e, torchbench commit hash: bbdc777837bf0c02b00e20b9178a0f6a9595bda4, CUDA 11.4.
can you rerun with anomaly mode?
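For reference, a minimal sketch of what running under anomaly mode looks like (toy model and shapes, not the benchmark harness; run.py may expose its own way to turn this on):

import torch

# Anomaly detection makes a backward error carry a traceback of the forward
# call that produced the offending gradient.
model = torch.nn.Linear(8, 8)
x = torch.randn(4, 8)
with torch.autograd.detect_anomaly():
    loss = model(x).sum()
    loss.backward()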
@ezyang
Can you try re-running python install.py DALLE2_pytorch to reinstall the deps?
Another option is to try to reproduce it in our Docker image: xzhao9/gcp-a100-runner-dind:latest, which should be the same as the CI environment.
I reran with anomaly mode and got more output:
$ python run.py DALLE2_pytorch -d cuda -t train
Running train method from DALLE2_pytorch on cuda in eager mode with input batch size 4.
/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/autograd/__init__.py:197: UserWarning: Error detected in NativeGroupNormBackward0. Traceback of forward call that caused the error:
File "/fsx/users/xzhao9/benchmark/run.py", line 338, in <module>
run_one_step(test, model=m, export_metrics_file=export_metrics_file,
File "/fsx/users/xzhao9/benchmark/run.py", line 96, in run_one_step
func()
File "/fsx/users/xzhao9/benchmark/torchbenchmark/util/model.py", line 277, in invoke
self.train()
File "/fsx/users/xzhao9/benchmark/torchbenchmark/models/DALLE2_pytorch/__init__.py", line 104, in train
loss = decoder(self.sample_images, text=self.sample_text, unet_number=1)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
return forward_call(*args, **kwargs)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/dalle2_pytorch/dalle2_pytorch.py", line 3266, in forward
losses = self.p_losses(unet, image, times, image_embed = image_embed, text_encodings = text_encodings, lowres_cond_img = lowres_cond_img, predict_x_start = predict_x_start, predict_v = predict_v, learned_variance = learned_variance, is_latent_diffusion = is_latent_diffusion, noise_scheduler = noise_scheduler, lowres_noise_level = lowres_noise_level)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/dalle2_pytorch/dalle2_pytorch.py", line 3049, in p_losses
unet_output = unet(
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
return forward_call(*args, **kwargs)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/dalle2_pytorch/dalle2_pytorch.py", line 2349, in forward
x = resnet_block(x, t, c)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
return forward_call(*args, **kwargs)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/dalle2_pytorch/dalle2_pytorch.py", line 1657, in forward
h = self.block1(x, scale_shift = scale_shift)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
return forward_call(*args, **kwargs)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/dalle2_pytorch/dalle2_pytorch.py", line 1601, in forward
x = self.norm(x)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
return forward_call(*args, **kwargs)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 273, in forward
return F.group_norm(
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/functional.py", line 2530, in group_norm
return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "/fsx/users/xzhao9/benchmark/run.py", line 338, in <module>
run_one_step(test, model=m, export_metrics_file=export_metrics_file,
File "/fsx/users/xzhao9/benchmark/run.py", line 96, in run_one_step
func()
File "/fsx/users/xzhao9/benchmark/torchbenchmark/util/model.py", line 277, in invoke
self.train()
File "/fsx/users/xzhao9/benchmark/torchbenchmark/models/DALLE2_pytorch/__init__.py", line 105, in train
loss.backward()
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
The error happens in GroupNormBackward, where dY is somehow becoming non-contiguous and channels-last (sizes and strides below):
sizes [4, 256, 64, 64], strides [1048576, 1, 16384, 256]
and then fails in
dY.view({N * G, D, HxW})
To paper over the problem we can call .contiguous() on dY, but it's interesting why the behavior changed.
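For concreteness, a small sketch of the failing view (G = 8 is an assumed group count; only the shape and strides come from the log above):

import torch

# A channels-last gradient with the logged shape cannot be view()'ed into
# (N*G, D, HxW); a contiguous copy of it can, which is what the .contiguous()
# workaround relies on.
N, C, H, W = 4, 256, 64, 64
G = 8          # assumed number of groups, not taken from the model
D = C // G
dY = torch.randn(N, C, H, W).to(memory_format=torch.channels_last)
print(dY.stride())  # (1048576, 1, 16384, 256), matching the failing run

try:
    dY.view(N * G, D, H * W)  # raises "view size is not compatible with input tensor's size and stride"
except RuntimeError as e:
    print(e)

dY.contiguous().view(N * G, D, H * W)  # legal after materializing a contiguous copy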
That looks like a channels-last Tensor?
Yes it does, but it's not a channels-last run, and all other tensors (e.g. the input, or the gradients of other GroupNorm instances) are contiguous.
Well, in the model, the next layer in some cases is https://github.com/lucidrains/einops-exts/blob/main/einops_exts/torch.py. This seems to be a pretty complex permute, so the gradient for it will also get permuted. That can very easily lead to a non-contiguous gradient then being sent to GroupNorm.
@xuzhao9 could it just be an update to DALLE itself? They moved to version 1.12 in this timeframe: https://github.com/lucidrains/DALLE2-pytorch/commits/main
It's not the gradient getting permuted; it happens in forward, where this EinOps permute makes the input to the convolution channels-last in some cases. This likely leads to a channels-last convolution output in forward (I didn't verify, but that should happen) that's probably unpermuted somewhere down the road, and it definitely leads to a channels-last gradInput produced by the convolution, which then becomes GroupNorm's gradOutput.
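A hedged sketch of that layout chain (illustrative shapes; whether the gradient comes back channels-last depends on the convolution backend):

import torch
import torch.nn as nn

# When a convolution's input is channels-last, the gradInput it produces in
# backward is typically channels-last as well, and that gradInput is exactly
# what the preceding layer (GroupNorm in the model) receives as gradOutput.
device = "cuda" if torch.cuda.is_available() else "cpu"
conv = nn.Conv2d(256, 256, 3, padding=1).to(device)
x = torch.randn(4, 256, 64, 64, device=device).to(memory_format=torch.channels_last)
x.requires_grad_()
conv(x).sum().backward()
print(x.grad.is_contiguous())                                   # often False
print(x.grad.is_contiguous(memory_format=torch.channels_last))  # often True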
@xuzhao9 could it just be an update to DALLE itself? They moved to version 1.12 in this timeframe: https://github.com/lucidrains/DALLE2-pytorch/commits/main
@ngimel I have tried updating to their latest commit (984d62a37), and the issue still exists.
Right, but did it exist before DALLE==1.12? @ezyang could not repro it, and there were meaningful changes between the earlier DALLE versions and 1.12. From what I see, it's not related to scatter ops (that's the PR you tentatively flagged); it's convolutions + permutations + GroupNorm, which didn't really change. I've tried turning off the cuDNN v8 APIs (even though the switch to cuDNN v8 by default landed a day before the suspicious range), but that didn't help either.
@ngimel I tried dalle2_pytorch==1.11.4, and got the same error:
$ python -c "import dalle2_pytorch; print(dalle2_pytorch.__version__)"
1.11.4
$ python run.py DALLE2_pytorch -d cuda -t train
Traceback (most recent call last):
File "/fsx/users/xzhao9/benchmark/run.py", line 338, in <module>
run_one_step(test, model=m, export_metrics_file=export_metrics_file,
File "/fsx/users/xzhao9/benchmark/run.py", line 96, in run_one_step
func()
File "/fsx/users/xzhao9/benchmark/torchbenchmark/util/model.py", line 277, in invoke
self.train()
File "/fsx/users/xzhao9/benchmark/torchbenchmark/models/DALLE2_pytorch/__init__.py", line 105, in train
loss.backward()
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
I also tested dalle2_pytorch==1.11.0; the result is the same.
Perhaps this PR is suspicious? https://github.com/pytorch/pytorch/commit/5030929c5d124e68e4d73b8933973f754d2d5d43
Oh indeed, this is suspicious
I think this line is the problem https://github.com/pytorch/pytorch/pull/89485/files#diff-b5faaeef4cddee9a195a6ca3c652be163f38d4fc1b31d0b42ed5944cb41ab67fR138, @xuzhao9 can you try either reverting that PR or just that line and see if it fixes the problem?
And also this change https://github.com/pytorch/pytorch/pull/89485/files#diff-e4c2f99a2404e98c3586e07425da73008f36b1bada790648a7297af141d37f8cL1171 doesn't work for GPU
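The difference is easy to see in isolation (illustrative shapes; this is just a sketch of the two .contiguous variants, not the actual backward path):

import torch

# For a channels-last gradient, .contiguous(memory_format=<the suggested format>)
# keeps the NHWC strides, so the CUDA kernel still sees memory it cannot
# view() as (N*G, D, HxW); plain .contiguous() restores standard NCHW strides.
g = torch.randn(4, 256, 64, 64).to(memory_format=torch.channels_last)
print(g.stride())                                                # (1048576, 1, 16384, 256)
print(g.contiguous(memory_format=torch.channels_last).stride())  # unchanged, still channels-last
print(g.contiguous().stride())                                   # (1048576, 4096, 64, 1)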
I've traced the regression to between the Dec 29th and Dec 30th nightlies.
Perhaps this PR is suspicious? 5030929
So the Dec 29th nightly was cut from 3d8834bdbf7f5da1163fd7ac543728779b557d29, which happened before that commit, so indeed it seems suspicious.
Yeah, the author modified the CPU implementation only, but made changes to the common path, so now the GPU is getting non-contiguous gradients where previously it was guaranteed to get contiguous ones only. Where are the tests??
Yeah, this is sad. And it looks like reverting the change indeed fixes the issue. Will revert and add a GPU test so that we don't regress the next time the change is relanded.
OK, the following small change fixes the problem:
diff --git a/tools/autograd/derivatives.yaml b/tools/autograd/derivatives.yaml
index 10a008080e6..3e9b369eaa2 100644
--- a/tools/autograd/derivatives.yaml
+++ b/tools/autograd/derivatives.yaml
@@ -1168,7 +1168,8 @@
rstd: not_implemented("native_layer_norm_backward rstd")
- name: native_group_norm(Tensor input, Tensor? weight, Tensor? bias, SymInt N, SymInt C, SymInt HxW, int group, float eps) -> (Tensor, Tensor, Tensor)
- input, weight, bias: "GradMode::is_enabled() || grads[1].defined() || grads[2].defined() ? infinitely_differentiable_native_group_norm_backward(grads[0], grads[1], grads[2], input, result1, result2, weight, N, C, HxW, group, eps, grad_input_mask) : (grads[0].defined() ? native_group_norm_backward_symint(grads[0].device().is_xpu() ? grads[0] : grads[0].contiguous(grads[0].suggest_memory_format()), input.device().is_xpu() ? input : input.contiguous(input.suggest_memory_format()), result1, result2, weight, N, C, HxW, group, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>())"
+ input, weight, bias: "GradMode::is_enabled() || grads[1].defined() || grads[2].defined() ? infinitely_differentiable_native_group_norm_backward(grads[0], grads[1], grads[2], input, result1, result2, weight, N, C, HxW, group, eps, grad_input_mask) : (grads[0].defined() ? native_group_norm_backward_symint(grads[0].device().is_xpu() ? grads[0] : grads[0].contiguous(), input.device().is_xpu() ? input : input.contiguous(), result1, result2, weight, N, C, HxW, group, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>())"
result0: group_norm_jvp(input_p, input_t, weight_p, weight_t, bias_p, bias_t, result1, result2, group)
result1: group_norm_mean_jvp(input_t, result1, group)
result2: group_norm_invstd_jvp(input_p, input_t, result1, result2, group)
And here is a simple reproducer for the problem (thanks @ngimel for the tip about requires_grad for the input tensor):
import torch

def test_group_norm_backward(device="cuda"):
    B, C, W, H = 2, 4, 4, 4
    net = torch.nn.GroupNorm(B, C).to(device=device)
    x = torch.rand(B, C, W, H, device=device, requires_grad=True)
    y = net(x)
    # backprop a channels-last gradient into a GroupNorm whose input was contiguous
    y.backward(torch.rand(B, C, W, H, device=device).to(memory_format=torch.channels_last))
Fun fact: GroupNorm backward on CPU never errors out, but will probably yield invalid results if the input tensor and gradient memory formats are not aligned.
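One hedged way to check that on CPU (a sketch reusing the reproducer's shapes; a False here would confirm the silent-wrong-result case):

import torch

# Compare CPU GroupNorm input gradients for a contiguous vs. a channels-last
# incoming gradient with identical values.
B, C, W, H = 2, 4, 4, 4
net = torch.nn.GroupNorm(B, C)
g = torch.rand(B, C, W, H)

x1 = torch.rand(B, C, W, H, requires_grad=True)
net(x1).backward(g)

x2 = x1.detach().clone().requires_grad_()
net(x2).backward(g.to(memory_format=torch.channels_last))

print(torch.allclose(x1.grad, x2.grad))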
@malfet We didn't consider the case where the input and gradient memory formats are not aligned. Can we move the .contiguous to the native_group_norm_backward function in aten/src/ATen/native/group_norm.cpp and align the memory formats like https://github.com/pytorch/pytorch/pull/92668, or use the input's format for both grads[0] and input:
native_group_norm_backward_symint(grads[0].device().is_xpu() ? grads[0] : grads[0].contiguous(grads[0].device().is_cpu() ? input.suggest_memory_format() : c10::MemoryFormat::Contiguous), input.device().is_xpu() ? input : input.contiguous(input.device().is_cpu() ? input.suggest_memory_format() : c10::MemoryFormat::Contiguous), result1, result2, weight, N, C, HxW, group, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>())
which might pass the skipped part of the test in https://github.com/pytorch/pytorch/pull/92671