Recent PyTorch nightly builds break the DALLE2_pytorch model in Torchbench
🐛 Describe the bug
Example run: https://github.com/pytorch/benchmark/actions/runs/3896093654/jobs/6671547864
This is reproducible on the PyTorch 20230112 nightly with CUDA 11.6 (fbbb19599a1d162e5927542ed251fd2ba63d5163):
$ python run.py DALLE2_pytorch -d cuda -t train
Running train method from DALLE2_pytorch on cuda in eager mode with input batch size 4.
Traceback (most recent call last):
File "/fsx/users/xzhao9/benchmark/run.py", line 338, in <module>
run_one_step(test, model=m, export_metrics_file=export_metrics_file,
File "/fsx/users/xzhao9/benchmark/run.py", line 96, in run_one_step
func()
File "/fsx/users/xzhao9/benchmark/torchbenchmark/util/model.py", line 277, in invoke
self.train()
File "/fsx/users/xzhao9/benchmark/torchbenchmark/models/DALLE2_pytorch/__init__.py", line 104, in train
loss.backward()
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
This problem exists in 20230101, but doesn't exist in 20221222 (https://github.com/pytorch/benchmark/actions/runs/3760224486/jobs/6390700236).
I suspect this is the root cause: https://github.com/pytorch/pytorch/pull/91029, since it touches the same file that shows up in the error, landed between these two dates, and is flagged as BC-breaking.
Versions
Between the 20221222 and 20230101 nightly releases.
@albanD sounds familiar?
So this is the sort of classic problem where we changed the striding behavior of an op, and now a view() that used to work no longer does. The one thing that makes me and @albanD wonder a little is that it happened in backwards, which kind of implies it's our fault (and not yours; in particular, there's nowhere in your code where you could update a view into a reshape). But I guess we could find the spot in PyTorch itself and make the update.
That being said, I am unable to repro this problem on torchbench. PyTorch commit hash: fa3841ffd49ac3613bc5d1889ca2546004da425e, torchbench commit hash: bbdc777837bf0c02b00e20b9178a0f6a9595bda4, CUDA 11.4.
can you rerun with anomaly mode?
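For reference, a minimal sketch of what running under anomaly mode looks like (toy model and shapes, not the benchmark harness; run.py may expose its own way to turn this on):

import torch

# Anomaly detection makes a backward error carry a traceback of the forward
# call that produced the offending gradient.
model = torch.nn.Linear(8, 8)
x = torch.randn(4, 8)
with torch.autograd.detect_anomaly():
    loss = model(x).sum()
    loss.backward()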
@ezyang
Can you try re-running python install.py DALLE2_pytorch to reinstall the deps?
Another option is to try to reproduce it in our Docker image: xzhao9/gcp-a100-runner-dind:latest, which should be the same as the CI environment.
I reran with anomaly mode and got more output:
$ python run.py DALLE2_pytorch -d cuda -t train
Running train method from DALLE2_pytorch on cuda in eager mode with input batch size 4.
/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/autograd/__init__.py:197: UserWarning: Error detected in NativeGroupNormBackward0. Traceback of forward call that caused the error:
File "/fsx/users/xzhao9/benchmark/run.py", line 338, in <module>
run_one_step(test, model=m, export_metrics_file=export_metrics_file,
File "/fsx/users/xzhao9/benchmark/run.py", line 96, in run_one_step
func()
File "/fsx/users/xzhao9/benchmark/torchbenchmark/util/model.py", line 277, in invoke
self.train()
File "/fsx/users/xzhao9/benchmark/torchbenchmark/models/DALLE2_pytorch/__init__.py", line 104, in train
loss = decoder(self.sample_images, text=self.sample_text, unet_number=1)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
return forward_call(*args, **kwargs)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/dalle2_pytorch/dalle2_pytorch.py", line 3266, in forward
losses = self.p_losses(unet, image, times, image_embed = image_embed, text_encodings = text_encodings, lowres_cond_img = lowres_cond_img, predict_x_start = predict_x_start, predict_v = predict_v, learned_variance = learned_variance, is_latent_diffusion = is_latent_diffusion, noise_scheduler = noise_scheduler, lowres_noise_level = lowres_noise_level)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/dalle2_pytorch/dalle2_pytorch.py", line 3049, in p_losses
unet_output = unet(
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
return forward_call(*args, **kwargs)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/dalle2_pytorch/dalle2_pytorch.py", line 2349, in forward
x = resnet_block(x, t, c)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
return forward_call(*args, **kwargs)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/dalle2_pytorch/dalle2_pytorch.py", line 1657, in forward
h = self.block1(x, scale_shift = scale_shift)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
return forward_call(*args, **kwargs)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/dalle2_pytorch/dalle2_pytorch.py", line 1601, in forward
x = self.norm(x)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
return forward_call(*args, **kwargs)
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 273, in forward
return F.group_norm(
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/functional.py", line 2530, in group_norm
return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "/fsx/users/xzhao9/benchmark/run.py", line 338, in <module>
run_one_step(test, model=m, export_metrics_file=export_metrics_file,
File "/fsx/users/xzhao9/benchmark/run.py", line 96, in run_one_step
func()
File "/fsx/users/xzhao9/benchmark/torchbenchmark/util/model.py", line 277, in invoke
self.train()
File "/fsx/users/xzhao9/benchmark/torchbenchmark/models/DALLE2_pytorch/__init__.py", line 105, in train
loss.backward()
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
The error happens in GroupNormBackward, where dY is somehow becoming non-contiguous and channels-last (sizes and strides below):
sizes [4, 256, 64, 64], strides [1048576, 1, 16384, 256]
and then fails in
dY.view({N * G, D, HxW})
To paper over the problem we can call .contiguous() on dY, but it's interesting why the behavior changed.
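For concreteness, a small sketch of the failing view (G = 8 is an assumed group count; only the shape and strides come from the log above):

import torch

# A channels-last gradient with the logged shape cannot be view()'ed into
# (N*G, D, HxW); a contiguous copy of it can, which is what the .contiguous()
# workaround relies on.
N, C, H, W = 4, 256, 64, 64
G = 8          # assumed number of groups, not taken from the model
D = C // G
dY = torch.randn(N, C, H, W).to(memory_format=torch.channels_last)
print(dY.stride())  # (1048576, 1, 16384, 256), matching the failing run

try:
    dY.view(N * G, D, H * W)  # raises "view size is not compatible with input tensor's size and stride"
except RuntimeError as e:
    print(e)

dY.contiguous().view(N * G, D, H * W)  # legal after materializing a contiguous copy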
That looks like a channels-last Tensor?
Yes it does, but it's not a channels-last run, and all other tensors (e.g. the input, or the gradients of other GroupNorm instances) are contiguous.
Well, in the model, the next layer in some cases is https://github.com/lucidrains/einops-exts/blob/main/einops_exts/torch.py. This seems to be a pretty complex permute, so the gradient for it will also get permuted. That can very easily lead to a non-contiguous gradient then being sent to GroupNorm.
@xuzhao9 could it just be an update to DALLE itself? They moved to version 1.12 in this timeframe: https://github.com/lucidrains/DALLE2-pytorch/commits/main
It's not the gradient getting permuted; it happens in forward, where this EinOps permute makes the input to the convolution channels-last in some cases. This likely leads to a channels-last convolution output in forward (I didn't verify, but that should happen) that's probably unpermuted somewhere down the road, and it definitely leads to a channels-last gradInput produced by the convolution, which then becomes GroupNorm's gradOutput.
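A hedged sketch of that layout chain (illustrative shapes; whether the gradient comes back channels-last depends on the convolution backend):

import torch
import torch.nn as nn

# When a convolution's input is channels-last, the gradInput it produces in
# backward is typically channels-last as well, and that gradInput is exactly
# what the preceding layer (GroupNorm in the model) receives as gradOutput.
device = "cuda" if torch.cuda.is_available() else "cpu"
conv = nn.Conv2d(256, 256, 3, padding=1).to(device)
x = torch.randn(4, 256, 64, 64, device=device).to(memory_format=torch.channels_last)
x.requires_grad_()
conv(x).sum().backward()
print(x.grad.is_contiguous())                                   # often False
print(x.grad.is_contiguous(memory_format=torch.channels_last))  # often True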
@xuzhao9 could it just be an update to DALLE itself? They moved to version 1.12 in this timeframe: https://github.com/lucidrains/DALLE2-pytorch/commits/main
@ngimel I have tried updating to their latest commit (984d62a37), and the issue still exists.
Right, but did it exist before DALLE==1.12? @ezyang could not repro it, and there were meaningful changes between the earlier DALLE versions and 1.12. From what I see, it's not related to scatter ops (that's the PR you tentatively flagged); it's convolutions + permutations + GroupNorm, which didn't really change. I've tried turning off the cuDNN v8 APIs (even though the switch to cuDNN v8 by default landed a day before the suspicious range), but that didn't help either.
@ngimel I tried dalle2_pytorch==1.11.4, and got the same error:
$ python -c "import dalle2_pytorch; print(dalle2_pytorch.__version__)"
1.11.4
$ python run.py DALLE2_pytorch -d cuda -t train
Traceback (most recent call last):
File "/fsx/users/xzhao9/benchmark/run.py", line 338, in <module>
run_one_step(test, model=m, export_metrics_file=export_metrics_file,
File "/fsx/users/xzhao9/benchmark/run.py", line 96, in run_one_step
func()
File "/fsx/users/xzhao9/benchmark/torchbenchmark/util/model.py", line 277, in invoke
self.train()
File "/fsx/users/xzhao9/benchmark/torchbenchmark/models/DALLE2_pytorch/__init__.py", line 105, in train
loss.backward()
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/data/home/xzhao9/cluster/miniconda3/envs/py310/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
I also tested dalle2_pytorch==1.11.0; the result is the same.
Perhaps this PR is suspicious? https://github.com/pytorch/pytorch/commit/5030929c5d124e68e4d73b8933973f754d2d5d43
Oh indeed, this is suspicious
I think this line is the problem https://github.com/pytorch/pytorch/pull/89485/files#diff-b5faaeef4cddee9a195a6ca3c652be163f38d4fc1b31d0b42ed5944cb41ab67fR138, @xuzhao9 can you try either reverting that PR or just that line and see if it fixes the problem?
And also this change https://github.com/pytorch/pytorch/pull/89485/files#diff-e4c2f99a2404e98c3586e07425da73008f36b1bada790648a7297af141d37f8cL1171 doesn't work for GPU
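The difference is easy to see in isolation (illustrative shapes; this is just a sketch of the two .contiguous variants, not the actual backward path):

import torch

# For a channels-last gradient, .contiguous(memory_format=<the suggested format>)
# keeps the NHWC strides, so the CUDA kernel still sees memory it cannot
# view() as (N*G, D, HxW); plain .contiguous() restores standard NCHW strides.
g = torch.randn(4, 256, 64, 64).to(memory_format=torch.channels_last)
print(g.stride())                                                # (1048576, 1, 16384, 256)
print(g.contiguous(memory_format=torch.channels_last).stride())  # unchanged, still channels-last
print(g.contiguous().stride())                                   # (1048576, 4096, 64, 1)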
I've traced the regression to between the Dec 29th and Dec 30th nightlies.
Perhaps this PR is suspicious? 5030929
So the Dec 29th nightly was cut from 3d8834bdbf7f5da1163fd7ac543728779b557d29, which happened before that commit, so indeed it seems suspicious.
Yeah, the author modified the CPU implementation only, but made changes to the common path, so now the GPU is getting non-contiguous gradients where previously it was guaranteed to get contiguous ones only. Where are the tests??
Yeah, this is sad. And it looks like reverting the change indeed fixes the issue. Will revert and add a GPU test so that we don't regress the next time the change is relanded.
OK, the following small change fixes the problem:
diff --git a/tools/autograd/derivatives.yaml b/tools/autograd/derivatives.yaml
index 10a008080e6..3e9b369eaa2 100644
--- a/tools/autograd/derivatives.yaml
+++ b/tools/autograd/derivatives.yaml
@@ -1168,7 +1168,8 @@
rstd: not_implemented("native_layer_norm_backward rstd")
- name: native_group_norm(Tensor input, Tensor? weight, Tensor? bias, SymInt N, SymInt C, SymInt HxW, int group, float eps) -> (Tensor, Tensor, Tensor)
- input, weight, bias: "GradMode::is_enabled() || grads[1].defined() || grads[2].defined() ? infinitely_differentiable_native_group_norm_backward(grads[0], grads[1], grads[2], input, result1, result2, weight, N, C, HxW, group, eps, grad_input_mask) : (grads[0].defined() ? native_group_norm_backward_symint(grads[0].device().is_xpu() ? grads[0] : grads[0].contiguous(grads[0].suggest_memory_format()), input.device().is_xpu() ? input : input.contiguous(input.suggest_memory_format()), result1, result2, weight, N, C, HxW, group, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>())"
+ input, weight, bias: "GradMode::is_enabled() || grads[1].defined() || grads[2].defined() ? infinitely_differentiable_native_group_norm_backward(grads[0], grads[1], grads[2], input, result1, result2, weight, N, C, HxW, group, eps, grad_input_mask) : (grads[0].defined() ? native_group_norm_backward_symint(grads[0].device().is_xpu() ? grads[0] : grads[0].contiguous(), input.device().is_xpu() ? input : input.contiguous(), result1, result2, weight, N, C, HxW, group, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>())"
result0: group_norm_jvp(input_p, input_t, weight_p, weight_t, bias_p, bias_t, result1, result2, group)
result1: group_norm_mean_jvp(input_t, result1, group)
result2: group_norm_invstd_jvp(input_p, input_t, result1, result2, group)
And here is a simple reproducer for the problem (thanks @ngimel for the tip about requires_grad for the input tensor):
import torch

def test_group_norm_backward(device="cuda"):
    B, C, W, H = 2, 4, 4, 4
    net = torch.nn.GroupNorm(B, C).to(device=device)
    x = torch.rand(B, C, W, H, device=device, requires_grad=True)
    y = net(x)
    # backprop a channels-last gradient into a GroupNorm whose input was contiguous
    y.backward(torch.rand(B, C, W, H, device=device).to(memory_format=torch.channels_last))
Fun fact: GroupNorm backward on CPU never errors out, but will probably yield invalid results if the input tensor and gradient memory formats are not aligned.
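One hedged way to check that on CPU (a sketch reusing the reproducer's shapes; a False here would confirm the silent-wrong-result case):

import torch

# Compare CPU GroupNorm input gradients for a contiguous vs. a channels-last
# incoming gradient with identical values.
B, C, W, H = 2, 4, 4, 4
net = torch.nn.GroupNorm(B, C)
g = torch.rand(B, C, W, H)

x1 = torch.rand(B, C, W, H, requires_grad=True)
net(x1).backward(g)

x2 = x1.detach().clone().requires_grad_()
net(x2).backward(g.to(memory_format=torch.channels_last))

print(torch.allclose(x1.grad, x2.grad))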
@malfet We didn't consider the case where the input and gradient memory formats are not aligned. Can we move the .contiguous to the native_group_norm_backward function in aten/src/ATen/native/group_norm.cpp and align the memory formats like https://github.com/pytorch/pytorch/pull/92668, or use the input's format for both grads[0] and input:
native_group_norm_backward_symint(grads[0].device().is_xpu() ? grads[0] : grads[0].contiguous(grads[0].device().is_cpu() ? input.suggest_memory_format() : c10::MemoryFormat::Contiguous), input.device().is_xpu() ? input : input.contiguous(input.device().is_cpu() ? input.suggest_memory_format() : c10::MemoryFormat::Contiguous), result1, result2, weight, N, C, HxW, group, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>())
which might pass the skipped part of the test in https://github.com/pytorch/pytorch/pull/92671