PVDM
Code can't adapt to a different number of timesteps
The repo has a few hardcoded things that make it difficult to use with different settings, like a different resolution or number of timesteps. I think I solved the resolution problem, also thanks to this issue. Now I'm really struggling with the timesteps (number of frames in a video) parameter.
Apparently, using a number of timesteps that is not a power of two (8, 16, 32) causes problems in the UNet (when concatenating residuals with the newly upsampled dimension).
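For illustration, here's a minimal toy example (my own, not the repo's code) of the kind of failure I mean, assuming the usual stride-2 downsample / 2x upsample pattern:

```python
import torch
import torch.nn as nn

# Toy example: a stride-2 downsample and a 2x upsample, as in a typical
# UNet level. Powers of two halve cleanly, but e.g. 10 frames go
# 10 -> 5 -> 3 on the way down, while the way up gives 3 -> 6, which no
# longer matches the residual of size 5 saved for the skip connection.
down = nn.Conv1d(4, 4, kernel_size=3, stride=2, padding=1)
up = nn.Upsample(scale_factor=2)

x = torch.randn(1, 4, 10)      # 10 "frames"
h1 = down(x)                   # [1, 4, 5], saved as residual
h2 = down(h1)                  # [1, 4, 3]
u = up(h2)                     # [1, 4, 6] != [1, 4, 5]
torch.cat([u, h1], dim=1)      # RuntimeError: sizes must match except dim 1
```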
I managed to train the AE with timesteps 8 and resolution 128, so it now produces embeddings of dim [1, 4, 1536], one for the noisy frames and one for the conditioning frames. I also had to change the code in the UNet that is marked with a TODO:
```python
# TODO: treat 32 and 16 as variables
h_xy = h[:, :, 0:32*32].view(h.size(0), h.size(1), 32, 32)
h_yt = h[:, :, 32*32:32*(32+16)].view(h.size(0), h.size(1), 16, 32)
h_xt = h[:, :, 32*(32+16):32*(32+16+16)].view(h.size(0), h.size(1), 16, 32)
```
So I defined a variable n2 = 32 and n = n2 // 2 to replace the raw numbers. To use timesteps 8 I set n2 to 16, which I'm not sure is correct, but if 32 was right for timesteps 16, then the same scaling should hold.
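For reference, here's the parameterized version I have in mind, as a standalone sketch (the names `s` and `t` are mine, not the repo's). One thing the arithmetic suggests: with s=32, t=16 the three planes sum to 2048, while with s=32, t=8 they sum to 1536, which matches the AE embedding above, so possibly only the 16 should change and the 32 should stay.

```python
import torch

# Hypothetical parameterization (s and t are my names): s = spatial size of
# the latent planes, t = temporal size (number of frames).
#   s=32, t=16: s*s + 2*t*s = 2048  (reproduces the original hardcoded slices)
#   s=32, t=8 : s*s + 2*t*s = 1536  (matches the AE embedding above)
s, t = 32, 8
h = torch.randn(1, 4, s * (s + 2 * t))  # stand-in for the UNet activation

h_xy = h[:, :, 0:s*s].view(h.size(0), h.size(1), s, s)
h_yt = h[:, :, s*s:s*(s+t)].view(h.size(0), h.size(1), t, s)
h_xt = h[:, :, s*(s+t):s*(s+t+t)].view(h.size(0), h.size(1), t, s)
assert h_xy.shape == (1, 4, 32, 32)
assert h_yt.shape == (1, 4, 8, 32) and h_xt.shape == (1, 4, 8, 32)
```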
The problem now is that the forward pass of the UNet produces a tensor of shape [1, 4, 512], so there's a dimension mismatch when trying to compute the loss. I'm referring to the code in the function:
```python
def p_losses(self, x_start, cond, t, noise=None):
    noise = default(noise, lambda: torch.randn_like(x_start))
    x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)
    model_out = self.model(x_noisy, cond, t)
    ...
    loss = self.get_loss(model_out, target, mean=False).mean(dim=[1, 2])
```
which fails with the following error:

```
RuntimeError: The size of tensor a (1536) must match the size of tensor b (512) at non-singleton dimension 2
```
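For what it's worth, the two sizes in the error decompose consistently with the split above (my arithmetic, under the same spatial/temporal reading):

```python
# Target from the AE:        32*32 + 2*8*32 = 1536  (spatial 32, temporal 8)
# UNet output (n2=16, n=8):  16*16 + 2*8*16 = 512   (spatial 16, temporal 8)
# i.e. the spatial 32 appears to have been halved along with the temporal 16.
assert 32*32 + 2*8*32 == 1536
assert 16*16 + 2*8*16 == 512
```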
@sihyun-yu Did I miss anything else that should be changed in order to make this code "timestep adaptive"?