vit-pytorch
ViT MAE reconstruction size mismatch
I'm trying to train a ViT with masked autoencoder (MAE) pretraining, but I'm getting an error when running MAE.forward(): the tensor of predicted pixel values is off by a factor of 4 compared to the masked_patches tensor in the mse_loss call.
```
RuntimeError: The size of tensor a (1024) must match the size of tensor b (4096) at non-singleton dimension 2
```
I've tried different settings, but the factor-of-4 size mismatch persists.
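Some arithmetic that makes me suspect the input channels (my own reasoning, not something from the library): with patch_size = 32, both sides of the loss should be patch_size * patch_size * channels wide, so 1024 corresponds to 1 channel per patch and 4096 to 4.

```python
# per-patch pixel counts implied by patch_size = 32
patch_size = 32
print(patch_size * patch_size * 1)  # 1024 -> tensor a, the predicted pixel values
print(patch_size * patch_size * 4)  # 4096 -> tensor b, the masked patches
```

One plausible reading is that the patches cut from the images carry 4x the pixels the decoder head was sized for (e.g. 4-channel RGBA inputs, or a patch dimension picked up from a different layer), though I haven't confirmed which.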
I've also tried a hack to fix the size of the predicted pixel values by multiplying the neuron count of the to_pixels output layer by 4. That fixes the mse_loss call but introduces a new problem: the gradients no longer match up in the backward call.
```
RuntimeError: Function MmBackward returned an invalid gradient at index 1 - got [4096, 1024] but expected shape compatible with [1024, 1024]
```
But now I don't know how to debug further.
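For anyone who wants to poke at the same thing, here is a minimal sketch along the lines of the README's MAE example, built from the settings listed below (image_size = 256 is my placeholder, depth maps from my encoder_depth key, and I'm assuming the to_patch / to_pixels attribute names from vit_pytorch/mae.py):

```python
import torch
from vit_pytorch import ViT
from vit_pytorch.mae import MAE

v = ViT(image_size=256, patch_size=32, num_classes=1000,
        dim=1024, depth=5, heads=8, mlp_dim=2048, channels=1)
mae = MAE(encoder=v, masking_ratio=0.75, decoder_dim=512, decoder_depth=5)

images = torch.randn(8, 1, 256, 256)   # channel count must match ViT's `channels`

patches = mae.to_patch(images)         # (batch, num_patches, p1 * p2 * c)
print(patches.shape[-1])               # 1024 here; 4096 would mean 4x the pixels per patch
print(mae.to_pixels.out_features)      # sized from the encoder: 32 * 32 * 1 = 1024

# list parameters whose shapes match the ones in the backward error,
# to see which module the invalid gradient belongs to
for name, param in mae.named_parameters():
    if tuple(param.shape) in [(1024, 1024), (4096, 1024)]:
        print(name, tuple(param.shape))
```

With matching channels the two printed sizes agree for me, so the mismatch has to enter somewhere between my data pipeline and the wrapper.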
My last settings were:
```python
'model': {
    'encoder_depth': 5,
    'decoder_depth': 5,
    'patch_size': 32,
    'num_classes': 1000,
    'channels': 1,
    'dim': 1024,
    'heads': 8,
    'mlp_dim': 2048,
    'masking_ratio': 0.75,
    'decoder_dim': 512,
},
```
Hi Rhinigtas! Could you show what your full training script looks like? Perhaps I can spot the error more easily that way.
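For comparison, the kind of loop I'd measure against is the README's MAE recipe, roughly (continuing from a setup like the sketch above; dataloader and the learning rate are placeholders):

```python
opt = torch.optim.Adam(mae.parameters(), lr=3e-4)  # placeholder optimizer settings

for images in dataloader:    # placeholder: any loader yielding (b, channels, h, w)
    loss = mae(images)       # MAE.forward returns the reconstruction loss directly
    opt.zero_grad()
    loss.backward()
    opt.step()
```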