Potential Bug in Dropout Logic of forward Method
https://github.com/NeuralCarver/Michelangelo/blob/6d83b0bacef92715dd5179d45647ed9a3d39bc95/michelangelo/models/conditional_encoders/encoder_factory.py#L554
In the forward method of the MoECLIPImageEncoder class, I noticed the following code snippet:
if zero_embedding_radio > 0:
    mask = torch.rand((len(image), 1, 1), device=z.device, dtype=z.dtype) >= zero_embedding_radio
    z = z + mask.to(z)
The current implementation uses z = z + mask.to(z) to apply the mask. However, mask appears to be intended as a dropout-style mechanism that randomly zeroes out embeddings in z to improve the model's generalization (as the name zero_embedding_radio implies). Addition does not achieve this: mask.to(z) converts the boolean mask to 0.0/1.0, so the kept samples simply have 1.0 added to them and nothing is ever zeroed out. This is inconsistent with the purpose of dropout, where we expect some elements to be zeroed rather than incremented.
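A minimal standalone sketch (with assumed tensor shapes and values, not the encoder's actual inputs) illustrating the difference:

import torch

zero_embedding_radio = 0.5
z = torch.randn(4, 3, 8)  # hypothetical (batch, tokens, dim) embedding

# Same masking logic as in forward: True means "keep this sample"
mask = torch.rand((len(z), 1, 1), device=z.device, dtype=z.dtype) >= zero_embedding_radio

added = z + mask.to(z)       # current code: kept samples are shifted by +1.0, nothing becomes zero
multiplied = z * mask.to(z)  # proposed fix: dropped samples become all zeros, kept samples are unchanged

print((added == 0).any())                       # almost surely tensor(False)
print((multiplied.flatten(1) == 0).all(dim=1))  # True exactly for the dropped samples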
I think the correct operation should be multiplication rather than addition:
z = z * mask.to(z)
This aligns with the standard dropout technique, where a random subset of the embeddings is zeroed out (rather than shifted) to prevent overfitting.
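For reference, a sketch of how the corrected branch might look inside forward (variable names taken from the snippet above; the rest of the method is assumed unchanged):

if zero_embedding_radio > 0:
    # Keep each sample's embedding with probability (1 - zero_embedding_radio)
    mask = torch.rand((len(image), 1, 1), device=z.device, dtype=z.dtype) >= zero_embedding_radio
    # Multiplying (not adding) zeroes out the embeddings of the dropped samples
    z = z * mask.to(z)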