vivekh2000
Results: 3 issues
I have included the `LayerNorm` layer in the classification head so that it matches the original implementation. It will be easy to compare now. Also, an isolated `LayerNorm` was included,...
This was redundant and not used anywhere.
Since, in your code, the `distillation_token` and the `distill_mlp` head are defined in the `DistillWrapper` class, sending the model instance of the `DistillableViT` class to the GPU does not send the `distillation_token`...
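The behaviour described in this last issue can be reproduced with a minimal PyTorch sketch. The classes `TinyBackbone` and `TinyDistillWrapper` below are hypothetical stand-ins, not the repository's actual `DistillableViT` and `DistillWrapper`; they only mirror the ownership pattern, where the wrapper, rather than the wrapped model, registers the `distillation_token` and `distill_mlp`.

```python
import torch
from torch import nn


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a backbone such as DistillableViT."""

    def __init__(self, dim=8, num_classes=3):
        super().__init__()
        self.head = nn.Linear(dim, num_classes)


class TinyDistillWrapper(nn.Module):
    """Hypothetical stand-in for a wrapper that owns the distillation parameters."""

    def __init__(self, student, dim=8, num_classes=3):
        super().__init__()
        self.student = student
        # Registered on the wrapper, so the student knows nothing about them.
        self.distillation_token = nn.Parameter(torch.randn(1, 1, dim))
        self.distill_mlp = nn.Linear(dim, num_classes)


student = TinyBackbone()
wrapper = TinyDistillWrapper(student)

# student.to(device) only visits parameters reachable from the student itself.
student_param_ids = {id(p) for p in student.parameters()}
token_in_student = id(wrapper.distillation_token) in student_param_ids

# The wrapper's parameter list includes both its own and the student's parameters.
wrapper_names = [name for name, _ in wrapper.named_parameters()]

# Moving the wrapper moves the student, the token, and distill_mlp together.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
wrapper.to(device)
devices = {p.device.type for p in wrapper.parameters()}
```

Under these assumptions, calling `.to(device)` on the wrapper instead of on the wrapped model is one way to keep every parameter on the same device.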