vivekh2000

Results 3 issues of vivekh2000

I have included the `LayerNorm` layer in the classification head so that it matches the original implementation, which makes the two easier to compare. Also, an isolated `LayerNorm` was included,...

This was redundant and not used anywhere.

Since the `distillation_token` and `distill_mlp` heads are defined in the `DistillWrapper` class in your code, moving an instance of the `DistillableViT` class to the GPU does not move the `distillation_token`...
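
The device-placement behaviour described in this issue can be reproduced with a minimal sketch. `TinyViT` and `TinyDistillWrapper` are hypothetical stand-ins written only for illustration; they are not the library's `DistillableViT` or `DistillWrapper`, and the sizes are arbitrary. The sketch assumes standard PyTorch semantics: `Module.to()` only touches parameters registered on that module and its submodules.

```python
import torch
from torch import nn


class TinyViT(nn.Module):
    """Hypothetical stand-in for DistillableViT (not the library's class)."""

    def __init__(self, dim: int = 8, num_classes: int = 3):
        super().__init__()
        self.head = nn.Linear(dim, num_classes)


class TinyDistillWrapper(nn.Module):
    """Hypothetical stand-in for DistillWrapper: it owns the extra parameters."""

    def __init__(self, student: nn.Module, dim: int = 8, num_classes: int = 3):
        super().__init__()
        self.student = student
        self.distillation_token = nn.Parameter(torch.randn(1, 1, dim))
        self.distill_mlp = nn.Linear(dim, num_classes)


student = TinyViT()
wrapper = TinyDistillWrapper(student)

# The student does not know about parameters registered on the wrapper.
student_params = {name for name, _ in student.named_parameters()}
wrapper_params = {name for name, _ in wrapper.named_parameters()}

# Use the GPU when one is present; on a CPU-only machine both moves are no-ops.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

student.to(device)  # moves only the parameters registered on `student`
token_device_after_student_move = wrapper.distillation_token.device

wrapper.to(device)  # moves the wrapper's own parameters and the student's
token_device_after_wrapper_move = wrapper.distillation_token.device

print(sorted(student_params))
print(sorted(wrapper_params))
print(token_device_after_student_move, token_device_after_wrapper_move)
```

On a machine with a GPU, the token stays on the CPU after `student.to(device)` and only moves once `wrapper.to(device)` is called, so in a setup like this the wrapper itself has to be moved.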