4 Minor issues
Hey, in the notebook you ask for things to improve, so here is a list of things I think might be errors or are simply unusual details. I am not an expert on NLP, so please let me know if these are actually intended and not bugs.
- The ViT is not pretrained, yet certain parameters, such as the class token, the patch embedding, and the positional embedding, are frozen anyway without ever being trained: https://github.com/shreydan/VisionGPT2/blob/4c504889709dd8c82316460af699c2aa5e39e2c3/model.py#L176
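To illustrate what I mean (a minimal sketch with made-up shapes, not your actual code): freezing a randomly initialized parameter means it stays at its random initialization forever, because the optimizer never updates it.

```python
import torch
import torch.nn as nn

# Stand-in for a ViT positional embedding that was never pretrained.
torch.manual_seed(0)
pos_embed = nn.Parameter(torch.randn(1, 197, 768))
initial = pos_embed.detach().clone()

# Freezing: no gradients are computed, so no update can ever happen.
pos_embed.requires_grad_(False)

linear = nn.Linear(768, 768)  # some trainable layer downstream
opt = torch.optim.SGD(
    [p for p in [pos_embed, *linear.parameters()] if p.requires_grad],
    lr=0.1,
)

out = linear(pos_embed).sum()
out.backward()
opt.step()

# The frozen embedding is unchanged: it is stuck at random initialization.
print(torch.equal(pos_embed.detach(), initial))  # True
```

So unless the weights are loaded from a pretrained checkpoint, these frozen parameters stay random noise for the whole training run.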
- The ViT and GPT2 are aligned layer-to-layer. Not really a bug, but an odd design decision, according to ChatGPT :) https://github.com/shreydan/VisionGPT2/blob/4c504889709dd8c82316460af699c2aa5e39e2c3/model.py#L256
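For reference, here is roughly how I understand the two wiring options. This is only a sketch of the data flow with toy linear "blocks" standing in for real attention blocks, so treat the details as assumptions on my part, not your architecture:

```python
import torch
import torch.nn as nn

# Toy encoder/decoder "blocks" to illustrate the wiring, not the real model.
enc_blocks = nn.ModuleList(nn.Linear(8, 8) for _ in range(3))
dec_blocks = nn.ModuleList(nn.Linear(8, 8) for _ in range(3))

x_img, x_txt = torch.randn(1, 8), torch.randn(1, 8)

# (a) layer-to-layer, as I read model.py#L256: decoder block i consumes
# the output of encoder block i.
h_img, h_txt = x_img, x_txt
for enc, dec in zip(enc_blocks, dec_blocks):
    h_img = enc(h_img)
    h_txt = dec(h_txt + h_img)  # "+" stands in for cross-attention

# (b) the more common design: run the encoder to completion, then let
# every decoder block attend to the final encoder output.
h_img = x_img
for enc in enc_blocks:
    h_img = enc(h_img)
h_txt = x_txt
for dec in dec_blocks:
    h_txt = dec(h_txt + h_img)
```

Option (a) also hard-couples the depths of the two networks, which is why it struck me as unusual.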
- Code structure: I do not really understand why you copy so many parameters into new attributes; it makes them hard to keep track of, at least for me. E.g., self.blocks holds the blocks of the ViT, but the variable name does not say so; I would just leave it as self.vit.blocks, though I guess that is personal preference. https://github.com/shreydan/VisionGPT2/blob/4c504889709dd8c82316460af699c2aa5e39e2c3/model.py#L156
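A tiny sketch of what I mean, with hypothetical class names:

```python
import torch
import torch.nn as nn

# Style A: copy the submodule into a new attribute. Works, but the name
# loses its origin -- whose blocks are these?
class CaptionerA(nn.Module):
    def __init__(self, vit):
        super().__init__()
        self.blocks = vit.blocks

# Style B: keep the parent reference, so ownership is explicit at every
# call site.
class CaptionerB(nn.Module):
    def __init__(self, vit):
        super().__init__()
        self.vit = vit

    def encode(self, x):
        for blk in self.vit.blocks:  # unambiguous: these are ViT blocks
            x = blk(x)
        return x
```

Both register the same parameters; the difference is purely readability.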
- During inference the beginning-of-sentence (BOS) token is used, but as far as I can see the token is not used during training, so I would imagine this might create an issue.
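A minimal sketch of the potential mismatch. The caption token ids here are purely illustrative; 50256 is GPT-2's <|endoftext|> id, which is commonly reused as BOS:

```python
# If training sequences never start with BOS but generation is seeded with
# it, the model sees an input at inference time that it never saw in
# training.
BOS = 50256  # GPT-2's <|endoftext|>, often reused as BOS

train_seq = [1212, 318, 257, 3797]  # illustrative caption ids, no BOS
infer_seed = [BOS]                  # generation starts from BOS

# A quick consistency check one could add to the data pipeline:
def starts_with_bos(seq, bos=BOS):
    return len(seq) > 0 and seq[0] == bos

print(starts_with_bos(train_seq))  # False -> mismatch with inference seed
```

Either prepending BOS during training or seeding generation without it would make the two consistent.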
Also, it would be great to know what hardware, specifically what GPU, you used for training.
Other than that, nice code base, and surprisingly good results. Thank you!