Compact-Transformers
NLP Results and CCT size
Thanks for making transformers much more approachable! The downside of this may be stupid questions from beginners like me (still, I hope this is not one). In the NLP results, the five datasets achieved their best accuracy with five different CCT models, and for the Transformer, ViT-Lite, and CVT models, accuracy is almost inversely correlated with size. My "intuition" is that bigger models should be better (LLMs, for example, often give the best results). Maybe the small size of the datasets means larger models can't be trained as well, or maybe the embedding is not optimized for transformers. Could you please offer insight into this?
CCT is an encoder architecture. Are there small transformers that demonstrate an encoder/decoder or decoder-only architecture? How would you expect a decoder implementation of CCT to perform on generative tasks?
There are two things we should note from our experiments here that I think are important.
- CCT is about working with smaller datasets. Our vision goal was not to achieve the best ImageNet results, but to demonstrate that vision transformers can actually work in small domains without pretraining, and that this can be done with a minimal number of parameters. This is a callback to the classic flexibility-vs-interpretability tradeoff in statistics: we generally want the smallest model with the highest generalization performance, because we don't want to overparameterize.
- We tried a far larger number of variations on our network than most other works. We were able to do this because of the compact size, but we still have far more configurations than would typically be seen. If you look at the NLP results, you'll notice that most of them are relatively close to one another. I'd probably use CCT-4/1x1 for most small NLP tasks.
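As a rough illustration of why configuration size matters on small datasets (this uses stock PyTorch modules, not the repo's actual builder API, and the shapes are my own picks), parameter counts for small encoder stacks grow quickly with depth:

```python
import torch.nn as nn

def encoder_param_count(d_model: int, nhead: int, num_layers: int) -> int:
    """Count parameters of a plain PyTorch transformer encoder stack."""
    layer = nn.TransformerEncoderLayer(
        d_model, nhead, dim_feedforward=2 * d_model, batch_first=True
    )
    encoder = nn.TransformerEncoder(layer, num_layers)
    return sum(p.numel() for p in encoder.parameters())

# Deeper stacks add parameters roughly linearly per layer; on a small
# dataset every extra layer is more capacity that must be fit from scratch.
for depth in (2, 4, 7):
    print(depth, encoder_param_count(d_model=128, nhead=4, num_layers=depth))
```

This is why the smallest variant that matches the others' accuracy is usually the one to pick.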
LLMs are a vastly different domain and are accomplishing different goals. While there may be some shared tasks, the purpose of these works is quite different. Here we're trying to provide the utility of transformers to small datasets, which enables an average user or scientist (who doesn't have a large compute budget) to be able to train their networks from scratch. LLMs, and other large networks (including ViTs), are attempting to achieve the highest performance on tasks without care for compute budgets. Both these goals are important, but different.
As for the decoder structure, we were just focused on classification, so it only made sense to use an encoder-style transformer; we didn't need cross-attention. You're welcome to incorporate that if you wish to extend this to other types of problems, and we'd love to see the results.
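For anyone who wants to try that extension, here is a minimal sketch of a decoder head that cross-attends to encoder token features. The class name, vocabulary size, and dimensions are all my own assumptions, not anything from the repo:

```python
import torch
import torch.nn as nn

class TinyDecoderHead(nn.Module):
    """Hypothetical decoder that cross-attends to CCT-style encoder tokens."""
    def __init__(self, vocab_size=1000, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, memory):
        # Causal mask: each position attends only to earlier positions.
        T = tokens.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        # Cross-attention to `memory` happens inside each decoder layer.
        x = self.decoder(self.embed(tokens), memory, tgt_mask=mask)
        return self.out(x)

# `memory` stands in for the encoder's sequence of token features.
memory = torch.randn(2, 64, 128)          # (batch, seq_len, d_model)
tokens = torch.randint(0, 1000, (2, 10))  # target token ids
logits = TinyDecoderHead()(tokens, memory)
print(logits.shape)  # torch.Size([2, 10, 1000])
```

The encoder would stay frozen or train jointly; either way, the cross-attention blocks are the only new machinery on top of the classification setup.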
I hope this helps.
Thank you very much for your generous reply. I found CCT while looking for a transformer I could experiment with on a consumer GPU, and it is great to be able to explore transformers with limited resources! My concern about the NLP performance not improving with larger CCT models was mainly whether insights gained while working with CCT will scale to much larger transformers. I assume many of them will, but I wondered whether the NLP tasks were indicating some limitation.
For experimenting with transformer architectures there could be advantages to having both the encoder and the decoder. I'm unsure what a realistic benchmark would be for a transformer the size of CCT with a decoder. I assume image generation would be the right task to focus on, maybe a GAN building on CCT on a benchmark like https://paperswithcode.com/sota/image-generation-on-cifar-10
Thanks again for your work.