Resampler
I added the perceiver resampler as used in OpenFlamingo, following finding 4 from "What matters when building vision-language models?" (https://arxiv.org/pdf/2405.02246) by Hugging Face:
"Reducing the number of visual tokens with learned pooling significantly improves compute efficiency at training and inference while improving performance on downstream tasks."
I've experimented with various projection architectures (perceiver resampler, MAP, C-Abstractor). C-Abstractor worked best, but even it comes with a 2-5% drop on the benchmarks we're measuring. It's a good option for some situations (e.g. running on a Raspberry Pi), but I would want to explore alternative approaches like TokMe first.
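For comparison, roughly what the C-Abstractor does (a simplified sketch: the original Honeybee version uses ResNet blocks where this uses plain conv + SiLU, and all dimensions here are placeholders): it reshapes the tokens back into their 2D grid and downsamples with adaptive pooling between conv stages, so it keeps local spatial structure that the resampler's global queries discard.

```python
import torch
import torch.nn as nn

class CAbstractor(nn.Module):
    """Reduce the image token grid with convs + adaptive pooling,
    preserving local spatial layout."""

    def __init__(self, dim=1152, out_dim=2048, out_grid=8, depth=3):
        super().__init__()
        conv_block = lambda: nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1), nn.SiLU()
        )
        self.conv_in = nn.Sequential(*[conv_block() for _ in range(depth)])
        self.pool = nn.AdaptiveAvgPool2d(out_grid)  # e.g. 27x27 -> 8x8 tokens
        self.conv_out = nn.Sequential(*[conv_block() for _ in range(depth)])
        self.proj = nn.Linear(dim, out_dim)  # project into the LM's width

    def forward(self, image_tokens):  # (B, N, dim), N assumed a square number
        b, n, d = image_tokens.shape
        g = int(n ** 0.5)  # side of the token grid, e.g. 27 for 729 tokens
        x = image_tokens.transpose(1, 2).reshape(b, d, g, g)
        x = self.conv_out(self.pool(self.conv_in(x)))
        x = x.flatten(2).transpose(1, 2)  # (B, out_grid**2, dim)
        return self.proj(x)
```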
OK, interesting.
I found the resampler applied in multiple papers already. Maybe it doesn't work as well when the model is already small.
I will close it for now.
Not my intent to re-open this PR, just here to say I'm currently testing on a Raspberry Pi. At the moment the bottleneck is the time it takes the language model to process those 729 image tokens. I'm not caught up on the latest research you both mention, but I'm thinking of something as simple as feeding in only every Nth token, or maybe only the tokens from the middle 50% of the image, just to get the response streaming quicker. I can let you know how this goes; the only obstacle stopping me, as far as I can tell, is making sure it gets saved to .gguf correctly.
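If it helps, here's roughly what I mean in code (a minimal sketch assuming the 729 tokens come from a 27x27 grid in a (batch, tokens, dim) tensor; the function names are mine, nothing here is from the repo):

```python
import torch

def every_nth_token(image_tokens: torch.Tensor, stride: int = 4) -> torch.Tensor:
    """Keep every Nth image token: (B, 729, D) -> (B, ceil(729/stride), D).
    No new weights, so the only export-time change is the token count."""
    return image_tokens[:, ::stride, :]

def center_crop_tokens(image_tokens: torch.Tensor, keep: float = 0.5) -> torch.Tensor:
    """Keep only the tokens from the middle of the grid. Note: keeping the
    middle 50% of rows and columns retains ~25% of tokens (13x13 of 27x27)."""
    b, n, d = image_tokens.shape
    g = int(n ** 0.5)          # 27 for 729 tokens
    k = max(1, int(g * keep))  # rows/cols to keep
    start = (g - k) // 2
    grid = image_tokens.reshape(b, g, g, d)
    return grid[:, start:start + k, start:start + k, :].reshape(b, -1, d)
```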