Results: 2 comments of bkuster
(This is speculation/my understanding, not a 100% accurate answer.) 1) The "pretrain_mlp_adapter" is the file for the multi-layer perceptron weights. (The output tokens of the CLIP encoder are converted into "visual"...
As a hack, you can try "merging" several images into one image, but you'd probably have to finetune the model a bit.
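
To make the "merging" idea concrete, here is a rough sketch of tiling several images into one grid image. It is only an illustration under my own assumptions: images are plain lists of pixel rows, they all share the same size, and `tile_images`, `cols`, and `fill` are hypothetical names, not anything from the model's codebase. With a real image library you would paste each image at the same (left, top) offsets instead.

```python
from math import ceil


def tile_images(images, cols, fill=0):
    """Tile equally sized images (lists of pixel rows) into one grid image.

    Hypothetical helper for illustration; empty grid cells are set to `fill`.
    """
    if not images:
        raise ValueError("need at least one image")
    if cols < 1:
        raise ValueError("cols must be at least 1")

    height, width = len(images[0]), len(images[0][0])
    for img in images:
        if len(img) != height or any(len(row) != width for row in img):
            raise ValueError("all images must share the same size")

    rows = ceil(len(images) / cols)
    # Blank canvas large enough for a rows x cols grid of tiles.
    canvas = [[fill] * (cols * width) for _ in range(rows * height)]

    for index, img in enumerate(images):
        top = (index // cols) * height   # vertical offset of this tile
        left = (index % cols) * width    # horizontal offset of this tile
        for y, row in enumerate(img):
            canvas[top + y][left:left + width] = row
    return canvas


if __name__ == "__main__":
    a = [[1, 1], [1, 1]]
    b = [[2, 2], [2, 2]]
    c = [[3, 3], [3, 3]]
    for line in tile_images([a, b, c], cols=2):
        print(line)
```

Note that each sub-image ends up with fewer pixels once the merged image is resized to the encoder's input resolution, which is part of why some finetuning would probably be needed.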