Question about the OCR capability
Great work indeed!
From the description in the paper, I do not see any special OCR module. I am curious how LLaVA obtains the ability to understand text in images (e.g., the famous chicken nuggets example). Is there any magic in the training dataset?
CLIP is all you need
The so-called emerging properties. The pre-trained visual encoder and LLM already have a good understanding of their respective domains (each with its own structured feature space). We link them with a linear projection layer, which can be viewed as a visual tokenization step: it embeds and aligns visual tokens into the pre-trained language model's word embedding space. As a result, these projected visual embeddings lie very close to the corresponding word embeddings, which is what makes OCR possible.
This image-text feature alignment step works well even when trained with very little paired image-text data.
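For intuition, here is a minimal PyTorch sketch of such a linear projection layer. This is not the official LLaVA code, and the dimensions (CLIP hidden size 1024, LLM embedding size 4096) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Sketch: map CLIP patch features into the LLM's word-embedding space."""
    def __init__(self, clip_hidden_dim: int = 1024, llm_embed_dim: int = 4096):
        super().__init__()
        # a single linear layer; dimensions here are assumptions for illustration
        self.proj = nn.Linear(clip_hidden_dim, llm_embed_dim)

    def forward(self, clip_patch_features: torch.Tensor) -> torch.Tensor:
        # clip_patch_features: (batch, num_patches, clip_hidden_dim)
        # returns "visual tokens" that live in the same space as text token embeddings
        return self.proj(clip_patch_features)

# toy usage: 576 patch features from a ViT, projected for the LLM
features = torch.randn(1, 576, 1024)
visual_tokens = VisualProjector()(features)
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
```

The projected visual tokens are then simply concatenated with the text token embeddings and fed to the language model, so no separate OCR module is involved.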
Hello @ChunyuanLI, I wonder whether you use training image-text pairs that contain text inside the image? Could you show me an example of them? Thanks a lot!