Question about the OCR capability

MIL-VLG opened this issue 1 year ago

Great work indeed!

From the description in the paper, I do not see any dedicated OCR module. I am curious how LLaVA obtains the ability to understand text in images (e.g., the famous chicken nuggets example). Is there any magic in the training dataset?

MIL-VLG avatar Apr 18 '23 08:04 MIL-VLG

CLIP is all you need

152334H avatar Apr 19 '23 03:04 152334H

The so-called emergent properties. The pre-trained visual encoder and LLM already understand their respective domains well, each with its own structured feature space. We link them with a linear projection layer, which can be considered a visual tokenization step: it embeds and aligns visual tokens into the pre-trained language model's word-embedding space. These projected visual embeddings therefore land very close to the corresponding word embeddings, which makes OCR possible.
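
For intuition, here is a minimal sketch of what such a projection layer could look like (the class name and dimensions are illustrative, not LLaVA's actual code; LLaVA-v1 uses a single linear layer over CLIP ViT-L/14 patch features):

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps frozen vision-encoder features into the LLM's word-embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_embed_dim: int = 4096):
        super().__init__()
        # A single linear layer: each projected patch feature acts as one
        # "visual token" in the language model's input sequence.
        self.proj = nn.Linear(vision_dim, llm_embed_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim), e.g. from CLIP ViT-L/14
        # returns:        (batch, num_patches, llm_embed_dim), ready to be
        # concatenated with the word embeddings of the text prompt
        return self.proj(patch_features)

# Toy usage: 256 patch features projected into the word-embedding space.
visual_tokens = VisualProjector()(torch.randn(1, 256, 1024))
print(visual_tokens.shape)  # torch.Size([1, 256, 4096])
```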

This image-text feature alignment step works well even when trained on very little paired image-text data.
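
To make "very little paired data" concrete: during this alignment stage only the projector's weights are updated, while both pre-trained models stay frozen. Below is a toy sketch of that setup, reusing the `VisualProjector` above (the stand-in modules, shapes, and loss are placeholders, not the real training code):

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the real pre-trained models (placeholders only).
vision_encoder = nn.Identity()               # pretend: frozen CLIP visual encoder
llm_embeddings = nn.Embedding(32000, 4096)   # pretend: frozen LLM word embeddings

projector = VisualProjector(vision_dim=1024, llm_embed_dim=4096)

# Freeze everything except the projector.
for module in (vision_encoder, llm_embeddings):
    for p in module.parameters():
        p.requires_grad = False

optimizer = torch.optim.AdamW(projector.parameters(), lr=2e-3)

# One toy alignment step on a fake batch of patch features and caption tokens.
patch_feats = torch.randn(2, 256, 1024)          # (batch, patches, vision_dim)
caption_ids = torch.randint(0, 32000, (2, 16))   # (batch, caption_length)

visual_tokens = projector(vision_encoder(patch_feats))
word_tokens = llm_embeddings(caption_ids)
# A real run would feed [visual_tokens; word_tokens] through the frozen LLM and
# minimize next-token loss on the caption; here we only show which weights move.
loss = (visual_tokens.mean() - word_tokens.mean()).pow(2)
loss.backward()      # gradients flow only into the projector
optimizer.step()
```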

ChunyuanLI avatar Apr 19 '23 03:04 ChunyuanLI

Hello @ChunyuanLI, I wonder whether you used paired image-text training data that contains text within the images? Could you show me an example? Thanks a lot!

erjiaxiao avatar Aug 28 '23 06:08 erjiaxiao