Question about the OCR capability
Great work indeed!
From the description in the paper, I do not see any special OCR module. I am curious how LLaVA obtains the ability to understand text in images (e.g., the famous chicken nuggets example). Is there any magic in the training dataset?
CLIP is all you need
The so-called emerging properties. The pre-trained visual encoder and LLM already have a good understanding of their respective domains (each with its own structured feature space). We link them with a linear projection layer, which can be viewed as a visual tokenization step: it embeds and aligns visual tokens into the pre-trained language model's word embedding space. As a result, these projected visual embeddings lie very close to the corresponding word embeddings, which is what makes OCR possible.
This image-text feature alignment step works well even when trained with very little paired image-text data.
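For intuition, here is a minimal PyTorch sketch of such a linear projection layer. This is not the official LLaVA code, and the dimensions (CLIP hidden size 1024, LLM embedding size 4096) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Sketch: map CLIP patch features into the LLM's word-embedding space."""
    def __init__(self, clip_hidden_dim: int = 1024, llm_embed_dim: int = 4096):
        super().__init__()
        # a single linear layer; dimensions here are assumptions for illustration
        self.proj = nn.Linear(clip_hidden_dim, llm_embed_dim)

    def forward(self, clip_patch_features: torch.Tensor) -> torch.Tensor:
        # clip_patch_features: (batch, num_patches, clip_hidden_dim)
        # returns "visual tokens" that live in the same space as text token embeddings
        return self.proj(clip_patch_features)

# toy usage: 576 patch features from a ViT, projected for the LLM
features = torch.randn(1, 576, 1024)
visual_tokens = VisualProjector()(features)
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
```

The projected visual tokens are then simply concatenated with the text token embeddings and fed to the language model, so no separate OCR module is involved.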
Hello @ChunyuanLI, I wonder whether you use training image-text pairs that contain text inside the image? Could you show me an example of them? Thanks a lot!