[Question]: Associate Images with Their Semantic Context in Search Results
Self Checks
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (Language Policy).
- [x] Non-English title submissions will be closed directly (Language Policy).
- [x] Please do not modify this template :) and fill in all the required fields.
Describe your problem
Overview
The current image retrieval mechanism in the knowledge base only recognizes the text content within images; it cannot identify the semantic association between an image and its surrounding text. As a result, the system cannot provide images as supplementary information when answering questions about an image's context, which hurts the completeness and intuitiveness of knowledge base responses.
Problem Description
Current Behavior
- Precise Matching: When a user's question is highly relevant to the OCR text within PDF images, the image appears as a thumbnail in the response. This part works well.
- Missing Context: When a user's question is related to descriptive text around an image (e.g., paragraphs above the image, captions below the image) but not related to the OCR text within the image, the system only retrieves and displays the text, ignoring the image that should serve as a key supplement.
Expected Behavior
The system should be able to understand the association between images and their context. When retrieving a piece of text, if that text is closely related semantically to a nearby image (e.g., the text is an explanation or summary of the image), the system should also treat the image as relevant content and return it together with the text in the response.
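To make the expected behavior concrete, here is a minimal sketch in Python. The chunk schema (an `image_refs` field on text chunks) and the toy keyword retriever are assumptions for illustration, not RAGFlow's actual data model or retrieval code:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed chunk record: each text chunk keeps references to images
# that appeared next to it in the source document.
@dataclass
class Chunk:
    chunk_id: str
    text: str
    image_refs: List[str] = field(default_factory=list)

def retrieve(question: str, chunks: List[Chunk], top_k: int = 3) -> List[Chunk]:
    """Toy keyword matcher standing in for the real vector search."""
    scored = [(sum(word in c.text.lower() for word in question.lower().split()), c)
              for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:top_k] if score > 0]

def answer_with_images(question: str, chunks: List[Chunk],
                       images: Dict[str, bytes]) -> dict:
    """Return matched text plus every image linked to the matched chunks."""
    hits = retrieve(question, chunks)
    return {
        "texts": [c.text for c in hits],
        "images": [images[ref] for c in hits for ref in c.image_refs],
    }

# Example: the caption chunk carries a reference to its figure, so a question
# about the caption also surfaces the image thumbnail.
chunks = [Chunk("c1", "Figure 3 shows the system architecture.", ["img_3"])]
images = {"img_3": b"<thumbnail bytes>"}
print(answer_with_images("what does the architecture look like", chunks, images))
```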
I fully agree with your idea; returning images in responses is definitely necessary.
Related: https://github.com/infiniflow/ragflow/issues/9100
We could also achieve this by saving the surrounding text as the image's context when chunking the parsed files, roughly along the lines sketched below.
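A sketch only; the block dicts and the `attach_context` helper are assumptions about parser output, not RAGFlow's actual chunking code:

```python
def attach_context(blocks, window=1):
    """Attach nearby text blocks to each image block as its context.

    `blocks` is assumed parser output: a list of dicts with a "type"
    ("text" or "image") in document order. For every image, the text of the
    `window` blocks before and after it is stored as the image's context.
    """
    enriched = []
    for i, block in enumerate(blocks):
        if block["type"] == "image":
            neighbors = blocks[max(0, i - window): i] + blocks[i + 1: i + 1 + window]
            context = " ".join(b["text"] for b in neighbors if b["type"] == "text")
            block = {**block, "context": context}
        enriched.append(block)
    return enriched

# Example: the caption below the figure becomes the figure's searchable context.
blocks = [
    {"type": "text", "text": "The pipeline is summarized below."},
    {"type": "image", "image_id": "img_1"},
    {"type": "text", "text": "Figure 1: Overview of the ingestion pipeline."},
]
print(attach_context(blocks)[1]["context"])
```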
Hi, we now support adding context for figures and tables in chunks~
Also, I'd like to know which parsing method you are using, since different parsing methods lead to different results.
For example, with manual + PDF and general + Word, the surrounding text and the VLM description of the image are merged into one chunk.
But when I use "general" to parse a PDF, the text around the picture is merged into one chunk while the picture is extracted as a standalone chunk. This separates the image from its context.
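For illustration, a post-processing pass could re-link such a standalone image chunk to the nearest text chunk on the same page. The chunk fields used below (`page`, `top`, `linked_text_id`) are assumptions, not RAGFlow's actual chunk schema:

```python
def link_images_to_text(chunks):
    """Give each standalone image chunk the id of the nearest text chunk
    on the same page, so retrieving the text can also surface the image.

    Assumed chunk shape: {"id", "type", "page", "top", ...}.
    """
    text_chunks = [c for c in chunks if c["type"] == "text"]
    for chunk in chunks:
        if chunk["type"] != "image":
            continue
        same_page = [t for t in text_chunks if t["page"] == chunk["page"]]
        if same_page:
            nearest = min(same_page, key=lambda t: abs(t["top"] - chunk["top"]))
            chunk["linked_text_id"] = nearest["id"]
    return chunks

chunks = [
    {"id": "t1", "type": "text", "page": 2, "top": 100,
     "text": "Figure 5 compares throughput across parsers."},
    {"id": "i1", "type": "image", "page": 2, "top": 130},
]
print(link_images_to_text(chunks)[1]["linked_text_id"])  # -> t1
```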
However, the rules are different when it comes to replying with images in chat.
If you have any other findings, you're welcome to share them with me!