[Question]: Associate Images with Their Semantic Context in Search Results

Open haomiemie opened this issue 2 months ago • 4 comments

Self Checks

  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (Language Policy).
  • [x] Non-English title submissions will be closed directly (Language Policy).
  • [x] Please do not modify this template :) and fill in all the required fields.

Describe your problem

Overview

The current image retrieval mechanism in the knowledge base can only recognize text content within images; it cannot identify the semantic association between an image and its surrounding text. As a result, the system cannot provide images as supplementary information when answering questions related to an image's context, which hurts the completeness and intuitiveness of knowledge base responses.

Problem Description

Current Behavior

  • Precise Matching: When a user's question is highly relevant to the OCR text within PDF images, the image appears as a thumbnail in the response. This part works well.
  • Missing Context: When a user's question is related to descriptive text around an image (e.g., paragraphs above the image, captions below the image) but not related to the OCR text within the image, the system only retrieves and displays the text, ignoring the image that should serve as a key supplement.

Expected Behavior

The system should be able to understand the association between images and their context. When retrieving a piece of text, if that text is semantically closely related to a nearby image (e.g., the text is an explanation or summary of the image), the system should also consider the image as relevant content and provide it together in the response.
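A rough sketch of the desired retrieval behavior. All names here (`Chunk`, `answer_with_images`, the toy scoring function) are illustrative assumptions, not RAGFlow's actual API: the key idea is that a text chunk carries references to its associated images, so any image linked to a retrieved chunk is surfaced alongside it.

```python
# Illustrative sketch only -- Chunk, answer_with_images, and overlap are
# hypothetical names, not RAGFlow internals.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    image_refs: list = field(default_factory=list)  # images linked to this chunk

def answer_with_images(query, chunks, score_fn, top_k=3):
    """Rank chunks by text similarity; surface any images linked to the hits."""
    ranked = sorted(chunks, key=lambda c: score_fn(query, c.text), reverse=True)
    hits = ranked[:top_k]
    images = [ref for c in hits for ref in c.image_refs]
    return hits, images

# Toy scoring: word overlap between query and chunk text.
def overlap(query, text):
    return len(set(query.lower().split()) & set(text.lower().split()))

chunks = [
    Chunk("Figure 2 shows the pump assembly diagram.", image_refs=["fig2.png"]),
    Chunk("Installation requires a torque wrench."),
]
hits, images = answer_with_images("pump assembly diagram", chunks, overlap, top_k=1)
# images == ["fig2.png"] -- the image rides along with its context chunk
```

The point is that the image is returned because its *context* matched the query, even though no OCR text inside the image did.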

haomiemie avatar Oct 31 '25 06:10 haomiemie

I fully agree with your idea, and outputting images is quite necessary.

liuzhuosin avatar Nov 05 '25 09:11 liuzhuosin

Related: https://github.com/infiniflow/ragflow/issues/9100

ZhenhangTung avatar Nov 14 '25 11:11 ZhenhangTung

We could also achieve this goal by saving surrounding text as images' context when chunking the parsed files.
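This chunking-time approach could look roughly like the following. The names (`Block`, `build_chunks`, the one-block context `window`) are illustrative assumptions about the parsed-document structure, not RAGFlow's actual chunker:

```python
# Hypothetical sketch: when chunking parsed blocks, attach the nearest
# preceding/following text to each image as its searchable context.
from dataclasses import dataclass

@dataclass
class Block:
    kind: str     # "text" or "image"
    content: str  # text body, or image path for image blocks

def build_chunks(blocks, window=1):
    """Turn parsed blocks into chunks; image chunks carry neighbor text."""
    chunks = []
    for i, b in enumerate(blocks):
        if b.kind == "text":
            chunks.append({"text": b.content, "image": None})
        else:
            # Gather up to `window` text blocks on each side as context.
            ctx = [n.content
                   for n in blocks[max(0, i - window): i + window + 1]
                   if n.kind == "text"]
            chunks.append({"text": " ".join(ctx), "image": b.content})
    return chunks

blocks = [
    Block("text", "The pipeline has three stages."),
    Block("image", "pipeline.png"),
    Block("text", "Figure 1: overall pipeline."),
]
chunks = build_chunks(blocks)
# The image chunk is now indexable by its surrounding text:
# {"text": "The pipeline has three stages. Figure 1: overall pipeline.",
#  "image": "pipeline.png"}
```

With the surrounding text stored on the image chunk itself, an ordinary text-similarity search over chunks would pull the image in whenever its context matches the query.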

ZhenhangTung avatar Nov 14 '25 11:11 ZhenhangTung

Hi, we now support adding context for figures and tables in chunks.

Also, I'd like to know which parsing method you are using, since different parsing methods lead to different results.

For example, with manual+PDF and general+Word, the surrounding text and the VLM description of the image are merged into one chunk.

When I use General to parse a PDF, however, the text around the picture gets merged into one chunk while the picture is extracted as a standalone chunk, so the image ends up separated from its context.

However, the rules for replying with images in chat are different.

If you have any other findings, feel free to share them with me!

redredrrred avatar Nov 27 '25 07:11 redredrrred