[BUG] Information extracted from table/image using Azure Document Intelligence API is not reflected in GraphRAG input

Open hide212131 opened this issue 10 months ago • 0 comments

Description

When a PDF document with the following structure is read by Azure Document Intelligence, files for Paragraph 1 and Paragraph 2 are created in the GraphRAG input folder, but no file is created for the table or for the image description.

Paragraph 1
Table
Paragraph 2
Image
...

Reproduction steps

1. In Retrieval settings > GraphRAG Collection > File loader, select `Azure AI Document Intelligence (figure+table extraction)`
1. Upload a PDF file containing a table to the GraphRAG collection
1. Execute a query related to the table

Screenshots

No response

Logs

No response

Browsers

No response

OS

No response

Additional information

AzureAIDocumentIntelligenceLoader stores Text/Table/Image separately in the Document without duplication, while GraphRAGIndexingPipeline outputs only Text.

I think it would be more appropriate to have a format like ktem_app_data/markdown_cache_dir, where tables and other elements are expanded inline, as the text to be indexed.
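
To illustrate the difference, here is a minimal sketch, not the actual kotaemon code. `Doc` is a hypothetical stand-in for the loader's document object, and the `type` metadata key with values `table` and `image` is an assumption about how the loader tags non-text elements. `text_only` mimics the behaviour described above, while `inline_all` shows the proposed inline expansion.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    # Hypothetical stand-in for the loader's document object.
    text: str
    metadata: dict = field(default_factory=dict)


def text_only(docs):
    # Assumed current behaviour: only plain-text documents are written out.
    return [d.text for d in docs if d.metadata.get("type", "text") == "text"]


def inline_all(docs):
    # Proposed behaviour: keep every element, in loader order, as one text.
    return "\n\n".join(d.text for d in docs if d.text.strip())


docs = [
    Doc("Paragraph 1"),
    Doc("| a | b |\n|---|---|\n| 1 | 2 |", {"type": "table"}),
    Doc("Paragraph 2"),
    Doc("Figure: a bar chart of sales", {"type": "image"}),
]

print(text_only(docs))   # table and image description are dropped
print(inline_all(docs))  # table and image description appear inline
```

With the second approach, the table and the image description would reach the GraphRAG input together with the surrounding paragraphs.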

hide212131, Jan 03 '25 12:01