kotaemon
[BUG] Information extracted from table/image using Azure Document Intelligence API is not reflected in GraphRAG input
Description
When Azure Document Intelligence reads a PDF document with the following structure, files for Paragraph 1 and Paragraph 2 are created in the GraphRAG input folder, but no file is created for the table or for the image description.
Paragraph 1
Table
Paragraph 2
Image
...
Reproduction steps
1. In Retrieval settings > GraphRAG Collection > File loader, select `Azure AI Document Intelligence (figure+table extraction)`
2. Upload a PDF file containing a table to the GraphRAG collection
3. Execute a query related to the table
Screenshots
No response
Logs
No response
Browsers
No response
OS
No response
Additional information
`AzureAIDocumentIntelligenceLoader` stores text, table, and image content separately in the Document, without duplication, whereas `GraphRAGIndexingPipeline` outputs only the text.
I think it would be more appropriate for the text to be indexed to use a format like the one in `ktem_app_data/markdown_cache_dir`, where tables and other elements are expanded inline.
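
To illustrate the suggestion, here is a minimal sketch of how separately stored elements could be merged back into one inline text per file before being written to the GraphRAG input folder. It is not kotaemon's actual code: the `Element` class, the `build_graphrag_inputs` function, and the metadata keys `file_name`, `page`, and `type` are all hypothetical names chosen for the example, and it assumes the elements arrive roughly in reading order within each page.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for one document produced by the loader."""
    text: str
    metadata: dict = field(default_factory=dict)


def build_graphrag_inputs(elements):
    """Return {file_name: text} with tables and image descriptions kept inline."""
    by_file = defaultdict(list)
    for index, element in enumerate(elements):
        # Assumed metadata key; the real loader may name it differently.
        by_file[element.metadata.get("file_name", "unknown")].append((index, element))

    outputs = {}
    for file_name, items in by_file.items():
        # Order by page first, then by original position within the input list.
        items.sort(key=lambda pair: (pair[1].metadata.get("page", 0), pair[0]))
        parts = []
        for _, element in items:
            kind = element.metadata.get("type", "text")
            if kind == "image":
                # Mark image descriptions so they are distinguishable in the text.
                parts.append(f"[Figure] {element.text}")
            else:
                # Paragraphs and tables (e.g. Markdown tables) are kept as-is.
                parts.append(element.text)
        outputs[file_name] = "\n\n".join(parts)
    return outputs


if __name__ == "__main__":
    sample = [
        Element("Paragraph 1", {"file_name": "a.pdf", "page": 1, "type": "text"}),
        Element("| a | b |", {"file_name": "a.pdf", "page": 1, "type": "table"}),
        Element("Paragraph 2", {"file_name": "a.pdf", "page": 2, "type": "text"}),
        Element("A bar chart", {"file_name": "a.pdf", "page": 2, "type": "image"}),
    ]
    for name, text in build_graphrag_inputs(sample).items():
        print(name)
        print(text)
```

With something along these lines, each uploaded file would yield a single input text in which the table and the image description appear between the surrounding paragraphs, so GraphRAG could extract entities from them as well.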