Integrate Docling in Elasticsearch
Requested feature
Background
Docling reads popular document formats (PDF, DOCX, PPTX, XLSX, images, HTML, AsciiDoc, and Markdown), converts them into a unified data model with a rich document representation (the `DoclingDocument` class in docling-core), and exports them to Markdown and JSON.
Elasticsearch is an open source, distributed, RESTful search and analytics engine, scalable data store, and vector database. Among its features, the Attachment Processor lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache Tika text extraction library.
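For reference, the existing Tika-based flow can be sketched as follows. The pipeline body uses the `attachment` processor with its `field` option, and documents carry the file as a base64-encoded string. The `data` field name and the helper function names are illustrative assumptions, and the snippet only builds the request bodies; it does not contact a cluster.

```python
import base64
import json


def attachment_pipeline(source_field: str = "data") -> dict:
    """Body for PUT _ingest/pipeline/<name>; the pipeline name is up to the caller."""
    return {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": source_field}}],
    }


def attachment_document(raw: bytes, source_field: str = "data") -> dict:
    """The attachment processor expects the file as a base64-encoded string."""
    return {source_field: base64.b64encode(raw).decode("ascii")}


pipeline_body = attachment_pipeline()
doc_body = attachment_document(b"Hello Docling")
print(json.dumps(pipeline_body, indent=2))
print(json.dumps(doc_body))
```

With the processor's default settings, the extracted text ends up under `attachment.content` in the indexed document.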
Goal
This feature request is to design and implement an integration of Docling with Elasticsearch that enables parsing and indexing document content in common formats (PDF, HTML, MS Office, ...).
Some aspects will need to be addressed, including:
- Running Docling on the JVM, since Docling is written in Python and Elasticsearch in Java. An existing option, already in place in Elasticsearch, is leveraging Jython.
- Deciding on the integration approach: a new implementation of the Attachment Processor, an Elasticsearch plugin, ...
- The indexing options:
  - text only: export the resulting Docling object to Markdown and index it as a `text` type field, as it is done with the `content` field in the Attachment Processor
  - document structure: export the resulting Docling object to JSON and allow users to select the fields to extract and index (such as paragraphs, tables, ...).
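The two indexing options could produce Elasticsearch documents along these lines. This is a sketch under assumptions: the input dictionary is a heavily simplified, hand-written stand-in for the JSON export of a `DoclingDocument` (a `texts` list whose items carry `label` and `text`), and the function names and the output field names (`content`, `items`) are hypothetical.

```python
from typing import Iterable


def text_only_document(markdown: str) -> dict:
    """Option 1: index the whole Markdown export as a single text field."""
    return {"content": markdown}


def structured_document(docling_json: dict, labels: Iterable[str]) -> dict:
    """Option 2: keep only the items whose label the user selected."""
    wanted = set(labels)
    items = [
        {"label": item["label"], "text": item.get("text", "")}
        for item in docling_json.get("texts", [])
        if item.get("label") in wanted
    ]
    return {"items": items}


# Simplified, hand-written sample; not real Docling output.
sample = {
    "texts": [
        {"label": "section_header", "text": "Introduction"},
        {"label": "paragraph", "text": "Docling parses documents."},
        {"label": "page_footer", "text": "Page 1"},
    ]
}

text_doc = text_only_document("# Introduction\n\nDocling parses documents.")
struct_doc = structured_document(sample, labels=["section_header", "paragraph"])
print(text_doc)
print(struct_doc)
```

The second option trades a larger mapping for the ability to query individual elements, for example searching only within section headers.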
Alternatives
- Do not provide a native integration; instead, provide a tool in Docling that creates an index with custom mappings to store and index Docling documents exported as JSON. This approach was implemented in the legacy document version.
- Explore other open-source enterprise search suites, such as OpenSearch.
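For the first alternative, a tool on the Docling side would mainly need to generate a create-index request body with explicit mappings. A minimal sketch, assuming hypothetical field names (`content`, `filename`, `items`) and a hypothetical helper name; only the field types `text`, `keyword`, and `nested` are taken from Elasticsearch itself:

```python
import json


def docling_index_body(text_fields=("content",), keyword_fields=("filename",)) -> dict:
    """Build a body for the create-index API (PUT /<index>) with explicit mappings."""
    properties = {}
    for name in text_fields:
        properties[name] = {"type": "text"}
    for name in keyword_fields:
        properties[name] = {"type": "keyword"}
    # Hypothetical nested field holding the selected document items.
    properties["items"] = {
        "type": "nested",
        "properties": {
            "label": {"type": "keyword"},
            "text": {"type": "text"},
        },
    }
    return {"mappings": {"properties": properties}}


index_body = docling_index_body()
print(json.dumps(index_body, indent=2))
```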