Integrate Docling in Elasticsearch

Open ceberam opened this issue 1 year ago • 0 comments

Requested feature

Background

Docling reads popular document formats (PDF, DOCX, PPTX, XLSX, images, HTML, AsciiDoc, and Markdown), converts them into a unified data model with a rich document representation (the DoclingDocument class in docling-core), and exports to Markdown and JSON. Elasticsearch is an open source, distributed, RESTful search and analytics engine, scalable data store, and vector database. Among its features, the Attachment Processor lets Elasticsearch extract the content of file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache Tika text extraction library.

Goal

This feature request is to design and implement an integration of Docling with Elasticsearch that enables parsing and indexing of document content in common formats (PDF, HTML, MS Office, ...).

Some aspects will need to be addressed, including:

- Running Docling on the JVM, since Docling is written in Python and Elasticsearch in Java. An existing option, already in place in Elasticsearch, is leveraging Jython.
- Deciding the integration options: a new implementation of the Attachment Processor, an Elasticsearch plugin, ...
- The indexing options:
  - text only: export the resulting Docling object to Markdown and index it as a text type field, as is done with the content field in the Attachment Processor.
  - document structure: export the resulting Docling object to JSON and allow users to select the fields to extract and index (such as paragraphs, tables, ...).
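
The two indexing options above could be sketched as follows. This is only an illustration, not a proposed design: the helper names (`text_only_source`, `structured_source`, `convert_and_index`), the index names, and the field names are hypothetical, and the sketch assumes that the JSON export exposes a top-level `texts` list whose items carry `label` and `text` keys. The last function runs Docling outside the JVM and talks to Elasticsearch through the Python client, so it sidesteps the JVM question rather than answering it.

```python
from typing import Any


def text_only_source(markdown: str) -> dict[str, Any]:
    """Text-only option: the whole Markdown export goes into one text field."""
    return {"content": markdown}


def structured_source(doc: dict[str, Any], labels: set[str]) -> dict[str, Any]:
    """Structure option: keep only the text items whose label was selected.

    Assumes `doc` is the JSON export as a dict with a top-level "texts" list
    of items that have "label" and "text" keys.
    """
    selected: dict[str, list[str]] = {label: [] for label in sorted(labels)}
    for item in doc.get("texts", []):
        label = item.get("label")
        if label in selected and item.get("text"):
            selected[label].append(item["text"])
    return {"name": doc.get("name"), **selected}


def convert_and_index(path: str, es_url: str, index: str) -> None:
    """Convert one file and index both variants.

    Requires the `docling` and `elasticsearch` packages and a reachable
    cluster; the imports are local so the helpers above work without them.
    """
    from docling.document_converter import DocumentConverter
    from elasticsearch import Elasticsearch

    document = DocumentConverter().convert(path).document
    es = Elasticsearch(es_url)
    # Hypothetical index names: one index per indexing option.
    es.index(index=f"{index}-text",
             document=text_only_source(document.export_to_markdown()))
    es.index(index=f"{index}-structured",
             document=structured_source(document.export_to_dict(),
                                        {"paragraph", "section_header"}))
```

Keeping the field selection in a pure function makes it easy to let users pass their own set of labels, whichever integration path is eventually chosen.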

Alternatives

- Do not provide a native integration. Instead, provide a tool in Docling that creates an index with custom mappings to store and index Docling documents exported as JSON. This approach was implemented in the document legacy version.
- Explore other open-source enterprise search suites, like OpenSearch.
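
For the first alternative, a tool on the Docling side would mainly need to produce an index definition with explicit mappings. A minimal sketch, assuming hypothetical field names (`name`, `content`, plus one text field per selected element type) and a function name, `docling_index_body`, invented for illustration:

```python
from typing import Any


def docling_index_body(text_fields: list[str]) -> dict[str, Any]:
    """Build an index body with explicit mappings for exported documents."""
    properties: dict[str, dict[str, str]] = {
        "name": {"type": "keyword"},   # document name, exact match only
        "content": {"type": "text"},   # full Markdown export
    }
    for field in text_fields:
        properties[field] = {"type": "text"}
    # "dynamic": False keeps unmapped JSON fields out of the index mapping.
    return {"mappings": {"dynamic": False, "properties": properties}}


# With the Python client this body could be passed along the lines of
# Elasticsearch(url).indices.create(index="docling", mappings=body["mappings"])
body = docling_index_body(["paragraph", "section_header"])
```

Disabling dynamic mapping is a design choice made here to keep the index limited to the fields the user selected; a real tool might expose it as an option.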

Nov 29 '24 10:11 ceberam