WIP: docling integration
Changelog Entry
Description
These are very minimal changes to test Docling for experimental purposes; they are not ready to be released yet. The goal is just to highlight what could be done to integrate Docling, for contributors who have time for further testing and implementation. I will not have enough time in the coming days or weeks to build further, so please feel free to build upon this.
Approach 1: using langchain_docling, as in the first commit of this PR. I made it work, but it used the CPU rather than the GPU, leading to unreasonable processing times (a few minutes) for PDFs that are hundreds of pages long. Proposed improvements (a rough sketch follows after this list):
- find how to use GPU when available
- warn for processing times for long documents in case of CPU only and/or fallback to default engine
- handle full offline mode by integrating models into the library, see https://github.com/DS4SD/docling/issues/326
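A minimal sketch of what the GPU/CPU handling for Approach 1 could look like. It assumes langchain_docling exposes a DoclingLoader that accepts a pre-configured DocumentConverter (the converter parameter name may differ between versions), that docling's AcceleratorOptions live at this import path, and that torch is available via docling's dependencies; treat it as illustrative only.

import logging

import torch

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from langchain_docling import DoclingLoader

log = logging.getLogger(__name__)

# Use the GPU when one is visible; otherwise warn that CPU-only processing
# of long PDFs can take minutes (or fall back to the default engine instead).
if torch.cuda.is_available():
    device = AcceleratorDevice.CUDA
else:
    device = AcceleratorDevice.CPU
    log.warning("No GPU detected: Docling will run on CPU and long PDFs may take minutes.")

pipeline_options = PdfPipelineOptions()
pipeline_options.accelerator_options = AcceleratorOptions(device=device)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

loader = DoclingLoader(file_path="example.pdf", converter=converter)  # converter kwarg: assumed
docs = loader.load()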
Approach 2: using a Docker image. The Docling team has recently been active on docling-serve (see https://github.com/DS4SD/docling-serve/pkgs/container/docling-serve), so we can probably expect a Docling Docker image soon, as there is for Apache Tika. Then Ctrl+F the Tika usage in the Open WebUI repo and replicate the logic for Docling (a rough sketch follows below). Here too, CPU/GPU considerations need attention.
Both approaches could be implemented so that users can choose between them depending on their needs.
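For Approach 2, a rough sketch of what replicating the Tika pattern against a docling-serve container could look like; the server URL, endpoint path, form fields, and response shape below are assumptions based on the docling-serve README and may differ between releases, so verify against the actual API.

import requests

DOCLING_SERVER_URL = "http://docling-serve:5001"  # assumed service name and port

def extract_with_docling_serve(file_path: str) -> str:
    # Mirrors the Tika loader pattern: send the file to the conversion server
    # and return the extracted text (Markdown here) for the RAG pipeline.
    with open(file_path, "rb") as f:
        response = requests.post(
            f"{DOCLING_SERVER_URL}/v1alpha/convert/file",  # assumed endpoint
            files={"files": f},
            data={"to_formats": "md"},  # assumed field: request Markdown output
            timeout=600,  # long documents can take a while, especially on CPU
        )
    response.raise_for_status()
    payload = response.json()
    # Assumed response shape: converted document with a Markdown export.
    return payload["document"]["md_content"]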
Added
- [List any new features, functionalities, or additions]
Changed
- [List any changes, updates, refactorings, or optimizations]
Deprecated
- [List any deprecated functionality or features that have been removed]
Removed
- [List any removed features, files, or functionalities]
Fixed
- [List any fixes, corrections, or bug fixes]
Security
- [List any new or updated security-related changes, including vulnerability fixes]
Breaking Changes
- BREAKING CHANGE: [List any breaking changes affecting compatibility or functionality]
Additional Information
- [Insert any additional context, notes, or explanations for the changes]
- [Reference any related issues, commits, or other relevant information]
Screenshots or Videos
- [Attach any relevant screenshots or videos demonstrating the changes]
Pull Request Checklist
Note to first-time contributors: Please open a discussion post in Discussions and describe your changes before submitting a pull request.
Before submitting, make sure you've checked the following:
- [ ] Target branch: Please verify that the pull request targets the dev branch.
- [ ] Description: Provide a concise description of the changes made in this pull request.
- [ ] Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
- [ ] Documentation: Have you updated the relevant documentation (Open WebUI Docs) or other documentation sources?
- [ ] Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
- [ ] Testing: Have you written and run sufficient tests for validating the changes?
- [ ] Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
- [ ] Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
- BREAKING CHANGE: Significant changes that may affect compatibility
- build: Changes that affect the build system or external dependencies
- ci: Changes to our continuous integration processes or workflows
- chore: Refactor, cleanup, or other non-functional code changes
- docs: Documentation update or addition
- feat: Introduces a new feature or enhancement to the codebase
- fix: Bug fix or error correction
- i18n: Internationalization or localization changes
- perf: Performance improvement
- refactor: Code restructuring for better maintainability, readability, or scalability
- style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc.)
- test: Adding missing tests or correcting existing tests
- WIP: Work in progress, a temporary label for incomplete or ongoing work
Thanks for the work done. Better RAG will need better OCR tools, and Docling matches perfectly. Integrating Docling will have a huge impact on the adoption of Open WebUI.
I am more aligned with your second approach, where we use docling-serve. Personally, I have tested the initial implementation here: https://github.com/drmingler/docling-api, which is more industrialized, but it seems certain that Docling will go in this direction.
Could we have input from the official Open WebUI team on how they want to work on Docling? https://github.com/open-webui/open-webui/issues/7033
Switching between several OCR servers is a good idea. I do not know if it is best to do it in the general settings or per knowledge base (a rough sketch follows below). See https://github.com/open-webui/open-webui/discussions/9361
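Purely as a hypothetical sketch of the second option: a per-knowledge-base override that falls back to a global setting, similar in spirit to the existing Tika switch. The setting and field names below are assumptions, not existing Open WebUI configuration.

def resolve_extraction_engine(knowledge_base, global_engine: str = "") -> str:
    # Hypothetical per-knowledge-base field; falls back to the global setting,
    # then to the built-in default extractor.
    kb_engine = (getattr(knowledge_base, "data", None) or {}).get(
        "content_extraction_engine"
    )
    return kb_engine or global_engine or "default"

# Usage in the loader factory (names illustrative):
#   engine = resolve_extraction_engine(kb, CONTENT_EXTRACTION_ENGINE)
#   if engine == "docling":
#       loader = PDFLoaderWrapper(file_path=file_path)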
I am not a big fan of the LangChain Docling parser because it is too inflexible and rigid. Additionally, I am not sure if all parameters from Docling can be properly set in the LangChain wrapper, which further limits its flexibility. Docling is constantly evolving, and I am not sure if LangChain will implement necessary changes quickly enough. Furthermore, by using the LangChain wrapper, important metadata, such as the page label of the document, is lost.
For these reasons, I have written my own wrapper that inherits from LangChain’s BaseLoader. This ensures seamless integration into the existing LangChain loader and document object structure used in OpenWebUI while still allowing full functionality of Docling to be utilized.
My Implementation
import fitz  # PyMuPDF
import tempfile
import os
from pathlib import Path
from typing import List, Iterator

from langchain_core.documents import Document
from langchain_core.document_loaders import BaseLoader

from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    OcrMacOptions,
    PdfPipelineOptions,
    RapidOcrOptions,
    TesseractCliOcrOptions,
    TesseractOcrOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption


class PDFLoaderWrapper(BaseLoader):
    def __init__(self, file_path: str):
        self.file_path = file_path
        self.converter = self._initialize_converter()

    def _initialize_converter(self):
        from docling.datamodel.settings import settings

        # Small batch size to keep VRAM usage in check when enrichment is enabled.
        settings.perf.elements_batch_size = 3

        pipeline_options = PdfPipelineOptions()
        pipeline_options.do_ocr = True
        pipeline_options.do_table_structure = True
        pipeline_options.table_structure_options.do_cell_matching = True
        pipeline_options.do_formula_enrichment = True
        pipeline_options.do_code_enrichment = True

        # ocr_options = RapidOcrOptions(force_full_page_ocr=True, use_cuda=True, language='de')
        ocr_options = EasyOcrOptions(force_full_page_ocr=True, use_gpu=True)
        pipeline_options.ocr_options = ocr_options

        return DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
            }
        )

    def _split_pdf_temporarily(self) -> List[dict]:
        # Split the source PDF into one temporary single-page PDF per page so
        # page numbers (and labels) survive the conversion.
        doc = fitz.open(self.file_path)
        temp_files = []
        for page_num in range(len(doc)):
            temp_file = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
            temp_path = temp_file.name
            temp_file.close()

            new_doc = fitz.open()
            new_doc.insert_pdf(doc, from_page=page_num, to_page=page_num)
            new_doc.save(temp_path)
            new_doc.close()

            temp_files.append({"page": page_num + 1, "path": temp_path})
        doc.close()
        return temp_files

    def load(self) -> List[Document]:
        temp_pdfs = self._split_pdf_temporarily()
        documents = []
        for pdf_info in temp_pdfs:
            pdf_path = pdf_info["path"]
            page_num = pdf_info["page"]
            page_label = pdf_info.get("page_label", str(page_num))

            # Convert each single-page PDF and export it as Markdown.
            doc = self.converter.convert(pdf_path).document
            md = doc.export_to_markdown()
            documents.append(
                Document(
                    page_content=md,
                    metadata={"page": page_num, "page_label": page_label},
                )
            )
            os.remove(pdf_path)
        return documents

    def lazy_load(self) -> Iterator[Document]:
        temp_pdfs = self._split_pdf_temporarily()
        for pdf_info in temp_pdfs:
            pdf_path = pdf_info["path"]
            page_num = pdf_info["page"]
            page_label = pdf_info.get("page_label", str(page_num))

            doc = self.converter.convert(pdf_path).document
            md = doc.export_to_markdown()
            yield Document(
                page_content=md,
                metadata={"page": page_num, "page_label": page_label},
            )
            os.remove(pdf_path)
and then, in the Loader class:

if file_ext == "pdf":
    loader = PDFLoaderWrapper(file_path=file_path)
    log.info("cool docling nice")
- I split the PDF into individual pages before processing to retain page label information and other metadata (see the page-label sketch after this list).
- I leverage Docling's accelerator options to enable GPU usage, as mentioned in the draft.
- By using pipeline_options, I enable:
  - OCR processing (do_ocr=True)
  - Table structure extraction (do_table_structure=True)
  - Formula enrichment (do_formula_enrichment=True)
  - Code detection (do_code_enrichment=True)
- Instead of using LangChain's built-in Docling loader, I directly integrate Docling's API to retain complete metadata and improve flexibility.
- The OCR engine can be set to EasyOCR or RapidOCR (or other engines), but I chose these specifically because they have a use_gpu=True flag for better performance.
- By adjusting the batch size (settings.perf.elements_batch_size = 3), I can optimize VRAM usage, which is crucial since features like formula enrichment and table structure extraction are resource-intensive. Currently, this is controlled by a single parameter, but it should be configurable separately for each model in the future to further optimize memory usage. For more context, see the related discussion: 🔗 Issue #871 - Comment
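One small, optional refinement to the splitting step shown above: recent PyMuPDF versions expose the printed page label via Page.get_label(), so the wrapper could record the real label instead of always falling back to the page number. The get_label() call is the only addition here; verify it exists in the installed PyMuPDF version before relying on it.

def _split_pdf_temporarily(self) -> List[dict]:
    # Same splitting logic as above, but also capture the page label when the
    # PDF defines one; fall back to the 1-based page number otherwise.
    doc = fitz.open(self.file_path)
    temp_files = []
    for page_num in range(len(doc)):
        page_label = doc[page_num].get_label() or str(page_num + 1)

        temp_file = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
        temp_path = temp_file.name
        temp_file.close()

        new_doc = fitz.open()
        new_doc.insert_pdf(doc, from_page=page_num, to_page=page_num)
        new_doc.save(temp_path)
        new_doc.close()

        temp_files.append(
            {"page": page_num + 1, "page_label": page_label, "path": temp_path}
        )
    doc.close()
    return temp_files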
Final Step: Markdown Conversion for LLM Processing
To improve processing accuracy and enhance downstream tasks, I convert the document to Markdown at the end. This allows:
- Better LLM comprehension due to structured formatting.
- Additional processing steps like using LlamaIndex's MarkdownElementNodeParser, which can further analyze tables, formulas, and structured elements in detail (see the sketch after this list).
- More flexible post-processing by leveraging Markdown's structured format for advanced RAG techniques.
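A minimal sketch of the LlamaIndex step mentioned above, assuming llama-index-core is installed and an LLM is configured (e.g., via Settings.llm) for summarizing table elements; the function name is illustrative only.

from llama_index.core import Document as LIDocument
from llama_index.core.node_parser import MarkdownElementNodeParser

def parse_docling_markdown(markdown_text: str):
    # markdown_text is the export_to_markdown() output from the wrapper above.
    parser = MarkdownElementNodeParser(num_workers=4)
    nodes = parser.get_nodes_from_documents([LIDocument(text=markdown_text)])
    # Separate plain text nodes from table/structured "objects" for indexing.
    base_nodes, objects = parser.get_nodes_and_objects(nodes)
    return base_nodes, objects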
Future Improvements
- Making options such as batch size, OCR settings, and other parameters configurable.
- Enhancing modularity to allow easy switching between different OCR engines (see the sketch below).
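As a starting point for that modularity item, a hedged sketch of a small factory keyed off an environment variable; DOCLING_OCR_ENGINE is a made-up name, not an existing Open WebUI or Docling setting.

import os

from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    RapidOcrOptions,
    TesseractOcrOptions,
)

def build_ocr_options(engine: str = ""):
    # Pick the OCR engine from an (assumed) environment variable, defaulting
    # to EasyOCR with GPU enabled as in the wrapper above.
    engine = (engine or os.getenv("DOCLING_OCR_ENGINE", "easyocr")).lower()
    if engine == "rapidocr":
        return RapidOcrOptions(force_full_page_ocr=True)
    if engine == "tesseract":
        return TesseractOcrOptions(force_full_page_ocr=True)
    return EasyOcrOptions(force_full_page_ocr=True, use_gpu=True)

# pipeline_options.ocr_options = build_ocr_options()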
I use this implementation in my OpenWebUI fork inside Docker in a production environment, and it runs flawlessly.
Closing due to conflicts, feel free to reopen!