
WIP: docling integration

Open · MichaelKarpe opened this pull request 9 months ago · 2 comments

Changelog Entry

Description

These are very minimal changes to test docling for experimental purposes; they are not ready for release yet. The goal is simply to highlight what could be done to integrate docling, for contributors who have time for further testing and implementation. I will not have enough time in the coming days or weeks to build on this, so please feel free to build upon it.

  • Approach 1: using langchain_docling, as in the first commit of this PR. I made it work, but it used the CPU rather than the GPU, leading to unreasonable processing times (a few minutes) for PDFs of hundreds of pages. Proposed improvements:

    • find how to use GPU when available
    • warn about processing times for long documents when only the CPU is available, and/or fall back to the default engine
    • handle full offline mode by integrating models into the library, see https://github.com/DS4SD/docling/issues/326
  • Approach 2: using a Docker image. The Docling team has recently been active on docling-serve (see https://github.com/DS4SD/docling-serve/pkgs/container/docling-serve), so we can probably expect a docling Docker image soon, as for Apache Tika. Then Ctrl+F for Tika usage in the Open WebUI repo and replicate that logic for Docling. Here, too, pay attention to CPU/GPU considerations.
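Since docling-serve is meant to be hit over HTTP like Tika, the wiring could look roughly like the sketch below. The service URL, endpoint path, and HTTP method here are placeholders (assumptions), not the real docling-serve API; only the Tika-style request pattern is the point.

```python
import urllib.request

# Hypothetical service location; in Open WebUI this would come from a config
# value, analogous to the existing TIKA_SERVER_URL setting.
DOCLING_SERVER_URL = "http://docling:5001"


def build_convert_url(base_url: str, endpoint: str = "convert") -> str:
    """Join the configured docling-serve base URL with its convert endpoint."""
    return f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}"


def extract_via_docling_serve(pdf_bytes: bytes, base_url: str = DOCLING_SERVER_URL) -> str:
    """Send raw PDF bytes to the (assumed) convert endpoint and return the body.

    Mirrors the Tika pattern already used in Open WebUI; the endpoint name,
    HTTP method, and response format are assumptions to be replaced with the
    real docling-serve API once it stabilizes.
    """
    req = urllib.request.Request(
        build_convert_url(base_url),
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```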

Both approaches could be implemented so that the user can choose between them depending on their needs.
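If both engines land, that choice could be a single settings value checked in the Loader, the way the existing content-extraction setting already switches to Tika today. A minimal, purely illustrative dispatch (the names below are hypothetical, not real Open WebUI config values):

```python
def resolve_extraction_engine(engine: str, gpu_available: bool) -> str:
    """Pick an extraction backend from a (hypothetical) user setting.

    Local docling on CPU is too slow for long PDFs (minutes per document),
    so fall back to the default loader in that case; docling-serve manages
    its own hardware, so it is passed through unchanged.
    """
    if engine == "docling" and not gpu_available:
        return "default"  # warn the user and use the built-in loader
    if engine in ("docling", "docling-serve"):
        return engine
    return "default"
```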

Added

  • [List any new features, functionalities, or additions]

Changed

  • [List any changes, updates, refactorings, or optimizations]

Deprecated

  • [List any deprecated functionality or features that have been removed]

Removed

  • [List any removed features, files, or functionalities]

Fixed

  • [List any fixes, corrections, or bug fixes]

Security

  • [List any new or updated security-related changes, including vulnerability fixes]

Breaking Changes

  • BREAKING CHANGE: [List any breaking changes affecting compatibility or functionality]

Additional Information

  • [Insert any additional context, notes, or explanations for the changes]
    • [Reference any related issues, commits, or other relevant information]

Screenshots or Videos

  • [Attach any relevant screenshots or videos demonstrating the changes]

Pull Request Checklist

Note to first-time contributors: Please open a discussion post in Discussions and describe your changes before submitting a pull request.

Before submitting, make sure you've checked the following:

  • [ ] Target branch: Please verify that the pull request targets the dev branch.
  • [ ] Description: Provide a concise description of the changes made in this pull request.
  • [ ] Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • [ ] Documentation: Have you updated the relevant documentation (Open WebUI Docs) or other documentation sources?
  • [ ] Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • [ ] Testing: Have you written and run sufficient tests for validating the changes?
  • [ ] Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • [ ] Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
    • BREAKING CHANGE: Significant changes that may affect compatibility
    • build: Changes that affect the build system or external dependencies
    • ci: Changes to our continuous integration processes or workflows
    • chore: Refactor, cleanup, or other non-functional code changes
    • docs: Documentation update or addition
    • feat: Introduces a new feature or enhancement to the codebase
    • fix: Bug fix or error correction
    • i18n: Internationalization or localization changes
    • perf: Performance improvement
    • refactor: Code restructuring for better maintainability, readability, or scalability
    • style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc.)
    • test: Adding missing tests or correcting existing tests
    • WIP: Work in progress, a temporary label for incomplete or ongoing work

MichaelKarpe, Feb 02 '25 13:02

Thanks for the work done. Better RAG will need better OCR tools, and Docling matches perfectly. Integrating docling will have a huge impact on the adoption of Open WebUI.

I am more aligned with your second approach, where we use docling-serve. Personally, I have tested the initial implementation at https://github.com/drmingler/docling-api, which is more industrialized, but I am sure docling will go in this direction.

Could we have input from the official Open WebUI team on how they want to work on docling? https://github.com/open-webui/open-webui/issues/7033

Switching between several OCR servers is a good idea. I do not know if it is best to do it in the general settings or per knowledge base. See https://github.com/open-webui/open-webui/discussions/9361

flefevre, Feb 06 '25 06:02

I am not a big fan of the LangChain Docling parser because it is too inflexible and rigid. Additionally, I am not sure if all parameters from Docling can be properly set in the LangChain wrapper, which further limits its flexibility. Docling is constantly evolving, and I am not sure if LangChain will implement necessary changes quickly enough. Furthermore, by using the LangChain wrapper, important metadata, such as the page label of the document, is lost.

For these reasons, I have written my own wrapper that inherits from LangChain’s BaseLoader. This ensures seamless integration into the existing LangChain loader and document object structure used in OpenWebUI while still allowing full functionality of Docling to be utilized.

My Implementation


import os
import tempfile
from typing import Iterator, List

import fitz  # PyMuPDF, used to split the PDF into single-page files
from langchain_core.documents import Document
from langchain_core.document_loaders import BaseLoader
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    PdfPipelineOptions,
    RapidOcrOptions,  # alternative OCR engine with a use_cuda flag
)
from docling.document_converter import DocumentConverter, PdfFormatOption

class PDFLoaderWrapper(BaseLoader):
    """Page-wise PDF loader that runs each page through Docling and
    returns one LangChain Document per page, preserving page metadata."""

    def __init__(self, file_path: str):
        self.file_path = file_path
        self.converter = self._initialize_converter()

    def _initialize_converter(self) -> DocumentConverter:
        from docling.datamodel.settings import settings

        # A small batch size keeps VRAM usage in check; the enrichment models
        # below are resource-intensive.
        settings.perf.elements_batch_size = 3

        pipeline_options = PdfPipelineOptions()
        pipeline_options.do_ocr = True
        pipeline_options.do_table_structure = True
        pipeline_options.table_structure_options.do_cell_matching = True
        pipeline_options.do_formula_enrichment = True
        pipeline_options.do_code_enrichment = True

        # RapidOCR is a drop-in alternative with CUDA support:
        # ocr_options = RapidOcrOptions(force_full_page_ocr=True, use_cuda=True, language="de")
        ocr_options = EasyOcrOptions(force_full_page_ocr=True, use_gpu=True)
        pipeline_options.ocr_options = ocr_options

        return DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
            }
        )

    def _split_pdf_temporarily(self) -> List[dict]:
        """Split the source PDF into one temporary single-page file per page,
        so that per-page metadata survives the Docling conversion."""
        doc = fitz.open(self.file_path)
        temp_files = []

        for page_num in range(len(doc)):
            temp_file = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
            temp_path = temp_file.name
            temp_file.close()

            new_doc = fitz.open()
            new_doc.insert_pdf(doc, from_page=page_num, to_page=page_num)
            new_doc.save(temp_path)
            new_doc.close()

            temp_files.append({"page": page_num + 1, "path": temp_path})

        doc.close()
        return temp_files

    def load(self) -> List[Document]:
        """Convert every page with Docling and return all pages eagerly."""
        temp_pdfs = self._split_pdf_temporarily()
        documents = []

        for pdf_info in temp_pdfs:
            pdf_path = pdf_info["path"]
            page_num = pdf_info["page"]
            # Fall back to the page number when no explicit label is available.
            page_label = pdf_info.get("page_label", str(page_num))

            doc = self.converter.convert(pdf_path).document
            md = doc.export_to_markdown()

            documents.append(
                Document(
                    page_content=md,
                    metadata={"page": page_num, "page_label": page_label},
                )
            )

            os.remove(pdf_path)

        return documents

    def lazy_load(self) -> Iterator[Document]:
        """Like load(), but yields pages one at a time to bound memory usage."""
        temp_pdfs = self._split_pdf_temporarily()

        for pdf_info in temp_pdfs:
            pdf_path = pdf_info["path"]
            page_num = pdf_info["page"]
            page_label = pdf_info.get("page_label", str(page_num))

            doc = self.converter.convert(pdf_path).document
            md = doc.export_to_markdown()

            yield Document(
                page_content=md,
                metadata={"page": page_num, "page_label": page_label},
            )

            os.remove(pdf_path)

and then in the Loader class:

    if file_ext == "pdf":
        loader = PDFLoaderWrapper(file_path=file_path)
        log.info("cool docling nice")

  • I split the PDF into individual pages before processing to retain page label information and other metadata.

  • I leverage Docling’s accelerator options to enable GPU usage, as mentioned in the draft.

  • By using pipeline_options, I enable:

    • OCR processing (do_ocr=True)
    • Table structure extraction (do_table_structure=True)
    • Formula enrichment (do_formula_enrichment=True)
    • Code detection (do_code_enrichment=True)
  • Instead of using LangChain’s built-in Docling loader, I directly integrate Docling's API to retain complete metadata and improve flexibility.

  • The OCR engine can be set to EasyOCR or RapidOCR (or other engines), but I chose these specifically because they have a use_gpu=True flag for better performance.

  • By adjusting batch size (settings.perf.elements_batch_size = 3), I can optimize VRAM usage, which is crucial since features like formula enrichment and table structure extraction are resource-intensive. Currently, this is controlled by a single parameter, but it should be configurable separately for each model in the future to further optimize memory usage. For more context, see the related discussion: 🔗 Issue #871 - Comment
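One way to implement the "use GPU when available" item from the draft is to probe for a CUDA device before constructing the OCR options. This sketch assumes PyTorch can serve as the accelerator check (EasyOCR uses it under the hood); `detect_ocr_device` is a hypothetical helper, not part of docling:

```python
def detect_ocr_device() -> str:
    """Return "cuda" when a CUDA-capable GPU is usable, else "cpu".

    Falls back to CPU when torch is missing, so callers can warn about
    slow processing on long PDFs instead of failing outright.
    """
    try:
        import torch  # optional dependency, assumed accelerator backend
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"


use_gpu = detect_ocr_device() == "cuda"
# e.g. ocr_options = EasyOcrOptions(force_full_page_ocr=True, use_gpu=use_gpu)
```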

Final Step: Markdown Conversion for LLM Processing

To improve processing accuracy and enhance downstream tasks, I convert the document to Markdown at the end. This allows:

  • Better LLM comprehension due to structured formatting.
  • Additional processing steps like using LlamaIndex's MarkdownElementNodeParser, which can further analyze tables, formulas, and structured elements in detail.
  • More flexible post-processing by leveraging Markdown's structured format for advanced RAG techniques.
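To make the "structured formatting" benefit concrete, here is a stdlib-only toy splitter that chunks Markdown by headings, the kind of structure-aware post-processing that blind character windows cannot do. A real pipeline would use a maintained splitter (e.g. the LlamaIndex parser mentioned above); this is only a sketch:

```python
import re


def split_markdown_by_headings(md: str) -> list[dict]:
    """Split a Markdown string into sections keyed by their nearest heading."""
    sections: list[dict] = []
    current = {"heading": None, "lines": []}
    for line in md.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if match:
            # Close the previous section (if any) and start a new one.
            if current["lines"] or current["heading"]:
                sections.append(current)
            current = {"heading": match.group(2), "lines": []}
        else:
            current["lines"].append(line)
    sections.append(current)
    return [
        {"heading": s["heading"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]
```

Each chunk then carries its heading as retrievable context, which pairs naturally with the per-page `page_label` metadata kept by the loader above.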

Future Improvements

  • Making options like batch size, OCR settings, and other parameters configurable.
  • Enhancing modularity to allow easy switching between different OCR engines.

I use this implementation in my OpenWebUI fork inside Docker in a production environment, and it runs flawlessly.

JPC612, Feb 13 '25 14:02

Closing due to conflicts, feel free to reopen!

tjbck, Feb 18 '25 00:02