[Feature Request]: Requesting integration of widely used Tesseract OCR

Open vishaldwdi opened this issue 1 year ago • 1 comments

Is there an existing issue for the same feature request?

  • [X] I have checked the existing issues.

Is your feature request related to a problem?

I am aware that RagFlow already has deepdoc, but I would like to request integration of Tesseract OCR, which is extremely popular, widely used, and supports more than 100 languages.

Easily applicable.

Describe the feature you'd like

Requesting an easily implementable feature enhancement to achieve the workflow below, so that the lives of researchers, investigators, officials, and PhD students could improve.

(I'm writing this through the eyes of a researcher. I'm aware that certain capabilities may already have been implemented in some way or form, and that part of this was requested earlier. My goal is to make the end product much more cohesive, considering the upcoming university season.)

# High-Level Workflow

Step 1: Add Document for RAG - User uploads a document (e.g., PDF, image, or text file) to the system. RagFlow ingests the document and stores it in the respective database.

Step 2: RagFlow Checks if Document Requires OCR - RagFlow analyzes the document to determine whether it requires OCR (Optical Character Recognition). If the document is an image or scanned PDF, it likely requires OCR.
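A minimal sketch of how the check in Step 2 could look. This is not RagFlow code: the function name `needs_ocr`, the suffix list, and the `min_chars_per_page` threshold are all assumptions made for illustration. It assumes the caller has already tried to extract the embedded text layer of a PDF and passes it in as one string per page.

```python
from pathlib import Path
from typing import Optional, Sequence

# Hypothetical list of image formats that always go through OCR.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def needs_ocr(
    filename: str,
    page_texts: Optional[Sequence[str]] = None,
    min_chars_per_page: int = 20,  # assumed threshold, tune per corpus
) -> bool:
    """Guess whether a document needs OCR before indexing."""
    suffix = Path(filename).suffix.lower()

    # Plain images carry no text layer at all.
    if suffix in IMAGE_SUFFIXES:
        return True

    if suffix == ".pdf":
        # No extracted pages: treat the PDF as scanned.
        if not page_texts:
            return True
        # A near-empty text layer usually means the pages are scans.
        average = sum(len(text.strip()) for text in page_texts) / len(page_texts)
        return average < min_chars_per_page

    # Text-based formats (.txt, .md, ...) can be indexed directly.
    return False
```

With this heuristic, `needs_ocr("scan.png")` and `needs_ocr("paper.pdf", ["", " "])` both send the document to OCR, while a PDF with a real text layer skips it.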

Step 3: OpenCV + Pillow Preprocessing Prior to OCR - If OCR is required, RagFlow utilizes Tesseract OCR with OpenCV and Pillow preprocessing to extract text from the document. The extracted text is then stored to improve the respective database.

(I have personally tested that OpenCV + Pillow preprocessing prior to Tesseract improves complex text recognition by 52% while supporting more than 100 languages.)
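A rough sketch of what the preprocessing in Step 3 might involve, assuming the `opencv-python`, `Pillow`, and `pytesseract` packages plus a local Tesseract install. The specific steps (grayscale, median blur, Otsu threshold) are common choices and an assumption here, not necessarily the pipeline that was tested above. The function names are hypothetical.

```python
from typing import Sequence


def tesseract_lang(langs: Sequence[str]) -> str:
    """Join language codes the way Tesseract expects, e.g. 'eng+deu'."""
    return "+".join(langs)


def ocr_with_preprocessing(path: str, langs: Sequence[str] = ("eng",)) -> str:
    """Clean up an image with OpenCV, then run Tesseract on it."""
    # Imported here so the module still loads when the optional
    # OCR dependencies are not installed.
    import cv2
    import pytesseract
    from PIL import Image

    image = cv2.imread(path)
    if image is None:
        raise FileNotFoundError(f"could not read image: {path}")

    # Grayscale, light denoising, then Otsu binarization.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    denoised = cv2.medianBlur(gray, 3)
    _, binary = cv2.threshold(
        denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
    )

    # Hand the cleaned image to Tesseract as a Pillow image.
    pil_image = Image.fromarray(binary)
    return pytesseract.image_to_string(pil_image, lang=tesseract_lang(langs))
```

Calling `ocr_with_preprocessing("page.png", langs=("eng", "deu"))` would run Tesseract with both language packs, provided they are installed.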

Step 4: Database Improvement - If OCR was required, the extracted text is used to improve the database. If OCR was not required, RagFlow uses its inbuilt capabilities to improve the database with the uploaded document.

Step 5: User Enters Query - The user enters a query or question.

Step 6: Database Search and Web Search (if database insufficient) - RagFlow searches the database to satisfy the user's query. If the database search yields insufficient results, RagFlow utilizes a web search API (e.g., Google Custom Search or SearXNG) to fetch relevant results. The web search results are then stored in the database.
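The fallback in Step 6 could be sketched roughly as follows. Everything here is hypothetical: `retrieve`, the three callables, and the `min_hits` cutoff for "insufficient" are placeholders, and a real version would plug in RagFlow's retrieval and an actual web search client.

```python
from typing import Callable, Iterable, List


def retrieve(
    query: str,
    search_db: Callable[[str], Iterable[str]],   # hypothetical database lookup
    search_web: Callable[[str], Iterable[str]],  # hypothetical web search client
    store: Callable[[List[str]], None],          # hypothetical database writer
    min_hits: int = 3,                           # assumed "sufficient" cutoff
) -> List[str]:
    """Search the database first and fall back to the web if needed."""
    hits = list(search_db(query))
    if len(hits) >= min_hits:
        return hits

    # Not enough local results: query the web and keep what comes back
    # so the next similar query can be answered from the database.
    web_hits = list(search_web(query))
    if web_hits:
        store(web_hits)
    return hits + web_hits
```

Passing the backends in as callables keeps the sketch independent of any particular search API, so Google Custom Search or SearXNG could be swapped in without touching the fallback logic.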

Step 7: RagFlow Processing - RagFlow processes the query using LLMs accessed through APIs. The LLMs generate a response based on the database search and web search results.

Step 8: Response Generation - RagFlow generates a response to the user's query, utilizing the processed information. This workflow integrates OCR, web search, and LLM capabilities to provide accurate and up-to-date responses to user queries.

Reference: https://github.com/ItzCrazyKns/Perplexica https://github.com/tesseract-ocr/tesseract https://pypi.org/project/opencv-python/ https://pillow.readthedocs.io/en/stable/

Describe implementation you've considered

No response

Documentation, adoption, use case

No response

Additional information

No response

vishaldwdi avatar Aug 12 '24 01:08 vishaldwdi

+1 I'm also interested in this feature.

netandreus avatar Aug 20 '24 09:08 netandreus

@vishaldwdi Thanks for your thoughtful suggestion, and apologies for the delayed reply! ⏳

Currently, our deepdoc module uses OCR and layout detection models from 🤗 Hugging Face, rather than Tesseract. We've found these models offer strong performance across a variety of documents 📄✨. That said, we're definitely open to exploring enhancements!

If you have specific advantages or use cases where Tesseract might outperform, we'd love to hear more so we can evaluate the trade-offs more clearly 🤓🔍

Appreciate your input. Feel free to keep the ideas coming! 🚀

which-W avatar May 16 '25 08:05 which-W