DocsGPT
🚀 Feature: OCR
🔖 Feature description
I have a suggestion to enable PDF file ingestion with OCR. I am studying the project for use in the legal field. However, many documents there are scanned images whose text is not searchable, so OCR is required to extract it. The idea: if standard extraction yields fewer than X characters for a page, OCR is triggered for that page.
🎤 Why is this feature needed ?
I wrote this code, but I am an amateur and did not consider speed or performance. It would be interesting if you analyzed and implemented this functionality in an optimized way so that performance is not affected. The code checks whether standard text extraction returns fewer than X characters for a page. If it does, the page is likely a scanned image, so OCR is triggered. Does it make sense?
✌️ How do you aim to achieve this?
docs_parser.py
    from pathlib import Path
    from typing import Dict

    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    from application.parser.file.base_parser import BaseParser


    class PDFParser(BaseParser):
        """PDF parser with optional OCR support."""

        def __init__(self, use_ocr: bool = False, ocr_threshold: int = 10):
            """
            Initializes the PDF parser.

            :param use_ocr: Flag to enable OCR for pages that don't have enough extractable text.
            :param ocr_threshold: Pages with fewer extracted characters than this are sent to OCR.
            """
            self.use_ocr = use_ocr
            self.ocr_threshold = ocr_threshold

        def _init_parser(self) -> Dict:
            """Init parser."""
            return {}

        def parse_file(self, file: Path, errors: str = "ignore") -> str:
            """Parse file."""
            text_list = []
            # The context manager closes the document once parsing is done.
            with fitz.open(file) as pdf:
                for page in pdf:
                    page_text = page.get_text()
                    # Fall back to OCR when the text layer is (nearly) empty.
                    # Whitespace is stripped so blank-looking pages count as empty.
                    if self.use_ocr and len(page_text.strip()) < self.ocr_threshold:
                        page_text = self._extract_text_with_ocr(page)
                    text_list.append(page_text)
            return "\n".join(text_list)

        def _extract_text_with_ocr(self, page) -> str:
            """
            Extracts text from a PDF page using OCR.

            :param page: The PDF page from PyMuPDF.
            :return: Extracted text using OCR.
            """
            # Render the page to an RGB bitmap and hand it to Tesseract.
            pix = page.get_pixmap()
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            return pytesseract.image_to_string(img)
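
One way to keep the "fewer than X characters" check cheap and testable is to separate the decision from the OCR backend. The following is only a sketch under assumptions: `needs_ocr`, `extract_pages`, and the `get_text` / `run_ocr` callables are hypothetical names, not part of DocsGPT, and the fake pages stand in for real PyMuPDF pages so the logic can run without Tesseract installed.

```python
from typing import Callable, Iterable, List, TypeVar

P = TypeVar("P")


def needs_ocr(page_text: str, ocr_threshold: int) -> bool:
    """Return True when the text layer is too sparse to trust."""
    return len(page_text.strip()) < ocr_threshold


def extract_pages(
    pages: Iterable[P],
    get_text: Callable[[P], str],
    run_ocr: Callable[[P], str],
    ocr_threshold: int = 10,
) -> List[str]:
    """Use the normal text layer and call OCR only for sparse pages."""
    results = []
    for page in pages:
        text = get_text(page)
        if needs_ocr(text, ocr_threshold):
            text = run_ocr(page)  # the expensive path, taken only when needed
        results.append(text)
    return results


# Stand-in pages (hypothetical): one searchable, one scanned.
fake_pages = [
    {"text": "A searchable page with plenty of text.", "scan": "unused"},
    {"text": "  \n", "scan": "text recovered by OCR"},
]
ocr_calls = []


def fake_ocr(page):
    ocr_calls.append(page)
    return page["scan"]


texts = extract_pages(fake_pages, lambda p: p["text"], fake_ocr)
print(texts)
```

With this split, the slow OCR call runs for the scanned page only, and the threshold rule can be unit-tested without any PDF at hand.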
🔄️ Additional Information
No response
👀 Have you spent some time to check if this feature request has been raised before?
- [X] I checked and didn't find similar issue
Are you willing to submit PR?
Yes I am willing to submit a PR!
I am unable to optimize the tool and make a git pull request. The function worked on my computer, but very slowly. If anyone can take on this improvement, I would be grateful. I believe it will be a substantial optimization of the tool, not only for me but for several other usage scenarios.
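
On the slowness: one possible direction, offered as a sketch rather than a measured fix, is to collect the sparse pages first and OCR them concurrently. Since pytesseract runs the Tesseract binary as a separate process, a thread pool may let several pages be recognized at once. `ocr_sparse_pages` and the `run_ocr` callable (assumed to take a page index and return text) are hypothetical names. Whether the rendering library can safely be called from worker threads is something to verify, so rendering the page images up front on the main thread is probably the safer choice.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def ocr_sparse_pages(
    page_texts: Sequence[str],
    run_ocr: Callable[[int], str],
    ocr_threshold: int = 10,
    max_workers: int = 4,
) -> List[str]:
    """Replace sparse page texts with OCR output, running OCR in parallel."""
    # Indices of pages whose text layer is below the threshold.
    sparse = [
        i for i, text in enumerate(page_texts)
        if len(text.strip()) < ocr_threshold
    ]
    results = list(page_texts)
    if not sparse:
        return results  # nothing to OCR, so no pool is started
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input indices.
        for index, ocr_text in zip(sparse, pool.map(run_ocr, sparse)):
            results[index] = ocr_text
    return results


# Stand-in OCR function (hypothetical) so the sketch runs without Tesseract.
page_texts = ["full text of page one", "", "  "]
result = ocr_sparse_pages(page_texts, lambda i: f"ocr text for page {i}")
print(result)
```

Pages that already have a usable text layer never touch the pool, so documents that are mostly searchable should pay little extra cost.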
@dartpain
Appreciate the attempt, @Fagner-lourenco.
I want to work on this. Please assign this task to me.
That's a great issue to work on, @sOnU1002. Thank you. If you have any questions about it, please don't hesitate to ask.
Hey, I have implemented the feature, but where do I add it?
That's amazing. I suggest you check this part of the code, https://github.com/arc53/DocsGPT/blob/main/application/worker.py, and this one, https://github.com/arc53/DocsGPT/blob/main/application/parser/file/docs_parser.py. This is where we have our parser logic.
Fix for ModuleNotFoundError: No module named 'pytesseract'
This issue occurs due to the missing pytesseract module, which is required for OCR functionality in the docs_parser.py file. Follow these steps to resolve the issue:
- Install the pytesseract module: Ensure `pytesseract` is included in the `requirements.txt` file. Run the following command to install it: `pip install pytesseract`
- Install Tesseract OCR: pytesseract depends on Tesseract OCR. Install it on your system as follows:
  - Ubuntu/Debian: `sudo apt-get install tesseract-ocr`
  - macOS (using Homebrew): `brew install tesseract`
  - Windows: Download and install from the official Tesseract OCR repository.
- Update the CI/CD pipeline: In the CI configuration (e.g., GitHub Actions), add the installation steps:

      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Install Tesseract OCR (Ubuntu example)
        run: sudo apt-get install tesseract-ocr

- Re-run the tests: Execute the following command to ensure the issue is resolved: `python -m pytest --cov=application --cov-report=xml`
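
To fail with a clear message instead of a `ModuleNotFoundError` deep inside the parser, the two dependencies could be checked up front. This is a minimal sketch: `missing_ocr_dependencies` is a hypothetical helper, and it assumes the Tesseract executable is named `tesseract` and is on `PATH`.

```python
import importlib.util
import shutil
from typing import List


def missing_ocr_dependencies() -> List[str]:
    """Return human-readable descriptions of missing OCR dependencies."""
    missing = []
    # The Python wrapper: installed with pip.
    if importlib.util.find_spec("pytesseract") is None:
        missing.append("pytesseract (pip install pytesseract)")
    # The OCR engine itself: a system binary, not a Python package.
    if shutil.which("tesseract") is None:
        missing.append("tesseract binary (e.g. apt-get install tesseract-ocr)")
    return missing


problems = missing_ocr_dependencies()
if problems:
    print("OCR disabled, missing:", ", ".join(problems))
else:
    print("OCR dependencies found")
```

A parser could call such a check once at startup and fall back to plain text extraction when anything is missing.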
I suggest you include it in the list of dependencies.