Russian and Kazakh OCR

Open Sergey-alm opened this issue 1 year ago • 3 comments

Hello! I have installed the Russian and Kazakh OCR languages, but Papermerge does not work with them. A gray circle is shown after processing, and search does not find Russian or Kazakh words.

Info:

  • Papermerge Version 3.2

Sergey-alm avatar Aug 15 '24 10:08 Sergey-alm

I have implemented support for the Russian and Kazakh OCR languages in my own setup, and everything is working fine. In practice, you need to do a little more than what is described in the documentation, so here's a step-by-step guide to how I did it:

  1. Create your custom OCR Docker image:

First, create your own OCR worker image that includes the necessary languages. Create a Dockerfile based on the existing papermerge/ocrworker:0.3.1 image and install the required OCR language packages:

FROM papermerge/ocrworker:0.3.1
# Add the required languages here (Debian packages are named tesseract-ocr-<code>)
RUN apt-get update \
    && apt-get install -y tesseract-ocr-kaz tesseract-ocr-rus \
    && rm -rf /var/lib/apt/lists/*
  2. Verify the languages in the OCR worker:

Once the Docker image is built and the OCR worker is running, verify that the languages are installed:

docker exec -it <ocr_worker_docker_container_id> tesseract --list-langs
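
The check in this step can also be scripted, for example as part of a deployment smoke test. The following is only a minimal sketch: the function names are hypothetical, and it assumes the listing starts with a header line beginning with "List of" followed by one language code per line. Both output streams are read, on the assumption that some tesseract versions print the list to stderr.

```python
import subprocess

# Languages installed by the custom Dockerfile in step 1.
REQUIRED_LANGS = {"kaz", "rus"}


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output.

    Assumption: the header line starts with "List of"; every other
    non-empty line is treated as a language code.
    """
    codes = set()
    for line in output.splitlines():
        line = line.strip()
        if line and not line.startswith("List of"):
            codes.add(line)
    return codes


def missing_langs(container: str, required: set[str] = REQUIRED_LANGS) -> set[str]:
    """Return the required languages that the running OCR worker lacks."""
    result = subprocess.run(
        ["docker", "exec", container, "tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Read both streams, since the listing may arrive on either one.
    installed = parse_list_langs(result.stdout + "\n" + result.stderr)
    return required - installed
```

Calling missing_langs("<ocr_worker_docker_container_id>") should return an empty set once both language packages are present in the worker image.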

  3. Add the language support in the Papermerge codebase:
  • Update the OCR task schema: In the papermerge/core/features/tasks/schema.py file, add the new language codes to the LangCode type:
LangCode = Literal[
    "ces",
    "dan",
    "deu",
    "ell",
    "eng",
    "fas",
    "fin",
    "fra",
    "guj",
    "heb",
    "hin",
    "ita",
    "jpn",
    "kor",
    "lit",
    "nld",
    "nor",
    "pol",
    "por",
    "ron",
    "san",
    "spa",
    # add additional languages here
    "kaz",
    "rus",
]
  • Update UI constants: In the ui2/src/cconstants.ts file, add the required language names:
export const OCR_LANG: OCRLangType = {
    ces: "Čeština",
    dan: "Dansk",
    deu: "Deutsch",
    ell: "Ελληνικά",
    eng: "English",
    fin: "Suomi",
    fra: "Français",
    guj: "ગુજરાતી",
    heb: "עברית",
    hin: "हिंदी",
    ita: "Italiano",
    jpn: "日本語",
    kor: "한국어",
    lit: "Lietuvių",
    nld: "Nederlands",
    nor: "Norsk",
    osd: "Osd",
    pol: "Polski",
    por: "Português",
    ron: "Română",
    san: "संस्कृत",
    spa: "Español",
    // Add additional languages here
    kaz: "Қазақша",
    rus: "Русский",
};
  • Update OCRCode Type: In the ui2/src/types.ts and ui2/src/types/ocr.ts files, extend the OCRCode type:
export type OCRCode = 
    | "ces" | "dan" | "deu" | "ell" | "eng" | "fin" | "fra" | "guj" | "heb"
    | "hin" | "ita" | "jpn" | "kor" | "lit" | "nld" | "nor" | "osd" | "pol"
    | "por" | "ron" | "san" | "spa"
    // Add additional languages here
    | "kaz" | "rus";
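
The three edits in this step have to stay in sync, because a code missing from any one list will not be usable end to end. As a rough illustration of why the backend change matters, the sketch below uses only the standard library to show how a Literal type enumerates the accepted codes. The shortened LangCode, the UI_CODES set, and both helper functions are hypothetical stand-ins, and how Papermerge itself validates the value is not shown in this thread.

```python
from typing import Literal, get_args

# Shortened copy of the backend LangCode type, for illustration only.
LangCode = Literal["deu", "eng", "fra", "kaz", "rus"]

# Hypothetical stand-in for the keys of OCR_LANG / the OCRCode union in the UI.
UI_CODES = {"deu", "eng", "fra", "osd", "kaz", "rus"}


def is_supported(code: str) -> bool:
    """True if the code is one of the values listed in LangCode."""
    return code in get_args(LangCode)


def missing_from(codes: set[str], required: set[str]) -> set[str]:
    """Return the required codes that are absent from the given set."""
    return required - codes


backend_codes = set(get_args(LangCode))
new_langs = {"kaz", "rus"}

# Both sets should be empty once every list has been updated.
print(missing_from(backend_codes, new_langs))
print(missing_from(UI_CODES, new_langs))
```

A code left out of LangCode, such as "rus" before the edit above, would make is_supported return False even if the tesseract language pack is installed in the worker.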
  4. Build a custom Papermerge image:
docker buildx build --platform linux/amd64 -t myimage:0.0.1 -f docker/standard/Dockerfile .
  5. Run Papermerge with the custom OCR worker.

bl1nkker avatar Apr 02 '25 11:04 bl1nkker

@ciur, I just wanted to point out that while the process for adding OCR languages in Papermerge is generally straightforward (which I really appreciate), it currently requires a few extra steps that aren't mentioned in the documentation.

It would be great if the documentation could be updated to include these steps.

bl1nkker avatar Apr 02 '25 12:04 bl1nkker

@bl1nkker thank you for the nicely organized guide. I've added it to the documentation.

ciur avatar Apr 02 '25 19:04 ciur