Russian and Kazakh OCR
Hello! I have installed the Russian and Kazakh OCR languages, but Papermerge does not work with them. The circle stays gray after processing, and search does not find Russian or Kazakh words.
Info:
- Papermerge Version 3.2
I have implemented support for Russian and Kazakh OCR languages in my own setup, and everything is working fine. In practice, you need to do a little more than what the documentation describes, so here’s a step-by-step guide to how I did it:
- Create your custom OCR Docker image:
First, create your own OCR worker image that includes the necessary languages. Create a Dockerfile based on the existing papermerge/ocrworker:0.3.1 image and install the required OCR language packages:
FROM papermerge/ocrworker:0.3.1
# Add the required languages here
RUN apt-get update \
 && apt-get install -y --no-install-recommends tesseract-ocr-kaz tesseract-ocr-rus \
 && rm -rf /var/lib/apt/lists/*
- Verify the languages in the OCR worker:
Once the Docker image is built and the OCR worker is running, verify that the languages are installed:
docker exec -it <ocr_worker_docker_container_id> tesseract --list-langs
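If you want to script this check (for example in CI), here is a minimal Python sketch. It assumes that tesseract --list-langs prints a header line starting with "List of available languages" followed by one language code per line. The names parse_langs, missing_langs and check_container are hypothetical helpers, not part of Papermerge.

```python
import subprocess

REQUIRED_LANGS = {"kaz", "rus"}  # languages added in the custom image


def parse_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    codes = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.startswith("List of available languages"):
            continue
        codes.add(line)
    return codes


def missing_langs(output: str, required=REQUIRED_LANGS) -> set[str]:
    """Return the required codes that tesseract did not report."""
    return set(required) - parse_langs(output)


def check_container(container_id: str) -> set[str]:
    """Run tesseract inside the OCR worker container (needs a running Docker daemon)."""
    result = subprocess.run(
        ["docker", "exec", container_id, "tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some tesseract versions print the list to stderr, so look at both streams.
    return missing_langs(result.stdout + "\n" + result.stderr)
```

An empty set returned by check_container("<ocr_worker_docker_container_id>") would mean both languages are installed in the worker.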
- Add the language support in the Papermerge codebase:
- Update OCR task schema:
In the papermerge/core/features/tasks/schema.py file, add the new language codes to the LangCode type:
LangCode = Literal[
"ces",
"dan",
"deu",
"ell",
"eng",
"fas",
"fin",
"fra",
"guj",
"heb",
"hin",
"ita",
"jpn",
"kor",
"lit",
"nld",
"nor",
"pol",
"por",
"ron",
"san",
"spa",
# add additional languages here
"kaz",
"rus",
]
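A quick way to confirm that the new codes are accepted on the Python side is to inspect the Literal with typing.get_args. This is a standalone sketch: it copies the list from this step instead of importing it from Papermerge, and is_supported is a hypothetical helper.

```python
from typing import Literal, get_args

# Copied from the schema.py snippet in this guide, including the two new codes.
LangCode = Literal[
    "ces", "dan", "deu", "ell", "eng", "fas", "fin", "fra", "guj", "heb",
    "hin", "ita", "jpn", "kor", "lit", "nld", "nor", "pol", "por", "ron",
    "san", "spa",
    "kaz", "rus",
]


def is_supported(code: str) -> bool:
    """True if `code` is one of the values allowed by LangCode."""
    # get_args returns the tuple of literal values, e.g. ("ces", "dan", ...).
    return code in get_args(LangCode)


if __name__ == "__main__":
    for code in ("kaz", "rus"):
        print(code, "supported" if is_supported(code) else "MISSING")
```

If either code shows up as MISSING, the Literal was not updated and OCR tasks for that language would most likely be rejected by schema validation.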
- Update UI constants:
In the ui2/src/cconstants.ts file, add the required language names:
export const OCR_LANG: OCRLangType = {
ces: "Čeština",
dan: "Dansk",
deu: "Deutsch",
ell: "Ελληνικά",
eng: "English",
fin: "Suomi",
fra: "Français",
guj: "ગુજરાતી",
heb: "עברית",
hin: "हिंदी",
ita: "Italiano",
jpn: "日本語",
kor: "한국어",
lit: "Lietuvių",
nld: "Nederlands",
nor: "Norsk",
osd: "Osd",
pol: "Polski",
por: "Português",
ron: "Română",
san: "संस्कृत",
spa: "Español",
// Add additional languages here
kaz: "Қазақша",
rus: "Русский",
};
- Update OCRCode type:
In the ui2/src/types.ts and ui2/src/types/ocr.ts files, extend the OCRCode type:
export type OCRCode =
| "ces" | "dan" | "deu" | "ell" | "eng" | "fin" | "fra" | "guj" | "heb"
| "hin" | "ita" | "jpn" | "kor" | "lit" | "nld" | "nor" | "osd" | "pol"
| "por" | "ron" | "san" | "spa"
// Add additional languages here
| "kaz" | "rus";
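Because the same codes have to be added in several files, it is easy to miss one of them. Here is a small Python sketch that compares the backend and UI lists. Both sets are copied by hand from the snippets in this guide (they are not read from the repository), and diff_codes is a hypothetical helper.

```python
# Codes from the LangCode Literal in schema.py, as posted in this guide.
BACKEND_CODES = {
    "ces", "dan", "deu", "ell", "eng", "fas", "fin", "fra", "guj", "heb",
    "hin", "ita", "jpn", "kor", "lit", "nld", "nor", "pol", "por", "ron",
    "san", "spa", "kaz", "rus",
}

# Codes from the OCRCode type in the UI, as posted in this guide.
UI_CODES = {
    "ces", "dan", "deu", "ell", "eng", "fin", "fra", "guj", "heb", "hin",
    "ita", "jpn", "kor", "lit", "nld", "nor", "osd", "pol", "por", "ron",
    "san", "spa", "kaz", "rus",
}


def diff_codes(backend: set[str], ui: set[str]) -> dict[str, list[str]]:
    """Report codes that appear on only one side."""
    return {
        "backend_only": sorted(backend - ui),
        "ui_only": sorted(ui - backend),
    }


if __name__ == "__main__":
    print(diff_codes(BACKEND_CODES, UI_CODES))
    # The newly added codes should be present on both sides.
    print({"kaz", "rus"} <= (BACKEND_CODES & UI_CODES))
```

With the lists exactly as posted here, kaz and rus are present on both sides; the only differences are fas (backend only) and osd (UI only), which were already there before this change.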
- Build custom Papermerge image:
docker buildx build --platform linux/amd64 -t myimage:0.0.1 -f docker/standard/Dockerfile .
- Run Papermerge with the custom OCR worker
@ciur, I just wanted to point out that while the process for adding OCR languages in Papermerge is generally straightforward (which I really appreciate), it currently requires a few extra steps that aren't mentioned in the documentation.
It would be great if the documentation could be updated to include these steps.
@bl1nkker thank you for the nicely organized guide. I've added it to the documentation.