[Bug]: Secondary OCR Language not showing up

Open VilterPD opened this issue 1 year ago • 3 comments

The Problem

Hi dear StirlingPDF Team, lovely service, I use it for everything.

I'm just having a problem with a secondary OCR language not showing up. I have added deu.traineddata, and Tesseract is working:

docker exec -it _redacted_ tesseract --list-langs
[DS] Profile read from file (tesseract_opencl_profile_devices.dat).
[DS] Device[1] 0:(null) score is 0.240261
[DS] Selected Device[1]: "(null)" (Native)
List of available languages in "/data/tessdata/" (2):
deu
eng

I ran Tesseract manually inside the container, and it correctly wrote the recognized text to a .txt file, including umlauts.

But inside Stirling-PDF, only English shows up.

I believe I followed every step in How to use OCR.

My compose/stack config is below. Everything seems to be in order, and I did not find an env variable I would have to change.

I'm sure it's a simple fix I'm overlooking. I'm running in Portainer as a stack, app version 0.27.0.

Screenshot 2024-08-09 at 11-53-45 Trauminsel Reisen 🌴 PDF - OCR _ Scan-Bereinigung

Version of Stirling-PDF

0.27.0

Last Working Version of Stirling-PDF

0.27.0

Page Where the Problem Occurred

https://pdf.trauminselreisen.de

Docker Configuration

version: '3.3'
services:
  stirling-pdf:
    image: frooodle/s-pdf:latest
    ports:
      - '82:8080'  # Port mapping taken from the container information
    volumes:
      - stirling:/data
      - /home/phil/tessdata:/data/tessdata
    environment:
      - DOCKER_ENABLE_SECURITY=false
      - INSTALL_BOOK_AND_ADVANCED_HTML_OPS=true
      - LANGS=de_DE
      - CUSTOM_FILES_DIR=/data/customFiles
      - UI_APP_NAME=Trauminsel Reisen 🌴 PDF
      - UI_HOME_DESCRIPTION=Alle PDF Tools von Trauminsel Reisen 🌴
      - UI_APP_NAVBAR_NAME= PDF 🌴 Trauminsel Reisen
      - TESSDATA_PREFIX=/data/tessdata
      - CONFIGS_DIR=/data/configs
      - JAVA_TOOL_OPTIONS=-XX:MaxRAMPercentage=75
      - APP_LOCALE=de_DE
      - SYSTEM_DEFAULT_LOCALE=de-DE
      - BASE_URL=https://pdf.trauminselreisen.de
    entrypoint:
      - tini
      - --
      - /scripts/init.sh
    command: ["java", "-Dfile.encoding=UTF-8", "-jar", "/app.jar"]
volumes:
  stirling:

Relevant Log Output

08:30:24.649 [Thread-9] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.optimize - xref 11: treating as an optimization candidate

08:30:25.388 [Thread-9] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.optimize - XrefExt(xref=11, ext='.png')

08:30:25.388 [Thread-9] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.optimize - Optimizable images: JPEGs: 0 PNGs: 1

08:30:25.389 [Thread-9] INFO  s.s.SPDF.utils.ProcessExecutor - 

08:30:25.389 [Thread-9] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.optimize - Recursing into Form XObject /OCR-XUNRAmg1o3nrUv2lQSIpmQ in page 0

08:30:25.389 [Thread-9] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.optimize - xref 11: treating as an optimization candidate

08:30:25.390 [Thread-9] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.optimize - xref 11: marking this JPEG as deflatable

08:30:25.396 [Thread-9] INFO  s.s.SPDF.utils.ProcessExecutor - 

08:30:25.397 [Thread-9] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.optimize - Recursing into Form XObject /OCR-XUNRAmg1o3nrUv2lQSIpmQ in page 0

08:30:25.397 [Thread-9] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.optimize - xref 11: treating as an optimization candidate

08:30:25.397 [Thread-9] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.optimize - xref 11: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization

08:30:25.398 [Thread-9] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.optimize - Optimizable images: JBIG2 groups: 0

08:30:25.398 [Thread-9] INFO  s.s.SPDF.utils.ProcessExecutor - 

08:30:25.402 [Thread-9] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.helpers - os.symlink(/tmp/ocrmypdf.io.i3_3_8x0/optimize.opt.pdf, /tmp/ocrmypdf.io.i3_3_8x0/optimize.pdf)

08:30:25.402 [Thread-9] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.subprocess - Running: ['jbig2', '--version']

08:30:25.465 [Thread-9] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.subprocess - Running: ['pngquant', '--version']

08:30:25.466 [Thread-9] INFO  s.s.SPDF.utils.ProcessExecutor -    INFO ocrmypdf._pipeline - Image optimization ratio: 3.45 savings: 71.0%

08:30:25.466 [Thread-9] INFO  s.s.SPDF.utils.ProcessExecutor -    INFO ocrmypdf._pipeline - Total file size ratio: 3.25 savings: 69.2%

08:30:25.466 [Thread-9] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf._pipeline - /tmp/ocrmypdf.io.i3_3_8x0/optimize.pdf -> /tmp/output_3809299846436271993.pdf

Additional Information

No response

Browsers Affected

Firefox, Chrome, Other

No Duplicate of the Issue

  • [X] I have verified that there are no existing issues raised related to my problem.

VilterPD avatar Aug 09 '24 10:08 VilterPD

/home/phil/tessdata:/data/tessdata

Your path is wrong. All our docs show it as /usr/share/tessdata.

Frooodle avatar Aug 09 '24 10:08 Frooodle

That was fast. Thanks.

I moved it over to /usr/share/tessdata, with the same result. Running it manually works (docker exec -it 7a45c9990406 tesseract /data/Briefpapier.png output -l deu).

But it doesn't show up in the GUI.

PS: I restarted the stack, of course

VilterPD avatar Aug 09 '24 10:08 VilterPD

Can you share a screenshot of the /usr/share/tessdata directory inside the Docker container? I want to know its contents.

Frooodle avatar Aug 09 '24 10:08 Frooodle

Sure, thanks for looking into it.

Screenshot 2024-08-12 114426

Edit: Sorry, you wrote inside the container:

Screenshot 2024-08-12 123501

VilterPD avatar Aug 12 '24 09:08 VilterPD

Ok, so the answer is:

I had tessdata set to the wrong folder inside the container.

It has to be in /usr/share/tessdata inside the container, which I didn't get into my thick skull. After moving it there, everything works fine.
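
For anyone landing here with the same symptom, the fix described above might look roughly like the following in the compose file. This is only a sketch: it reuses the host path and volume names from the original config and changes just the container-side mount target. It assumes the host directory contains every language file that is needed (here deu and eng), since a bind mount hides whatever the image ships in that directory. Whether TESSDATA_PREFIX has to be changed or removed as well is not stated in this thread, so it is left out here.

```yaml
services:
  stirling-pdf:
    image: frooodle/s-pdf:latest
    volumes:
      - stirling:/data
      # Container-side path changed from /data/tessdata to /usr/share/tessdata,
      # the location the maintainer says the docs use.
      - /home/phil/tessdata:/usr/share/tessdata
volumes:
  stirling:
```

After restarting the stack, listing /usr/share/tessdata inside the container should show the same .traineddata files as the host directory.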

Thank you.

VilterPD avatar Aug 12 '24 10:08 VilterPD