
Consumer stops work on blank page/failed language detection

Open wiertz opened this issue 4 years ago • 3 comments

When I scan a multi-page document that includes a blank page, the consumer throws the following error and does not add the document to the database:

PARSE FAILURE for /consume/test_xp2.pdf: Language detection failed. Set PAPERLESS_FORGIVING_OCR in config file to continue anyway.

I am running paperless via Docker on a MacBook and added PAPERLESS_FORGIVING_OCR="true" to the docker-compose.env file, without success. Ideally, if language detection fails, I would expect paperless to proceed and add the document to the collection without OCRed text. Perhaps I am still doing something wrong? Thanks for any hints!

wiertz, Apr 23 '20 10:04

I dug into the code a little to narrow down the problem. The error is raised in the following section of parsers.py:

        if not guessed_language or guessed_language not in ISO639:
            self.log("warning", "Language detection failed!")
            if settings.FORGIVING_OCR:
                self.log(
                    "warning",
                    "As FORGIVING_OCR is enabled, we're going to make the "
                    "best with what we have."
                )
                raw_text = self._assemble_ocr_sections(imgs, middle, raw_text)
                return raw_text
            error_msg = ("Language detection failed. Set "
                         "PAPERLESS_FORGIVING_OCR in config file to continue "
                         "anyway.")
            raise OCRError(error_msg)

  • I checked the docker-compose.env file and the environment of the running docker instance. In both cases, PAPERLESS_FORGIVING_OCR is set to "true".
  • I checked the value of settings.FORGIVING_OCR just before the section of code above: it is False.
  • Hard-coding settings.FORGIVING_OCR to True leads to successful consumption of the document with empty pages.
  • It thus seems that the variables in docker-compose.env make it into the instance's environment, but not into the settings variable. Any ideas on why?
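
For anyone repeating this check, a minimal sketch of how to inspect the raw value as the Python process sees it, for example from a shell inside the container. The helper name describe_env_value is hypothetical and not part of paperless; it only prints the repr of the variable so that stray characters become visible.

```python
import os


def describe_env_value(name, environ=None):
    """Report how an environment variable actually arrives in the process.

    Hypothetical debugging helper, not part of paperless.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return f"{name} is not set"
    # Flag values that are wrapped in a matching pair of quote characters.
    quoted = len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "'\""
    note = " (value includes literal quote characters)" if quoted else ""
    # repr() makes quotes and whitespace inside the value visible.
    return f"{name} = {raw!r}{note}"


if __name__ == "__main__":
    print(describe_env_value("PAPERLESS_FORGIVING_OCR"))
```

Comparing this output with the value that ends up in the settings object shows whether the variable is missing or merely parsed differently than expected.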

wiertz, Apr 23 '20 20:04

OK, I found the issue: I copied entries from paperless.conf.example to docker-compose.env, but overlooked that boolean values are in quotes in the former and unquoted in the latter. It works for me now, of course, but it may be worth addressing this pitfall in one of the following ways:

  • Clearly point this out in docker-compose.env, where it says "In addition to what you see here, you can also define any values you find in paperless.conf.example here."
  • Use one consistent way to define booleans in config files.
  • Make parsing the settings tolerant of both ways.

The first is certainly the easiest; the third is likely the most robust.
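
A minimal sketch of the third option, assuming (as this thread suggests) that the quotes from docker-compose.env arrive as literal characters in the environment value. The name parse_bool_setting and the accepted spellings are illustrative assumptions, not the actual paperless settings code.

```python
TRUE_VALUES = {"1", "y", "yes", "t", "true", "on"}
FALSE_VALUES = {"0", "n", "no", "f", "false", "off"}


def parse_bool_setting(raw, default=False):
    """Interpret a config value as a boolean, tolerating surrounding quotes.

    Hypothetical helper for illustration only.
    """
    if raw is None:
        return default
    value = raw.strip()
    # Strip one matching pair of single or double quotes, if present.
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        value = value[1:-1].strip()
    value = value.lower()
    if value == "":
        return default
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    # Fail loudly instead of silently treating a typo as False.
    raise ValueError(f"Cannot interpret {raw!r} as a boolean")
```

With a parser along these lines, both PAPERLESS_FORGIVING_OCR=true and PAPERLESS_FORGIVING_OCR="true" would enable the setting, and an unrecognized value would raise an error rather than silently evaluating to False.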

wiertz, Apr 24 '20 17:04

Thank you so much for this writeup, as I was experiencing this exact issue, including seeing that the setting in the .env file wasn't being passed along to the container. Clearly, a quick comment in the env file would help, even if it's not strictly the best long-term solution.

JasonSanDiego, Nov 14 '20 18:11