Error Generating Preview Images for PDF/A Files in docspell 0.41.0

Open ElektroCoder opened this issue 1 year ago • 8 comments

Hello,

I encountered an issue with processing PDF/A files in docspell version 0.41.0 on Debian 12. Attempting to generate preview images results in an error specifically for colored PDF/A files, whereas standard PDF files are processed without any issues. Here are the relevant log entries:

    [...]
    Sun, 4 February 2024, 19:45: Creating preview images for 1 files…
    Sun, 4 February 2024, 19:45: Creating preview images failed, continuing without it.: LCMS error 13: Mismatched alpha channels
    Sun, 4 February 2024, 19:45: Retrieving page count for 1 files…
    [...]

I would greatly appreciate any assistance or suggestions on how to resolve this issue.

ElektroCoder avatar Feb 04 '24 19:02 ElektroCoder

Hi @ElektroCoder I probably need such a pdf to check it on my side. Do you perhaps have some test file without sensitive stuff? Do you know if the same file works in the/a previous version?

eikek avatar Feb 04 '24 22:02 eikek

Hi,

Thank you for getting back to me on this. It’s no problem at all; I can provide a test file without any sensitive information. Regarding your question, I’ve never encountered any issues with version 0.40. The files are copied over from a document scanner to opt/docs via a Samba share. I’ve noticed that when I save the files as PDF/A, no preview is generated. However, if I adjust the scanner settings to save them as standard PDFs, the preview works fine.

I’ll follow up with more information and possibly a test file by tomorrow around 4:00 PM—I’m already in bed for the night. :)

ElektroCoder avatar Feb 04 '24 23:02 ElektroCoder

Hi, oh sure, there is absolutely no rush. Just take your time - however long that may take.

eikek avatar Feb 05 '24 07:02 eikek

Hi,

sorry for the delay. I took some time to retest things after double-checking my AMD GPU drivers on Debian and reinstalling Docker and Docspell. I've got two PDFs for you, both scanned with a Brother ADS 2400N scanner. One is in (not working) PDF/A format and the other in standard PDF format. They were saved via a Samba share, which has been working smoothly.

I've never had any issues with Docspell 0.40.0 before. However, I recently upgraded my hardware from an old A3000 CPU to an AMD 5600G CPU, and I'm running everything on a Debian 12 terminal server. The import process log has this entry:

    [...]
    Sun, February 11th, 2024, 10:32: Updating SOLR index
    Sun, February 11th, 2024, 10:32: Text extraction finished in 46630 ms.
    Sun, February 11th, 2024, 10:32: Creating preview images for 1 files…
    Sun, February 11th, 2024, 10:32: Creating preview images failed, continuing without it.: LCMS error 13: Mismatched alpha channels
    Sun, February 11th, 2024, 10:32: Retrieving page count for 1 files…
    Sun, February 11th, 2024, 10:32: Found number of pages: 2
    [...]

I'll include the log files as text files. I'm not sure what's causing the problem; everything seems to be functioning fine, and Portainer isn't showing any entries in the container logs.

Thanks for your help in advance.


failed_Scan_20240211_113131_004873.pdf log_004873_failedPreview_Brother_ADS-2400N_PDF-A.txt log_004875_workingPreview_Brother_ADS-2400N_PDF.txt ok_Scan_20240211_113214_004875.pdf

ElektroCoder avatar Feb 11 '24 10:02 ElektroCoder

I just got the same error in generating the preview for a file. I'm running docspell inside Kubernetes, but I don't think that's the issue.

TheAnachronism avatar Feb 15 '24 16:02 TheAnachronism

I also get this a bit before the preview fails:

Thu, February 15th, 2024, 16:45: PDF conversion failed: Command result=3. No output file found.. Go without PDF file
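A minimal sketch for decoding that status, assuming the failing command is ocrmypdf (the PDF conversion step discussed later in this thread) and that the installed version uses the exit codes listed in the OCRmyPDF documentation. The function name `ocrmypdf_exit_name` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: map an OCRmyPDF exit status to its documented name.
# Assumption: the "Command result=N" value in the joex log is ocrmypdf's exit status.
ocrmypdf_exit_name() {
  case "$1" in
    0)   echo "ok" ;;
    1)   echo "bad_args" ;;
    2)   echo "input_file" ;;
    3)   echo "missing_dependency" ;;
    4)   echo "invalid_output_pdf" ;;
    5)   echo "file_access_error" ;;
    6)   echo "already_done_ocr" ;;
    7)   echo "child_process_error" ;;
    8)   echo "encrypted_pdf" ;;
    9)   echo "invalid_config" ;;
    10)  echo "pdfa_conversion_failed" ;;
    15)  echo "other_error" ;;
    130) echo "ctrl_c" ;;
    *)   echo "unknown" ;;
  esac
}

# Decode the status seen in the log line above.
ocrmypdf_exit_name 3
```

If that assumption holds, a result of 3 would point at a missing or outdated external tool in the environment rather than at the document itself.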

TheAnachronism avatar Feb 15 '24 16:02 TheAnachronism

Hi @ElektroCoder @TheAnachronism

I read your output and also noticed that in the log:

    Sun, February 11th, 2024, 11:28: PDF conversion failed: Command result=3. No output file found.. Go without PDF file

I see that the filenames of your working preview have PDF in the file name, and the failed preview has PDF/A in the file name.

This tells me that potentially PDF/A conversion is the culprit here.

Could you try the following? For scanning this PDF, let's try editing your ocrmypdf configuration a bit. In the /etc/docspell-joex/docspell-joex.conf config, try adding "--output-type", "pdf", to the options (this should come after --skip-text) and then restart docspell-joex.

    # The `--skip-text` option is necessary to not fail on "text" pdfs
    # (where ocr is not necessary). In this case, the pdf will be
    # converted to PDF/A.
    ocrmypdf = {
      enabled = true
      command = {
        program = "ocrmypdf"
        args = [
          "-l", "{{lang}}",
          "--skip-text",
          "--deskew",
          "--output-type", "pdf",
          "-j", "1",
          "{{infile}}",
          "{{outfile}}"
        ]

After editing so it appears similar to the excerpt above, restart docspell-joex.

sudo systemctl restart docspell-joex or use equivalent commands on docker.

Try reprocessing (delete the failed one, and any intermediary or cached files created from scanning in the original document) and send the log over?
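To compare the two output types outside Docspell, the conversion can also be run by hand. A rough sketch: the flags mirror the args list in the excerpt, `eng` stands in for `{{lang}}`, and the function name and file names are placeholders, not anything Docspell provides.

```shell
#!/bin/sh
# Hypothetical helper: print the ocrmypdf command line for one output type.
# $1 = output type ("pdfa" is ocrmypdf's default, "pdf" is the suggested change)
# $2 = input file, $3 = output file (placeholder names, avoid spaces here)
ocrmypdf_cmd() {
  printf 'ocrmypdf -l eng --skip-text --deskew --output-type %s -j 1 %s %s\n' \
    "$1" "$2" "$3"
}

# Dry run: show both variants so they can be run one after the other.
ocrmypdf_cmd pdfa failed_scan.pdf out_pdfa.pdf
ocrmypdf_cmd pdf failed_scan.pdf out_pdf.pdf
```

Running each printed line in a terminal and checking `echo $?` afterwards shows whether only the PDF/A variant fails on the affected file.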

It would be good to know if using "--output-type", "pdf", was a better default than PDF/A. @eikek potentially similar to issue #2504 for affected PDFs.

PDF/A is meant to be archived as is, so even though it's counterintuitive since we want to manage documents, converting to raw PDF for processing may be better for Docspell.

tenpai-git avatar Feb 21 '24 11:02 tenpai-git

Hey guys, maybe try upgrading to the nightly version (0.42.0)? I don't use SOLR, I am using PostgreSQL, but my previews were not generating on certain things also.

I tried upgrading to nightly on a whim, and that resolved it for me. Perhaps there is a dependency issue of some kind.

Curious to see if the other test suggested works out for you as well. Adding "--output-type", "pdf", as previously described fixed things in a lot of pdfs I was working with, including previews.

tenpai-git avatar Feb 22 '24 15:02 tenpai-git

Hi! I wonder if that issue is also related to #2504 (as mentioned already by @tenpai-git above). The docker images have been updated (sadly reusing the same tags as before) - maybe you could give them a try?

eikek avatar Mar 02 '24 22:03 eikek

@ElektroCoder I tested your "failed scan" document quickly on my 0.39.0 installation. It was all good. I have a preview and can select text in the converted pdf. I would assume for now some tooling problems, because I don't recall any changes in code from that version to 0.41.0 in that area. (I'm not using the docker images)

eikek avatar Mar 02 '24 22:03 eikek

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. This only applies to 'question' issues. Always feel free to reopen or create new issues. Thank you!

github-actions[bot] avatar Apr 11 '24 02:04 github-actions[bot]

I just hit the same problem, and this workaround ^1 fixes it - simply add -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true as a JVM option.
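One possible way to pass that flag, sketched under the assumption that the docspell-joex start script picks up JVM options from the `JAVA_OPTS` environment variable (not confirmed in this thread) and that joex is the process rendering the previews:

```shell
#!/bin/sh
# Append the PDFBox system property to JAVA_OPTS, keeping any options already set.
# Assumption: the docspell-joex launcher reads JAVA_OPTS when it starts the JVM.
PDFBOX_FLAG="-Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true"

case " ${JAVA_OPTS:-} " in
  *" $PDFBOX_FLAG "*) ;;  # flag already present, leave JAVA_OPTS untouched
  *) JAVA_OPTS="${JAVA_OPTS:+$JAVA_OPTS }$PDFBOX_FLAG" ;;
esac
export JAVA_OPTS

# Show the resulting value before (re)starting joex from this shell.
echo "JAVA_OPTS=$JAVA_OPTS"
```

With Docker, the equivalent would presumably be setting `JAVA_OPTS` as an environment variable on the joex container and recreating it.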

vs49688 avatar May 19 '24 09:05 vs49688