archive-pdf-tools

pillow is not working properly

Open Redsandro opened this issue 3 years ago • 27 comments

Using -J pillow results in terrible images. It looks like the image is resampled 4 to 1.

recode_pdf -v --dpi 300 \
  -J pillow \
  -I in.png --hocr-file in.hocr -o out-pillow.pdf

Here is the -J pillow foreground layer: pillow

For comparison, here is -J kakadu: kakadu

The resulting files are approximately the same size. Is pillow really absurdly bad, or does it need different compression parameters? I wanted to try this out, but recode_pdf doesn't like the documented compression-flags and will throw an error:

recode_pdf -v --dpi 300 \
  --fg-compression-flags ' -r 750' \
  -J pillow \
  -I in.png --hocr-file in.hocr -o out-pillow-r750.pdf

  File "internetarchivepdf/jpeg2000.py", line 188, in _jpeg2000_pillow_str_to_kwargs
    k, v = en.split(':', maxsplit=1)
ValueError: not enough values to unpack (expected 2, got 1)

Additional info

Linux Mint 20.2 AKA Ubuntu 20.04.3

Test scan to experiment with

test_1.png.zip

Suggested actionables

  • [x] Use sane defaults for pillow so quality is reasonable.
  • [ ] Show clear distinct error message so user doesn't get ambiguous ValueError when following the docs.
  • [ ] Update documentation with Pillow compression flags.

Redsandro avatar Feb 20 '22 16:02 Redsandro

What version did you try this with? I recently updated some of the compression parameters to be more in line with the kakadu ones. Could you retry with the latest version?

Pillow should be the same as openjpeg.

MerlijnWajer avatar Apr 03 '22 07:04 MerlijnWajer

I think Kakadu is doing a better job adapting to the input images, at least with my default parameters. It's just a standard reduction, whereas I think Kakadu might do something more clever. You could experiment with other values like the -q flags, instead of -r.

MerlijnWajer avatar May 02 '22 18:05 MerlijnWajer

If you use the build from issue #41 you could toy around some with it, but I tried again to use a single value for q and it ends up real ugly at the same filesize as kakadu. I agree the foreground layer could be better - but does it make a big difference in the mrc-combined final result?

MerlijnWajer avatar May 02 '22 18:05 MerlijnWajer

Thanks for the tip. I will experiment more with the latest version when I have a moment and let you know, although the previous reduction I observed does not make me confident there will be any interesting results.

does it make a big difference in the mrc-combined final result?

You mean optically speaking, right?

I will get back to you.

Redsandro avatar May 02 '22 20:05 Redsandro

Right, I meant whether the resulting PDF optically looks much worse. kakadu definitely seems to be better, but there are probably ways to make OpenJPEG better; I just haven't invested a lot of time in trying all the different knobs.

MerlijnWajer avatar May 02 '22 20:05 MerlijnWajer

I tried using internetarchivepdf 1.4.13 and verified that -J pillow looks very bad by default. Not 'a bit worse', but extremely bad. The mask makes the text readable, but the colors are smudged. There is hardly any high frequency data at all.

If it was simply super compressed, there would be a use case somewhere for someone, but it has about the same compression ratio as kakadu so it makes you wonder: How does pillow waste so much space if it doesn't show any detail beyond low frequency smudges?

The initially reported problem still exists, so I cannot use -r or experiment with -q.

recode_pdf doesn't like the documented compression-flags and will throw an error:

recode_pdf -v --dpi 300 \
  --fg-compression-flags ' -r 750' \
  -J pillow \
  -I in.png --hocr-file in.hocr -o out-pillow-r750.pdf

  File "internetarchivepdf/jpeg2000.py", line 188, in _jpeg2000_pillow_str_to_kwargs
    k, v = en.split(':', maxsplit=1)
ValueError: not enough values to unpack (expected 2, got 1)

The error message does not give me a clue about the problem. I'm using different variants, but adding the space is what the documentation suggests.

Do you get similar results or is it just me? In case of the former, if pillow can be tweaked to look half decent, I would suggest adding some pillow-specific defaults. If not, I'd give pillow a label or warning message: "Bad quality, for testing purposes only."

Redsandro avatar May 04 '22 18:05 Redsandro

If not, I'd give pillow a label or warning message: "Bad quality, for testing purposes only."

Can you please state which version of Pillow you are using?

python3 -m pip show pillow |grep Version

If recent versions of Pillow do not provide reasonable JP2 quality, perhaps someone should file an issue requesting that they improve their encoder?

mara004 avatar May 04 '22 18:05 mara004

I think Pillow uses OpenJPEG so that might not help. I think we can get better quality with Pillow/OpenJPEG and Grok, but I just didn't invest the time in trying to find the right flags. Maybe see what happens with multi-layer encoding, as the help options also suggest?

MerlijnWajer avatar May 04 '22 18:05 MerlijnWajer

Can you please state which version of Pillow you are using?

$ python3 -m pip show pillow | grep Version
Version: 8.3.2

I think Pillow uses OpenJPEG so that might not help.

Once I get #41 working I can do some comparisons.

I think we can get better quality with Pillow/OpenJPEG and Grok

I was interested in Grok because it sounds promising, but I couldn't get Grok to build or install on Ubuntu, so that's the one I haven't tried yet.

Maybe see what happens with multi-layer encoding, as the help options also suggest?

Could you show me where exactly I can read about this?

Redsandro avatar May 04 '22 19:05 Redsandro

I was thinking of this:

-r <compression ratio>,<compression ratio>,...
    Different compression ratios for successive layers.
    The rate specified for each quality level is the desired
    compression factor (use 1 for lossless)
    Decreasing ratios required.
      Example: -r 20,10,1 means
            quality layer 1: compress 20x,
            quality layer 2: compress 10x
            quality layer 3: compress lossless
    Options -r and -q cannot be used together.
-q <psnr value>,<psnr value>,<psnr value>,...
    Different psnr for successive layers (-q 30,40,50).
    Increasing PSNR values required, except 0 which can
    be used for the last layer to indicate it is lossless.
    Options -r and -q cannot be used together.
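
The ordering rules in the help text above can be checked before a flag string is handed to an encoder. The following is only a minimal sketch: the function names are hypothetical, and it assumes that "decreasing" and "increasing" mean strictly so, which the help text does not spell out.

```python
from typing import List


def parse_layers(spec: str) -> List[float]:
    """Turn a comma-separated layer list such as '20,10,1' into numbers."""
    return [float(part) for part in spec.split(",")]


def check_ratio_layers(ratios: List[float]) -> bool:
    """-r style: every layer must use a smaller ratio than the one before.

    Assumption: "decreasing" is read as strictly decreasing.
    """
    return all(a > b for a, b in zip(ratios, ratios[1:]))


def check_psnr_layers(psnr: List[float]) -> bool:
    """-q style: values must increase; a trailing 0 (lossless) is allowed.

    Assumption: "increasing" is read as strictly increasing.
    """
    # Drop a trailing 0 before comparing, since it marks a lossless last layer.
    body = psnr[:-1] if len(psnr) > 1 and psnr[-1] == 0 else psnr
    return all(a < b for a, b in zip(body, body[1:]))
```

For example, `check_ratio_layers(parse_layers("20,10,1"))` accepts the layer list from the help text, while a list such as `10,20` is rejected.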

MerlijnWajer avatar May 04 '22 19:05 MerlijnWajer

The error message does not give me a clue about the problem. I'm using different variants, but adding the space is what the documentation suggests.

Right, so the flags for Pillow are unfortunately different. For Pillow you can do this:

quality_mode:"rates";quality_layers:[500]
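
The traceback in the original report is consistent with a parser that splits the flag string on ';' and then on ':'. A minimal sketch of such a parser follows; the function name is hypothetical and the use of JSON for the values is an assumption, so this is not necessarily what internetarchivepdf does internally.

```python
import json
from typing import Any, Dict


def pillow_flags_to_kwargs(flags: str) -> Dict[str, Any]:
    """Parse 'key:value;key:value' into keyword arguments (hypothetical helper)."""
    kwargs: Dict[str, Any] = {}
    for entry in flags.split(";"):
        # An entry without ':' (for example ' -r 750') makes this unpacking
        # raise "ValueError: not enough values to unpack (expected 2, got 1)".
        key, value = entry.split(":", maxsplit=1)
        kwargs[key] = json.loads(value)  # assumption: values are JSON literals
    return kwargs
```

With this sketch, the flag string above yields `{'quality_mode': 'rates', 'quality_layers': [500]}`, while an OpenJPEG-style string such as ' -r 750' fails with the same kind of ValueError as in the report.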

MerlijnWajer avatar May 04 '22 19:05 MerlijnWajer

You can see all the supported flags here: https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html#jpeg-2000

MerlijnWajer avatar May 04 '22 20:05 MerlijnWajer

Thank you @MerlijnWajer this helps.

Right, so the flags for Pillow are unfortunately different. For Pillow you can do this: quality_mode:"rates";quality_layers:[500]

It works! I'm not getting the error. No spaces allowed. So to address the second part of the initial issue, perhaps you can catch ValueError for all implementation dependent compression flags, and output an error message something like this:

Invalid compression flags for {implementation}.
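
One way to produce such a message is to wrap the parsing step and re-raise with the implementation name. This is only a sketch under assumptions: `CompressionFlagError` and `parse_pillow_flags` are hypothetical names, and the 'key:value;key:value' parsing with JSON values is assumed rather than taken from the project.

```python
import json
from typing import Any, Dict


class CompressionFlagError(ValueError):
    """Hypothetical error type for malformed compression flags."""


def parse_pillow_flags(flags: str, implementation: str = "pillow") -> Dict[str, Any]:
    """Parse 'key:value;key:value' and report bad input with a clear message."""
    try:
        kwargs: Dict[str, Any] = {}
        for entry in flags.split(";"):
            key, value = entry.split(":", maxsplit=1)
            kwargs[key] = json.loads(value)  # assumption: values are JSON literals
        return kwargs
    except ValueError as exc:
        # Covers both the unpacking failure and json.JSONDecodeError,
        # which is a subclass of ValueError.
        raise CompressionFlagError(
            f"Invalid compression flags for {implementation}: {flags!r}"
        ) from exc
```

Chaining with `from exc` keeps the original traceback available for debugging while the user sees the clearer message first.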

Turns out pillow is just really quite bad at lower quality settings but cleans up with some better quality. To me it becomes acceptable at around 220:

recode_pdf -v --dpi 300 -J pillow \
  --fg-compression-flags 'quality_layers:[220]' \
  -I in.png --hocr-file in.hocr -o out-pillow-r220.pdf

So to address the first part of the initial issue, you could set these as the default fg flags if the user doesn't set otherwise, so users won't think it's broken like I did. :sweat_smile:

pillow default: :-1: image

pillow quality_layers:[220]: :+1: image

kakadu default: :+1: image

Redsandro avatar May 05 '22 19:05 Redsandro

I wanted to do a simple PR for the help output, but I'm not sure how, so I've added a third checkbox to the initial issue instead.

Right now recode_pdf tells us:

Default for kakadu is '-slope 44250',default for grok/openjpeg is '-r 500'. Pass with quoted and with a space at the start:' --flag value'

There is no space at the start in the examples, and with pillow the space causes the error. I think this text is outdated.

Suggestion:

Defaults are kakadu: '-slope 44250'; grok/openjpeg: '-r 500'; pillow: 'quality_layers:[220]'. Pass with quotes.

Redsandro avatar May 05 '22 19:05 Redsandro

Right, pillow flags aren't documented there and those should not start with a space. The thing with the space is that if you do something like --bg-compression-flags '--this-flag', then Python starts parsing --this-flag as a flag. That's why the quotes with a space are required - there's no easy way around that unfortunately.

Regarding the default pillow/openjpeg flags, could you compare the filesizes? My suspicion is that now the resulting PDFs will be quite a bit larger than the kakadu ones. I tried to have similar file sizes, rather than similar quality (which I agree might not have been the best idea).

MerlijnWajer avatar May 05 '22 21:05 MerlijnWajer

if you do something like --bg-compression-flags '--this-flag', then Python starts parsing --this-flag as a flag. That's why the quotes with a space are required

Oh now I get it! It's exclusively for double dashes. That's why -slope and -r work fine and are documented without a space, making the instructions unclear for people who didn't know this (such as myself).

Regarding the default pillow/openjpeg flags, could you compare the filesizes?

Yes you are correct, 145kb kakadu size vs 210kb pillow size. I understand the rationale for targeting the same size. It's just that pillow doesn't perform acceptably at such low quality, so without usable defaults the user will always have to figure out how to change the default.

Redsandro avatar May 06 '22 01:05 Redsandro

Yes you are correct, 145kb kakadu size vs 210kb pillow size. I understand the rationale for targeting the same size. It's just that pillow doesn't perform acceptably at such low quality, so without usable defaults the user will always have to figure out how to change the default.

Ok, that is fair enough, I guess that's a sensible reasoning. Reminds me again that maybe having some "compression profiles" makes sense, so like:

  • standard: where kakadu/pillow/openjpeg look the same, but do not have the same file sizes
  • kakadu-roi-standard: as above, but kakadu only, with roi
  • aggressive: really aggressive compression
  • quality: quality over compression (mostly)

And there could also be profiles for specific content, like:

  • books
  • comicbooks
  • scanned film material
  • etc

MerlijnWajer avatar May 06 '22 07:05 MerlijnWajer

Oh now I get it! It's exclusively for double dashes. That's why -slope and -r work fine and are documented without a space, making the instructions unclear for people who didn't know this (such as myself).

https://stackoverflow.com/questions/16174992/cant-get-argparse-to-read-quoted-string-with-dashes-in-it

Rereading this thread, I think the better solution is to use --bg-compression-flags='--foo'

I will test this, update the documentation, and remove the space stripping hack.
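
The difference between the two spellings can be reproduced with a small argparse sketch. The parser below is a stand-in with a single option, not the real recode_pdf parser, and `recode_pdf_sketch` is a made-up program name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Stand-in parser with one option (not the real recode_pdf parser)."""
    parser = argparse.ArgumentParser(prog="recode_pdf_sketch")
    parser.add_argument("--bg-compression-flags", default=None)
    return parser


parser = build_parser()

# Joined with '=': the value is taken verbatim, leading dashes included.
args = parser.parse_args(["--bg-compression-flags=--foo"])
print(args.bg_compression_flags)

# Passed as a separate token: argparse reads '--foo' as another option,
# complains that --bg-compression-flags expects one argument, and exits.
try:
    parser.parse_args(["--bg-compression-flags", "--foo"])
except SystemExit as exc:
    print("argparse exited with status", exc.code)
```

With the `=` form there is no need for a leading space inside the quotes, so no space-stripping step is required afterwards.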

MerlijnWajer avatar May 06 '22 08:05 MerlijnWajer

Yeah, that works:

recode_pdf --dpi 300 --bg-compression-flags='-q 25' --fg-compression-flags='-q 26' -J openjpeg -I /tmp/in.png --hocr-file /tmp/in.hocr -o /tmp/out-openjpeg.pdf

MerlijnWajer avatar May 06 '22 08:05 MerlijnWajer

Reminds me again that maybe having some "compression profiles" makes sense

Absolutely. It saves users a lot of time. When the h264 encoder got presets (fast, slow etc) and content profiles (grainy film, cartoon etc) it became a lot more pleasurable to use.

It takes a lot of effort to set up, though. So if the default makes sense, that's a good start. You may want to take time to hone presets and keep them undocumented until you are happy with the result.

Redsandro avatar May 06 '22 20:05 Redsandro

I have created an issue for this feature request https://github.com/internetarchive/archive-pdf-tools/issues/48

If I add the flags that you recommend, then I think we can close this bug, right?

Maybe we should ask for help on the openjpeg mailing list - they might have some tips/advice.

MerlijnWajer avatar May 07 '22 12:05 MerlijnWajer

I have created an issue for this feature request #48

If I add the flags that you recommend, then I think we can close this bug, right?

Here are my recommended tasks from the original issue. Feel free to close this issue.

Suggested actionables

  • [x] Use sane defaults for pillow so quality is reasonable. (https://github.com/internetarchive/archive-pdf-tools/issues/48)
  • [ ] Show clear distinct error message so user doesn't get ambiguous ValueError when following the docs.
  • [ ] Update documentation with Pillow compression flags. (https://github.com/internetarchive/archive-pdf-tools/issues/42#issuecomment-1118989445)

(with "documentation" I actually meant recode_pdf --help)

Redsandro avatar May 07 '22 14:05 Redsandro

@MerlijnWajer is it possible to re-recode PDFs that were done using pillow with kakadu? Not that it would increase quality, but the images are a lot bigger while at the same time so terrible that the only thing that saves them is the mask.

I think re-doing them with kakadu may remove half the filesize with minimal to no quality loss since the images are already so blurry.

Redsandro avatar May 25 '22 13:05 Redsandro

You could try to render them to a page (combining the MRC into a normal page), and then recompressing them. I don't have a tool to do this exact thing, but it should not be too hard with pymupdf. Maybe mutool can just render the final pages to images, and then you can try to recompress them.

MerlijnWajer avatar May 26 '22 14:05 MerlijnWajer

Thank you for pointing me in the right direction. I think I should keep the mask as generated by recode-pdf , and just re-encode the pillow jp2 with kakadu externally. PyMuPDF can indeed do just that: PyMuPDF-Utilities/image-replacement. It's slightly more complicated than it sounds though. I need to read up on the xrefs to understand what the example is doing, or find a tool that automates the xref business so I can just take care of re-encoding all jp2 images, preferably including some way to distinguish between what recode-pdf intended to be foreground and background.

Redsandro avatar May 26 '22 15:05 Redsandro

is it possible to re-recode PDFs that were done using pillow with kakadu? Not that it would increase quality, but the images are a lot bigger while at the same time so terrible that the only thing that saves them is the mask.

Provided you still have the original input data, wouldn't it be better to use that directly to avoid additional quality loss?

mara004 avatar May 26 '22 16:05 mara004

Provided you still have the original input data, wouldn't it be better to use that directly to avoid additional quality loss?

Absolutely, always, 100%.

But the original data is gone and so is the analog paper. It's just that the pillow images are so inconceivably bad compared to their file size, which is around 150kb per page. I think those jp2 can be re-compressed with kakadu at a third their size with hardly any quality loss because pillow turned them into blurry smudges.

Maybe the end result is 85 kb versus 150 kb but it adds up if you have scanned and destroyed a lot of material before realizing kakadu was not used by default when available.

Redsandro avatar May 26 '22 16:05 Redsandro