Input PDF on which using Multivalent makes the output much smaller

Open rbrito opened this issue 8 years ago • 3 comments

Dear @pts,

You previously asked me to send you files where using Multivalent makes a significant difference in size, and I just found one where it does, considerably.

I will send you the file privately, but here are the sizes of the processed files:

$ ls -lgo *.pdf
-rw-r--r-- 1 15215702 Oct  5 15:09 numerical-analysis.pdf
-rw-r--r-- 1 12421574 Oct 10 16:40 numerical-analysis.pso.pdf
-rw-r--r-- 1  7553461 Oct 10 16:40 numerical-analysis.psom.pdf

Note that the file with Multivalent is about 39% smaller than the one generated by pdfsizeopt alone (and about half the size of the original input), which is a bit unusual nowadays (files with only pdfsizeopt have started "winning" in terms of size, which is a testament to the quality of your tool!).
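
For reference, the ratios behind these numbers can be checked with a few lines of Python. This is only a minimal sketch: the byte counts are copied from the listing above, and the helper and variable names are made up for illustration.

```python
# Byte counts copied from the `ls -lgo` listing above.
sizes = {
    "numerical-analysis.pdf": 15215702,       # original input
    "numerical-analysis.pso.pdf": 12421574,   # pdfsizeopt alone
    "numerical-analysis.psom.pdf": 7553461,   # pdfsizeopt + Multivalent
}


def percent_of(part, whole):
    """Return `part` as a percentage of `whole`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


original = sizes["numerical-analysis.pdf"]
pso = sizes["numerical-analysis.pso.pdf"]
psom = sizes["numerical-analysis.psom.pdf"]

pso_vs_input = percent_of(pso, original)    # pdfsizeopt alone vs. input
psom_vs_input = percent_of(psom, original)  # with Multivalent vs. input
psom_vs_pso = percent_of(psom, pso)         # with Multivalent vs. pdfsizeopt alone

print(pso_vs_input, psom_vs_input, psom_vs_pso)  # 81.6 49.6 60.8
```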

Thanks,

Rogério.

rbrito, Oct 11 '17 03:10

Thank you! Please share more files like this as you find them! They highlight an improvement (extension) opportunity for pdfsizeopt.

Please note that pdfsizeopt alone (without Multivalent) cannot win over pdfsizeopt+Multivalent, because the optimizations done by Multivalent don't increase the file size. The worst case is that running Multivalent after pdfsizeopt yields very little additional gain; there won't be a blowup. (There may be some occasional exceptions; it would be interesting to find such PDF files.)

This white paper mentions some optimizations Multivalent does (http://multivalent.sourceforge.net/Research/TwoDietPlans.pdf):

  • detects and eliminates duplicate objects. pdfsizeopt also does this.
  • recodes LZW to Flate. pdfsizeopt may not be doing this by default, but it can be done easily, and it should be enabled by default.
  • strips off ASCII encoding. pdfsizeopt may not be doing this by default, but it can be done easily, and it should be enabled by default.
  • collects objects into PDF 1.5 object streams in groups of 200, which are then compressed with Flate. pdfsizeopt also does this (and even more).
  • writes cross-reference table as a compressed cross-reference stream. pdfsizeopt also does this.
  • writes objects in compact syntax. pdfsizeopt also does this.
  • removes old versions of objects. pdfsizeopt does this most of the time (probably always); we need a file with old objects to confirm.
  • removes obsolete objects such as thumbnails and ProcSet. pdfsizeopt doesn't do this yet.
  • inlines small objects such as stream lengths. pdfsizeopt doesn't do this yet.
  • reference counts objects and eliminates unused objects, such as single-use objects that were inlined. pdfsizeopt also does this.
  • omits default values. pdfsizeopt doesn't do this yet, and it doesn't remove unknown keys either.
  • shrinks gaps in cross-reference table due to duplicate, inlined or deleted objects. I'm not sure what this means; pdfsizeopt is probably doing something equivalent, because it builds a small cross-reference table from scratch.
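
To illustrate the "strips off ASCII encoding" item in the list above, here is a minimal Python sketch of the idea, not the way Multivalent or pdfsizeopt actually implement it. It assumes the stream data is ASCII85-encoded and terminated by the `~>` end-of-data marker; the function name `strip_ascii85_and_flate` is hypothetical. A real tool would also have to update the stream's /Filter and /Length entries, which this sketch ignores.

```python
import base64
import zlib


def strip_ascii85_and_flate(stream_data: bytes) -> bytes:
    """Decode ASCII85 stream data (ending in ~>) and recompress it with Flate."""
    data = stream_data.strip()
    if not data.endswith(b"~>"):
        raise ValueError("ASCII85 data must end with the ~> end-of-data marker")
    raw = base64.a85decode(data[:-2])  # drop the ~> marker before decoding
    return zlib.compress(raw, 9)       # zlib output is what /FlateDecode expects


# Made-up, highly repetitive content-stream payload for demonstration.
payload = b"0 0 m 100 100 l S\n" * 200
ascii85_stream = base64.a85encode(payload) + b"~>"
flate_stream = strip_ascii85_and_flate(ascii85_stream)

# ASCII85 alone expands binary data by about 25%; Flate shrinks this payload.
print(len(payload), len(ascii85_stream), len(flate_stream))
```

ASCII85 only makes binary data safe for 7-bit transport and never compresses it, so replacing it with a binary Flate stream cannot lose information and usually saves space.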

pts, Oct 11 '17 07:10

Oddly enough, Multivalent is very slow for me on numerical-analysis.pdf (it runs for more than 10 minutes); possibly it's stuck in an infinite loop. Could you please attach the console output of time ../pdfsizeopt --use-multivalent=yes --use-pngout=no numerical-analysis.pdf? Please also send me your output file numerical-analysis.psom.pdf. Do you notice any visual difference?
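
If the run might hang, one way to capture the console output with a time limit is a small wrapper like the one below. It is only a sketch: `run_with_timeout` is a hypothetical helper, not part of pdfsizeopt, and the time limits are arbitrary assumptions. Note that the timeout only kills the direct child process, so a Java process started by it may need to be stopped separately.

```python
import subprocess
import sys
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return (returncode or None on timeout, elapsed seconds, output)."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the captured output
            timeout=timeout_s,
            check=False,
        )
        returncode, output = proc.returncode, proc.stdout
    except subprocess.TimeoutExpired as exc:
        # The child was killed; keep whatever output was captured so far.
        returncode, output = None, exc.stdout or b""
    return returncode, time.monotonic() - start, output


# Harmless demonstration using the current Python interpreter.
rc, elapsed, out = run_with_timeout([sys.executable, "-c", "print('ok')"], 30)

# For the command quoted above, the call would look like this (30 minutes
# is an arbitrary limit):
# run_with_timeout(["../pdfsizeopt", "--use-multivalent=yes",
#                   "--use-pngout=no", "numerical-analysis.pdf"], 30 * 60)
```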

pts, Oct 13 '17 01:10

I managed to run Multivalent on numerical-analysis.pdf, and I can reproduce your results. No need to send any files.

pts, Oct 13 '17 22:10