
Book HTML images are not always sufficiently compressed

Open benoit74 opened this issue 1 month ago • 3 comments

In order to solve https://github.com/openzim/gutenberg/issues/288, we decided to stop compressing (optimizing) images on our own, especially images used in HTML books.

Recent runs and the analysis done in https://github.com/openzim/gutenberg/issues/374 showed that the optimization we were doing on images was far from useless.

Two examples below:

|           | 2023-08 | 2025-10 |
|-----------|---------|---------|
| Size      | 63k     | 101k    |
| Preview   | Image   | Image   |
| Online at | https://dev.library.kiwix.org/content/gutenberg_de_all_2023-08/53217_fig1.jpg | https://browse.library.kiwix.org/content/gutenberg_de_all_2025-10/53217_fig1.jpg |

|           | 2023-08 | 2025-10 |
|-----------|---------|---------|
| Size      | 30k     | 63k     |
| Preview   | Image   | Image   |
| Online at | https://dev.library.kiwix.org/content/gutenberg_de_all_2023-08/52492_abb12.jpg | https://browse.library.kiwix.org/content/gutenberg_de_all_2025-10/52492_abb12.jpg |

While we could see some visual artifacts induced by the higher compression, the difference in file size clearly favors it.
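To put the two examples in relative terms, here is a minimal sketch that computes how much larger each 2025-10 image is, and what share of its current size would be recovered by going back to the 2023-08 size. The sizes are the rounded values quoted above; the helper names are hypothetical and only serve the illustration.

```python
def growth_pct(old_kb, new_kb):
    """Percentage by which the new file is larger than the old one."""
    return (new_kb - old_kb) / old_kb * 100


def saving_pct(old_kb, new_kb):
    """Percentage of the new size that would be saved by returning to the old size."""
    return (new_kb - old_kb) / new_kb * 100


# Rounded sizes (in k) taken from the two examples in this issue.
examples = [("53217_fig1.jpg", 63, 101), ("52492_abb12.jpg", 30, 63)]

for name, old, new in examples:
    print(f"{name}: +{growth_pct(old, new):.1f}% larger, "
          f"{saving_pct(old, new):.1f}% recoverable")
# 53217_fig1.jpg: +60.3% larger, 37.6% recoverable
# 52492_abb12.jpg: +110.0% larger, 52.4% recoverable
```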

We should probably:

  • put compression of images back in place
  • confirm expected size difference (should save about 2.11G on Gutenberg DE)
  • decide how to handle optimization cache invalidation (see https://github.com/openzim/gutenberg/issues/288)
  • put optimization cache back in place
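One possible way to confirm the expected size difference is sketched below. It assumes the images of an unoptimized run and of an optimized run have each been extracted to a local directory, which is not something the scraper does today; the function names and the extension list are hypothetical.

```python
import os

# Assumed set of image extensions; adjust to what the books actually contain.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def total_image_bytes(root):
    """Sum the on-disk size of every image file found under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS:
                total += os.path.getsize(os.path.join(dirpath, name))
    return total


def bytes_saved(unoptimized_root, optimized_root):
    """Bytes saved by optimization, comparing two extracted image trees."""
    return total_image_bytes(unoptimized_root) - total_image_bytes(optimized_root)
```

Calling `bytes_saved("run_without_optim", "run_with_optim")` on two such (hypothetical) directories would give a figure to compare against the 2.11G estimate.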

Or should this kind of stronger image optimization be done directly on the Gutenberg side? @eshellman can you advise? I imagine compression matters less on the Gutenberg side than on the Kiwix side, but the savings are still not negligible.

benoit74 avatar Dec 08 '25 13:12 benoit74

PG doesn't do much compression. I would bet that only compressing the largest files could get you 80% of the total size reduction benefit with zero quality degradation on current displays.
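The "largest files carry most of the benefit" bet can be checked against a real size distribution. The sketch below is only an illustration with made-up sizes, not measured Gutenberg data: it computes what share of the total bytes sits in the largest fraction of files. The function name is hypothetical.

```python
def share_of_bytes_in_largest(sizes, top_fraction):
    """Fraction of total bytes held by the largest `top_fraction` of files."""
    total = sum(sizes)
    if total == 0:
        return 0.0
    ordered = sorted(sizes, reverse=True)
    # Always consider at least one file, even for a tiny fraction.
    count = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:count]) / total


# Made-up sizes: one large file and four small ones.
print(share_of_bytes_in_largest([800, 50, 50, 50, 50], 0.2))  # 0.8
```

Note that this measures the share of bytes, which is only an upper bound on the share of savings; the actual reduction depends on how well each file recompresses.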

eshellman avatar Dec 08 '25 14:12 eshellman

OK, that contradicts what I understood from https://github.com/openzim/gutenberg/issues/288#issuecomment-3339000032 where you said "I'm not sure what optimizing you can do". Can you confirm that I misinterpreted you (hoping you had said what I wanted to hear, a classic bias)?

benoit74 avatar Dec 09 '25 05:12 benoit74

https://github.com/gutenbergtools/ebookmaker/blob/fa2e91f4bf1ab75674919841c6c36a489596530b/src/ebookmaker/writers/EpubWriter.py#L50 sets the max image size to 256K (1M for linked images); so images smaller than that get no compression other than what zip does. What I was talking about in #288 was that, yes the ebook files are rebuilt every month (and have been since 2017?)
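As a rough illustration of that threshold rule (not the ebookmaker code itself), the sketch below checks whether an image is over the limit. The constant and function names are hypothetical, and it assumes "K" means 1024 bytes.

```python
KIB = 1024  # assumption: "256K" and "1M" are binary units

# Hypothetical names mirroring the limits described above.
MAX_IMAGE_SIZE = 256 * KIB
MAX_LINKED_IMAGE_SIZE = 1024 * KIB


def exceeds_limit(size_bytes, linked=False):
    """True if the image is larger than the limit and would be recompressed."""
    limit = MAX_LINKED_IMAGE_SIZE if linked else MAX_IMAGE_SIZE
    return size_bytes > limit


# The two 2025-10 examples from this issue (101k and 63k) stay under the limit.
print(exceeds_limit(101 * KIB))  # False
print(exceeds_limit(63 * KIB))   # False
print(exceeds_limit(300 * KIB))  # True
```

Under these assumptions, both example images above fall below the limit, which would be consistent with them receiving no compression upstream.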

eshellman avatar Dec 09 '25 15:12 eshellman