Book ePubs are not always sufficiently compressed

Open benoit74 opened this issue 1 month ago • 2 comments

To solve https://github.com/openzim/gutenberg/issues/288, https://github.com/openzim/gutenberg/issues/235 and https://github.com/openzim/gutenberg/issues/222, we decided to stop compressing (optimizing) ePubs ourselves.

Recent runs and the analysis done in https://github.com/openzim/gutenberg/issues/374 showed that the optimization we were doing on ePubs was not useless after all.

See for example book ID 63630:

  • 2023-08: size 265K, URL https://dev.library.kiwix.org/content/gutenberg_de_all_2023-08/Der%20Einzige%20auf%20der%20weiten%20Welt:%20Ein%20Menschenleben.63630.epub
  • 2025-10: size 520K, URL https://browse.library.kiwix.org/content/gutenberg_de_all_2025-10/Der%20Einzige%20auf%20der%20weiten%20Welt:%20Ein%20Menschenleben.63630.epub

Or book ID 68838:

  • 2023-08: size 438K, URL https://dev.library.kiwix.org/content/gutenberg_de_all_2023-08/Der%20Graf%20von%20Saint-Germain:%20Das%20Leben%20eines%20Alchimisten.68838.epub
  • 2025-10: size 4.6M, URL https://browse.library.kiwix.org/content/gutenberg_de_all_2025-10/Der%20Graf%20von%20Saint-Germain:%20Das%20Leben%20eines%20Alchimisten.68838.epub

It is very important to note that many ePubs from 2023-08 were missing all their images (including the two examples above, due to https://github.com/openzim/gutenberg/issues/222), but so far this is not sufficient to explain the whole file size increase.
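To see how much of a given ePub's size actually comes from images, a quick per-extension breakdown of the archive can help. This is only a minimal sketch using the Python standard library; the function names `size_by_extension` and `print_report` are made up for illustration and are not part of the scraper.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def size_by_extension(epub_path):
    """Return {extension: (uncompressed_bytes, compressed_bytes)} for an ePub.

    An ePub is a ZIP archive, so the per-entry sizes can be read straight
    from the ZIP directory without extracting anything.
    """
    totals = defaultdict(lambda: [0, 0])
    with zipfile.ZipFile(epub_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Group by lowercased suffix; entries such as `mimetype` have none.
            ext = PurePosixPath(info.filename).suffix.lower() or "(none)"
            totals[ext][0] += info.file_size
            totals[ext][1] += info.compress_size
    return {ext: tuple(sizes) for ext, sizes in totals.items()}


def print_report(epub_path):
    """Print one line per extension, largest compressed size first."""
    rows = sorted(
        size_by_extension(epub_path).items(),
        key=lambda item: item[1][1],
        reverse=True,
    )
    for ext, (raw, packed) in rows:
        print(f"{ext:10} raw={raw:>12} packed={packed:>12}")
```

Running `print_report` on the 2023-08 and 2025-10 files of the same book and comparing the image rows with the rest would show whether images alone account for the difference.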

I assume it would be safe to:

  • first fix https://github.com/openzim/gutenberg/issues/375
  • adapt the code to optimize ePub images (we could theoretically reuse the same images, but I am not sure that is feasible; at the very least, apply some compression settings)
  • confirm the expected size difference (this should save about 3G on Gutenberg DE)
  • if the size difference is not there, check what else could be optimized in the ePubs
  • if not yet done, decide how to handle optimization cache invalidation (see https://github.com/openzim/gutenberg/issues/288)
  • put optimization cache back in place
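As a starting point for the image optimization step, here is a rough sketch of rewriting an ePub while passing each entry through a hook. It is not the logic the scraper used before. The `repack_epub` function and its `transform` parameter are hypothetical names, and the actual image re-encoding (which would need an image library) is deliberately left out. The one hard constraint it respects comes from the ePub container format: `mimetype` must be the first entry in the archive and must be stored uncompressed.

```python
import zipfile

MIMETYPE = "mimetype"


def repack_epub(src_path, dst_path, transform=None, compresslevel=9):
    """Rewrite an ePub, deflating every entry except `mimetype`.

    `transform` is a hypothetical hook, a callable (name, data) -> data,
    where an image recompressor could be plugged in. When it is None the
    content of every entry is copied unchanged.
    """
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        names = [i.filename for i in src.infolist() if not i.is_dir()]

        # `mimetype` goes first and is stored without compression.
        if MIMETYPE in names:
            dst.writestr(
                zipfile.ZipInfo(MIMETYPE),
                src.read(MIMETYPE),
                compress_type=zipfile.ZIP_STORED,
            )

        for name in names:
            if name == MIMETYPE:
                continue
            data = src.read(name)
            if transform is not None:
                data = transform(name, data)
            dst.writestr(
                name,
                data,
                compress_type=zipfile.ZIP_DEFLATED,
                compresslevel=compresslevel,
            )
```

Deflating JPEG or PNG data again gains very little, so any significant saving would have to come from the `transform` hook re-encoding the images; the repacking by itself mostly helps the XHTML and CSS entries.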

Keep in mind that we have moved to the ePub3 format, so the optimization logic will probably be different from what we used to have.

benoit74 avatar Dec 08 '25 15:12 benoit74

If you're using the PG epub3 files, they will be larger than the epub2 ones for two reasons:

  1. max cover size/image size is larger
  2. some epub3 books contain audio files

eshellman avatar Dec 08 '25 16:12 eshellman

OK, but the books above have no audio, and I don't see why a larger cover image would make the ePub 10x bigger. It is at least worth checking whether simply compressing the images a bit more is enough to save a significant amount of space.

benoit74 avatar Dec 09 '25 05:12 benoit74