Book ePubs are not always sufficiently compressed
In order to solve https://github.com/openzim/gutenberg/issues/288 and https://github.com/openzim/gutenberg/issues/235 and https://github.com/openzim/gutenberg/issues/222, we've decided to stop compressing (optimizing) ePubs on our own.
Recent runs and the analysis done in https://github.com/openzim/gutenberg/issues/374 showed that the optimization we were doing on ePubs was not useless after all.
See for example book ID 63630:
| | 2023-08 | 2025-10 |
|---|---|---|
| Size | 265K | 520K |
| URL | https://dev.library.kiwix.org/content/gutenberg_de_all_2023-08/Der%20Einzige%20auf%20der%20weiten%20Welt:%20Ein%20Menschenleben.63630.epub | https://browse.library.kiwix.org/content/gutenberg_de_all_2025-10/Der%20Einzige%20auf%20der%20weiten%20Welt:%20Ein%20Menschenleben.63630.epub |
Or book ID 68838:
| | 2023-08 | 2025-10 |
|---|---|---|
| Size | 438K | 4.6M |
| URL | https://dev.library.kiwix.org/content/gutenberg_de_all_2023-08/Der%20Graf%20von%20Saint-Germain:%20Das%20Leben%20eines%20Alchimisten.68838.epub | https://browse.library.kiwix.org/content/gutenberg_de_all_2025-10/Der%20Graf%20von%20Saint-Germain:%20Das%20Leben%20eines%20Alchimisten.68838.epub |
It is very important to note that many ePubs from 2023-08 were missing all images (including the two examples above, due to https://github.com/openzim/gutenberg/issues/222), but so far this is not sufficient to explain the whole file size increase.
I assume it would be safe to:
- first fix https://github.com/openzim/gutenberg/issues/375
- adapt the code to optimize ePub images (we could theoretically reuse the same images, but I am not sure it is feasible; at the very least, apply some compression settings)
- confirm expected size difference (should save about 3G on Gutenberg DE)
- if size difference is not there, check what else could be optimized in ePub
- if not yet done, decide how to handle optimization cache invalidation (see https://github.com/openzim/gutenberg/issues/288)
- put optimization cache back in place
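For the "check what else could be optimized" step, one lossless option is to rewrite the ePub container with a higher DEFLATE level. The following is only a minimal sketch using the Python standard library; `repack_epub` is a hypothetical helper, not something that exists in the scraper today, and it does not touch image quality at all.

```python
import zipfile


def repack_epub(src_path, dst_path, level=9):
    """Rewrite an ePub with maximum DEFLATE compression (lossless).

    Hypothetical helper for illustration; not part of the current scraper.
    """
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        # The ePub container format requires `mimetype` to be the first
        # entry in the archive and to be stored uncompressed.
        dst.writestr(
            zipfile.ZipInfo("mimetype"),
            src.read("mimetype"),
            compress_type=zipfile.ZIP_STORED,
        )
        for info in src.infolist():
            if info.filename == "mimetype":
                continue
            # Every other entry is re-deflated at the requested level.
            dst.writestr(
                info.filename,
                src.read(info.filename),
                compress_type=zipfile.ZIP_DEFLATED,
                compresslevel=level,
            )
```

Already-compressed formats such as JPEG or PNG gain very little from this, so it can only complement image optimization, not replace it.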
Keep in mind that we've moved to the ePub3 format, so the optimization logic is probably going to be different from what we used to have.
If you're using the PG epub3, they will be larger than epub2 for two reasons:
- max cover size/image size is larger
- some epub3 books contain audio files
OK, but the books above have no audio, and I don't see why a larger cover image would 10x the ePub size. It is at least worth checking whether simply compressing images a bit more is enough to save significant space.
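To check where the bytes actually go in a given ePub, something like the sketch below could be run on both the 2023-08 and 2025-10 files. It only uses the standard library; `size_breakdown` and the extension sets are assumptions made for illustration, and the extension lists are certainly not exhaustive.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath

# Assumed extension lists, for illustration only.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp"}
AUDIO_EXTS = {".mp3", ".m4a", ".ogg", ".opus"}


def size_breakdown(epub_path):
    """Return the compressed size in bytes per category of ePub entry."""
    totals = Counter()
    with zipfile.ZipFile(epub_path) as zf:
        for info in zf.infolist():
            ext = PurePosixPath(info.filename).suffix.lower()
            if ext in IMAGE_EXTS:
                category = "images"
            elif ext in AUDIO_EXTS:
                category = "audio"
            else:
                category = "other"
            # compress_size is the space the entry takes inside the archive.
            totals[category] += info.compress_size
    return dict(totals)
```

If "images" dominates the 2025-10 files, stronger image compression is likely the main lever; if "other" grew a lot, the text and CSS payload deserves a closer look.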