Compress module strings as concatenation
Rewrite the string storage to use compressed concatenated strings and an index to slice them into the final user strings. Allow users to choose between zlib and bzip2 compression.
Supersedes https://github.com/cython/cython/pull/6969
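For readers following the thread, the storage scheme described above can be sketched in Python roughly as follows. This is only an illustration of the idea (one compressed blob plus a length index), not the C code that Cython actually generates; the names `pack_strings` and `unpack_strings` are hypothetical.

```python
import bz2
import zlib

# Hypothetical helpers illustrating "concatenate, compress, slice by index".
_CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
}

def pack_strings(strings, codec="zlib"):
    """Concatenate byte strings, compress the blob once, and record the lengths."""
    compress, _ = _CODECS[codec]
    blob = b"".join(strings)
    index = [len(s) for s in strings]  # lengths are enough to slice the blob later
    return compress(blob), index

def unpack_strings(compressed, index, codec="zlib"):
    """Decompress the blob once and slice it back into the original strings."""
    _, decompress = _CODECS[codec]
    blob = decompress(compressed)
    result, pos = [], 0
    for length in index:
        result.append(blob[pos:pos + length])
        pos += length
    return result

names = [b"__init__", b"__main__", b"__name__", b"__module__", b"__qualname__"]
packed, index = pack_strings(names, codec="bz2")
assert unpack_strings(packed, index, codec="bz2") == names
```

Compressing the concatenation rather than each string separately lets the compressor exploit repetition across strings, which is presumably where the size reduction comes from.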
Looking at the wheels that did manage to be uploaded - this at least seems to drop the size of the .so files in Cython reliably (i.e. none of them are any worse, most are a little smaller). Nothing dramatic but it looks like there's a consistent benefit.
All done. Thanks! With the latest changes, all of the benchmark modules show a decrease in size by 2-7%.
all of the benchmark modules show a decrease in size by 2-7%
To me that seems worthwhile. When I had a look at the Cython wheel earlier it looked similar for the .so files (although the wheel itself didn't change that much, probably because it's already compressed). So it probably doesn't help with distribution too much but may keep binaries on disk a little smaller.
the current behaviour is to fall back to uncompressed if the user specifies CYTHON_COMPRESS_STRINGS==3 and it isn't available. That seems reasonable and I don't think more customization is needed.
Should we fall back to zlib first? I consider it quite unlikely that the zlib module won't be available in a given Python installation; it's probably the most widely used compression algorithm worldwide.
We could write:
#if CYTHON_COMPRESS_STRINGS == 3
// zstd
#elif CYTHON_COMPRESS_STRINGS == 2
// bz2
#elif CYTHON_COMPRESS_STRINGS
// zlib
#else
// uncompressed
#endif
That way, users would at least benefit from some kind of decent compression even if they select zstd on older Python installations.
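The fallback order suggested above can be sketched in Python as follows. This is a hedged illustration only: `pick_codec` and `zstd_available` are hypothetical helpers, and the mapping of macro values (3 for zstd, 2 for bz2, any other non-zero value for zlib) is taken from the snippet in this comment. The availability check assumes the `compression.zstd` module that ships with Python 3.14 and later.

```python
def zstd_available():
    """Return True if the standard library zstd module can be imported."""
    try:
        import compression.zstd  # noqa: F401  (assumed: Python 3.14+ only)
    except ImportError:
        return False
    return True

def pick_codec(requested, have_zstd):
    """Map a CYTHON_COMPRESS_STRINGS-style value to a codec name.

    Hypothetical helper: zstd falls back to zlib, not to uncompressed.
    """
    if requested == 3:
        return "zstd" if have_zstd else "zlib"
    if requested == 2:
        return "bz2"
    if requested:
        return "zlib"
    return "uncompressed"

print(pick_codec(3, zstd_available()))
```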
Yeah that makes sense to me.
Don't think I have any further comments on this...
It's difficult to say whether the users who see the ImportError will mostly be those who install a package (possibly transitively) or those who build it (and potentially a constrained runtime) themselves. The latter can certainly set the macro, but the former need to install something. I doubt that there are many of the latter type.
I noticed that the separate storage of Unicode and bytes strings was actually unnecessary. We can merge bytes and Unicode strings into a single byte sequence, with the first byte string being the complete UTF-8 encoded Unicode string.
That gives much better overall compression ratios again.
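A rough Python sketch of the merged layout, for illustration: all Unicode strings are joined and UTF-8 encoded into the first segment of the byte sequence, and the byte strings follow it. The index layout (the `text_bytes`, `text_lengths` and `byte_lengths` keys) and the function names are assumptions made for this example, not the format the PR actually uses.

```python
import zlib

def pack(unicode_strings, byte_strings):
    """Store text and bytes in one compressed blob (hypothetical layout)."""
    text_blob = "".join(unicode_strings).encode("utf-8")
    blob = text_blob + b"".join(byte_strings)
    index = {
        "text_bytes": len(text_blob),                       # size of the UTF-8 segment
        "text_lengths": [len(s) for s in unicode_strings],  # in code points
        "byte_lengths": [len(b) for b in byte_strings],     # in bytes
    }
    return zlib.compress(blob), index

def unpack(compressed, index):
    """Decompress once, decode the text segment once, then slice both parts."""
    blob = zlib.decompress(compressed)
    text = blob[:index["text_bytes"]].decode("utf-8")
    texts, pos = [], 0
    for n in index["text_lengths"]:
        texts.append(text[pos:pos + n])
        pos += n
    raw, pos = [], index["text_bytes"]
    for n in index["byte_lengths"]:
        raw.append(blob[pos:pos + n])
        pos += n
    return texts, raw

texts, raw = unpack(*pack(["größe", "name"], [b"\x00\xff", b"name"]))
assert texts == ["größe", "name"] and raw == [b"\x00\xff", b"name"]
```

With a single blob, a name that appears both as text and as bytes sits in the same compression window, which is one plausible reason for the better ratio.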