
Compress module strings as concatenation

Open scoder opened this issue 6 months ago • 8 comments

Rewrite the string storage to use compressed concatenated strings and an index to slice them into the final user strings. Allow users to choose between zlib and bzip2 compression.
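To illustrate the idea, here is a minimal Python sketch of "concatenate, compress, and slice by index". It is not the code Cython generates; the function names `pack_strings` and `unpack_strings` and the choice of a plain list of lengths as the index are assumptions made for this example only.

```python
import zlib
from itertools import accumulate


def pack_strings(strings, codec=zlib):
    """Concatenate byte strings, compress them once, and record each length.

    Hypothetical helper: 'codec' is any module that offers compress() and
    decompress(), such as zlib or bz2.
    """
    lengths = [len(s) for s in strings]
    return codec.compress(b"".join(strings)), lengths


def unpack_strings(blob, lengths, codec=zlib):
    """Decompress the blob and slice it back into the original strings."""
    data = codec.decompress(blob)
    ends = list(accumulate(lengths))   # running end offsets
    starts = [0] + ends[:-1]           # each string starts where the previous one ended
    return [data[a:b] for a, b in zip(starts, ends)]


names = [b"__init__", b"self", b"value", b"__init_subclass__"]
blob, index = pack_strings(names)
restored = unpack_strings(blob, index)
```

Compressing one concatenated buffer lets the compressor exploit repetition across strings, which per-string compression could not do. Passing `codec=bz2` (after `import bz2`) would switch the algorithm without changing the index.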

Supersedes https://github.com/cython/cython/pull/6969

scoder avatar Jun 17 '25 20:06 scoder

Looking at the wheels that did manage to be uploaded - this at least seems to drop the size of the .so files in Cython reliably (i.e. none of them are any worse, most are a little smaller). Nothing dramatic but it looks like there's a consistent benefit.

da-woods avatar Jun 18 '25 06:06 da-woods

All done. Thanks! With the latest changes, all of the benchmark modules show a decrease in size by 2-7%.

scoder avatar Jun 18 '25 19:06 scoder

all of the benchmark modules show a decrease in size by 2-7%

To me that seems worthwhile. When I had a look at the Cython wheel earlier it looked similar for the .so files (although the wheel itself didn't change that much probably because it's already compressed). So it probably doesn't help with distribution too much but may keep binaries on disk a little smaller.

da-woods avatar Jun 18 '25 19:06 da-woods

the current behaviour is to fall back to uncompressed if the user specifies CYTHON_COMPRESS_STRINGS==3 and it isn't available. That seems reasonable and I don't think more customization is needed.

Should we fall back to zlib first? I consider it quite unlikely that the zlib module won't be available in a given Python installation, it's probably the most widely used compression algorithm worldwide.

We could write:

#if CYTHON_COMPRESS_STRINGS == 3
  // zstd
#elif CYTHON_COMPRESS_STRINGS == 2
  // bz2
#elif CYTHON_COMPRESS_STRINGS
  // zlib
#else
  // uncompressed
#endif

That way, users would at least benefit from some kind of decent compression even if they select zstd on older Python installations.
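The same fallback order can be sketched in Python as a lookup that tries the requested codec first, then zlib, then gives up on compression. This is only an illustration of the proposal, not Cython's implementation: `pick_codec` and `CODEC_MODULES` are hypothetical names, and the mapping of the values 1, 2 and 3 to modules is inferred from the preprocessor sketch above. `compression.zstd` is the standard library module added in Python 3.14.

```python
import importlib

# Assumed mapping from the CYTHON_COMPRESS_STRINGS value to a stdlib module.
CODEC_MODULES = {3: "compression.zstd", 2: "bz2", 1: "zlib"}


def pick_codec(level, importer=importlib.import_module):
    """Return (module_name, module), falling back to zlib, then to (None, None)."""
    candidates = []
    if level in CODEC_MODULES:
        candidates.append(CODEC_MODULES[level])
    if level and "zlib" not in candidates:
        candidates.append("zlib")       # any non-zero setting ends at zlib
    for name in candidates:
        try:
            return name, importer(name)
        except ImportError:             # module missing in this installation
            continue
    return None, None                   # store the strings uncompressed


name, codec = pick_codec(3)
if codec is not None:
    payload = codec.compress(b"some module strings")
```

On an interpreter without zstd support, `pick_codec(3)` yields zlib rather than no compression, which is the behaviour argued for above.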

scoder avatar Jun 19 '25 06:06 scoder

the current behaviour is to fall back to uncompressed if the user specifies CYTHON_COMPRESS_STRINGS==3 and it isn't available. That seems reasonable and I don't think more customization is needed.

Should we fall back to zlib first? I consider it quite unlikely that the zlib module won't be available in a given Python installation, it's probably the most widely used compression algorithm worldwide.

Yeah that makes sense to me.

da-woods avatar Jun 19 '25 11:06 da-woods

Don't think I have any further comments on this...

da-woods avatar Jun 19 '25 17:06 da-woods

It's difficult to say whether the users who see the ImportError will mostly be those who try to install a package (possibly as a transitive dependency) or those who build it (and potentially a constrained runtime) themselves. The latter can certainly set the macro, but the former need to install something. I doubt that there are many of the latter type.

scoder avatar Jun 19 '25 19:06 scoder

I noticed that the separate storage of Unicode and bytes strings was actually unnecessary. We can merge bytes and Unicode strings into a single byte sequence, with the first byte string being the complete UTF-8 encoded Unicode string.

That gives much better overall compression ratios again.
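A rough Python sketch of the merged layout, under the assumption that the index stores the size of the UTF-8 chunk plus per-string lengths (characters for Unicode strings, bytes for byte strings). The names `pack_merged` and `unpack_merged` and the dictionary-shaped index are made up for this example and are not taken from the pull request.

```python
import zlib


def pack_merged(unicode_strings, byte_strings):
    """Store all text as one UTF-8 chunk, followed by the raw byte strings."""
    text_chunk = "".join(unicode_strings).encode("utf-8")
    blob = text_chunk + b"".join(byte_strings)
    index = {
        "text_size": len(text_chunk),                        # bytes used by the UTF-8 chunk
        "text_lengths": [len(s) for s in unicode_strings],   # counted in characters
        "byte_lengths": [len(b) for b in byte_strings],      # counted in bytes
    }
    return zlib.compress(blob), index


def unpack_merged(compressed, index):
    """Decompress once, decode the leading text chunk, then slice both kinds."""
    blob = zlib.decompress(compressed)
    text = blob[:index["text_size"]].decode("utf-8")

    unicode_strings, pos = [], 0
    for n in index["text_lengths"]:
        unicode_strings.append(text[pos:pos + n])
        pos += n

    byte_strings, pos = [], index["text_size"]
    for n in index["byte_lengths"]:
        byte_strings.append(blob[pos:pos + n])
        pos += n
    return unicode_strings, byte_strings


texts = ["name", "größe", "__init__"]
raws = [b"\x00\xff", b"raw"]
packed, idx = pack_merged(texts, raws)
```

With a single buffer the compressor sees text and bytes data in one stream, so names that occur in both forms can share matches.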

scoder avatar Jun 19 '25 21:06 scoder