rpc: add zstd streaming compressor
TODO:
- Add zstd to seastar dependencies.
- Check that exceptions in the constructor of the compressor are handled properly.
Considerations:
With a log window size of 16 (chosen in this patch), zstd will be able to backreference 64 kiB into the past, at the cost of ~521 kiB per compressor and ~413 kiB per decompressor, which adds up to ~1 MiB per connection that uses this compression. (Note: since these are separate allocations now, it really adds up to 1.5 MiB due to separate rounding. But we can fix that, if we want.)
This can add up to significant overhead with a large number of RPC connections. Users of Seastar should consider enabling it selectively, only for the most voluminous connections.
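To make the shape of the feature concrete, here is a minimal sketch (not the exact code in this patch) of a per-connection streaming compressor with these settings; names are illustrative only:

```cpp
// Minimal sketch, not the patch's actual code: a per-connection streaming
// zstd compressor. State (the 64 kiB window) is kept across messages, and
// every message ends with a flush so the peer can decode it immediately.
#include <zstd.h>
#include <cstddef>
#include <new>
#include <stdexcept>
#include <vector>

class zstd_stream_compressor {
    ZSTD_CCtx* _cctx = ZSTD_createCCtx();
public:
    zstd_stream_compressor() {
        if (!_cctx) {
            throw std::bad_alloc();
        }
        // Level 1 with a 2^16 = 64 kiB window, as described above.
        ZSTD_CCtx_setParameter(_cctx, ZSTD_c_compressionLevel, 1);
        ZSTD_CCtx_setParameter(_cctx, ZSTD_c_windowLog, 16);
    }
    zstd_stream_compressor(const zstd_stream_compressor&) = delete;
    zstd_stream_compressor& operator=(const zstd_stream_compressor&) = delete;
    ~zstd_stream_compressor() { ZSTD_freeCCtx(_cctx); }

    // Compresses one RPC message, reusing history from previous messages.
    std::vector<char> compress(const char* msg, size_t len) {
        std::vector<char> out;
        ZSTD_inBuffer in{msg, len, 0};
        size_t remaining;
        do {
            size_t old = out.size();
            out.resize(old + ZSTD_CStreamOutSize());
            ZSTD_outBuffer ob{out.data() + old, out.size() - old, 0};
            // ZSTD_e_flush closes the current block(s) so the message is
            // decodable on its own, but keeps the window for later messages.
            remaining = ZSTD_compressStream2(_cctx, &ob, &in, ZSTD_e_flush);
            if (ZSTD_isError(remaining)) {
                throw std::runtime_error(ZSTD_getErrorName(remaining));
            }
            out.resize(old + ob.pos);
        } while (remaining != 0 || in.pos < len);
        return out;
    }
};
```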
Refs #14491
cc @nyh @mykaul Re: higher compression. Here's an implementation using streaming ZSTD.
Tested with Scylla.
With a log window size of 16 (chosen in this patch), zstd will be able to backreference 64 kiB into the past, at the cost of ~521 kiB per compressor and ~413 kiB per decompressor, which adds up to ~1 MiB per connection that uses this compression. (Note: since these are separate allocations now, it really adds up to 1.5 MiB due to separate rounding. But we can fix that, if we want.)
@gleb-cloudius was specifically concerned with this memory cost.
Considerations:
With a log window size of 16 (chosen in this patch), zstd will be able to backreference 64 kiB into the past, at the cost of ~521 kiB per compressor and ~413 kiB per decompressor, which adds up to ~1 MiB per connection that uses this compression. (Note: since these are separate allocations now, it really adds up to 1.5 MiB due to separate rounding. But we can fix that, if we want.)
This is a huge tradeoff, which Seastar users should be able to control - as users of a library should be able to do. Most compression programs give you a three-way tradeoff between compression ratio, memory use and CPU use, and as you noted, you need this here too.
The memory/ratio tradeoff you already mentioned: For example, changing from log window size 16 to 15 will cut the memory use in half, but also mean a smaller (32KB) sliding window, so the compression ratio will be lower - but how much lower? Maybe it's a worthy tradeoff. Maybe even smaller windows can be good enough for most users? Maybe even a small window can be significantly better than the option we have today, namely lz4?
There might also be a memory/cpu tradeoff that you didn't mention: Why does a 64KB sliding window take 521 KB (rounded up to 1MB!) of memory? Could zstd perhaps be configured to store the sliding window in a less wasteful way, perhaps at the cost of CPU to find matches in this window?
This can add up to significant overhead with a large number of RPC connections. Users of Seastar should consider enabling it selectively, only for the most voluminous connections.
If we had a configurable window-size parameter, we could imagine having the code change this parameter automatically according to total volume - e.g., start with a 4KB window and switch to 8KB after 1GB of data, and so on (both sender and receiver need to switch at the same time based on the same knowledge). But I think this would be too complicated to be a good idea.
Scylla, for example, could enable it only for connections to sibling shards on other nodes, but not for non-sibling shards on other nodes. This would reap near-full benefits of compression in well-behaved (i.e. homogeneous) clusters, without inviting a catastrophic buffer size explosion in large clusters.
When a coordinator forwards a request to a replica, is it always between sibling shards?
I suspect the move to zstd by itself is a nice improvement (compression-wise) for some workloads, just because it'll add the entropy encoding which LZ4 lacks. Streaming will add another improvement, and therefore I think it could use a minimal window size, to reduce the memory overhead.
I suspect the move to zstd by itself is a nice improvement (compression-wise) for some workloads, just because it'll add the entropy encoding which LZ4 lacks. Streaming will add another improvement, and therefore I think it could use a minimal window size, to reduce the memory overhead.
If RPC without a history compresses each message separately, it can result in horrible compression (potentially, even making things worse than plaintext!) for very short mutations. So for very short mutations, it does make some sense to have a sliding window of some size - even a small 4KB one - and keep it across messages. But for very long mutations, it might actually be better to not keep a small sliding window across messages - it would be better to have a long, but fresh, sliding window per message (no memory between messages).
Since Seastar is a library, I think we have no choice but to provide all of these options to the application. I guess some of the combinations will never make sense, but I think there isn't the "best" option that everyone will want.
This is a huge tradeoff, which Seastar users should be able to control - as users of a library should be able to do. Most compression programs give you a three-way tradeoff between compression ratio, memory use and CPU use, and as you noted, you need this here too. The memory/ratio tradeoff you already mentioned: For example, changing from log window size 16 to 15 will cut the memory use in half, but also mean a smaller (32KB) sliding window, so the compression ratio will be lower - but how much lower? Maybe it's a worthy tradeoff. Maybe even smaller windows can be good enough for most users? Maybe even a small window can be significantly better than the option we have today, namely lz4?
@nyh True, but note that parameter control is something that can be added later. Compression parameters are not a part of the protocol implemented here.
However, we might want to make the parameters live-updatable, and in that case we have to extend the protocol with some way to indicate a reset of the compression state. We could add an extra flag byte to the header for that purpose.
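For illustration only - nothing like this exists in the protocol yet, and the names are made up - the flag byte idea could look roughly like this:

```cpp
// Purely hypothetical sketch of the extra flag byte; not part of the
// current protocol, field names invented for illustration.
#include <cstdint>

enum compressed_frame_flags : uint8_t {
    // The sender reset its compression state before this message;
    // the receiver must reset its decompressor state too.
    state_reset = 1 << 0,
};

struct compressed_frame_header {
    uint8_t flags;   // hypothetical extra flag byte, 0 in the common case
    // ... the existing header fields (sizes, etc.) would follow ...
};
```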
(rounded up to 1MB!)
Clarification: it's not zstd that is rounding allocations up to a power of 2, but Seastar's main allocator.
There might also be a memory/cpu tradeoff that you didn't mention: Why does a 64KB sliding window take 521 KB (rounded up to 1MB!) of memory? Could zstd perhaps be configured to store the sliding window in a less wasteful way, perhaps at the cost of CPU to find matches in this window?
Zstd is storing a bunch of internal state in that memory, not just the window.
The memory usage of the compressor in this setup is roughly windowSize + 6*blockSize + 90 kiB.
For the decompressor it's roughly windowSize + 2*blockSize + 220 kiB.
The default block size is min(windowSize, 128 kiB). So if windowSize == 64 kiB, then also blockSize == 64 kiB, and the formulas yield the values I mentioned in the first post.
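To make the arithmetic above concrete, here is a tiny helper that just restates those rough formulas (the constants are the approximations quoted in this comment, not exact figures from zstd):

```cpp
// Rough per-connection memory estimates, restating the approximate
// formulas above; the constants come from this comment, not from zstd.
#include <algorithm>
#include <cstddef>

constexpr size_t kiB = 1024;

constexpr size_t zstd_default_block_size(size_t window_size) {
    return std::min(window_size, 128 * kiB);
}

constexpr size_t estimated_compressor_mem(size_t window_size) {
    return window_size + 6 * zstd_default_block_size(window_size) + 90 * kiB;
}

constexpr size_t estimated_decompressor_mem(size_t window_size) {
    return window_size + 2 * zstd_default_block_size(window_size) + 220 * kiB;
}

// With a 64 kiB window: ~538 kiB and ~412 kiB, i.e. the same ballpark as
// the ~521 kiB / ~413 kiB figures quoted at the top of the thread.
static_assert(estimated_compressor_mem(64 * kiB) == 538 * kiB);
static_assert(estimated_decompressor_mem(64 * kiB) == 412 * kiB);
```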
AFAIK there is no way to configure zstd to make this "less wasteful". But we can shrink the blockSize even further.
Additionally, 1 blockSize per compressor/decompressor could be saved by switching to an advanced buffer-less API, which is harder to use and also was recently deprecated. So I think it's not worth the hassle.
If we had a configurable window-size parameter, we could imagine having the code change this parameter automatically according to total volume - e.g., start with a 4KB window and switch to 8KB after 1GB of data, and so on (both sender and receiver need to switch at the same time based on the same knowledge). But I think this would be too complicated to be a good idea.
IIUC this idea only works for short-lived connections, but not long-lived-but-slow connections. The latter will eventually reach the maximum window (/block) size, so the idea would only delay the waste, not avoid it.
Scylla, for example, could enable it only for connections to sibling shards on other nodes, but not for non-sibling shards on other nodes. This would reap near-full benefits of compression in well-behaved (i.e. homogeneous) clusters, without inviting a catastrophic buffer size explosion in large clusters.
When a coordinator forwards a request to a replica, is it always between sibling shards?
Yes, they are always between sibling shards. (Where "sibling shards" are the ones with equal shard_id % min(node1_shard_count, node2_shard_count).)
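In code, that definition would be something like (names hypothetical):

```cpp
// Sketch of the "sibling shards" definition above; names are hypothetical.
#include <algorithm>
#include <cstdint>

bool are_sibling_shards(uint32_t shard1, uint32_t node1_shard_count,
                        uint32_t shard2, uint32_t node2_shard_count) {
    uint32_t m = std::min(node1_shard_count, node2_shard_count);
    return shard1 % m == shard2 % m;
}
```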
But actually I was wrong above, because Scylla doesn't maintain any connections to non-siblings in the first place. So that concern was invalid.
However, to each sibling it has 2 connections (gossip, streaming) + 3 connections per service level (statements, MUTATION_DONE, forward service), but only streaming and statement connections send notable volumes of data. So what I should have really said there instead is that Scylla should not enable compression for the connections where it doesn't matter.
The fact that there is a separate statement connection per service level is notable here. Perhaps compression settings should be controlled separately per service level.
So for very short mutations, it does make some sense to have a sliding window of some size - even a small 4KB one - and keep it across messages.
I think it makes more than "some" sense, because this window is what allows you to do inter-mutation compression, and I'd guess that's where the real redundancy is.
But for very long mutations, it might actually be better to not keep a small sliding window across messages - it would be better to have a long, but fresh, sliding window per message (no memory between messages).
That's a very good point which I didn't think about. We can consider adding a special behavior for large messages which doesn't put them through the streaming compression, but does a one-shot compression with a larger window size. But I have no intuition about how much of an improvement that could give in practice.
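A hedged sketch of what that special-casing might look like - the threshold and names are invented, and this is not part of the patch:

```cpp
// Hypothetical sketch of the "large message" special case: one-shot
// compression with a fresh context and a larger window, instead of the
// shared streaming context. Threshold and names are made up.
#include <zstd.h>
#include <cstddef>
#include <new>
#include <stdexcept>
#include <vector>

constexpr size_t large_msg_threshold = 1 << 20;  // 1 MiB, arbitrary

std::vector<char> compress_one_shot(const char* msg, size_t len, int window_log) {
    ZSTD_CCtx* cctx = ZSTD_createCCtx();   // temporary context, freed below
    if (!cctx) {
        throw std::bad_alloc();
    }
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 1);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, window_log);
    std::vector<char> out(ZSTD_compressBound(len));
    size_t n = ZSTD_compress2(cctx, out.data(), out.size(), msg, len);
    ZSTD_freeCCtx(cctx);
    if (ZSTD_isError(n)) {
        throw std::runtime_error(ZSTD_getErrorName(n));
    }
    out.resize(n);
    return out;
}
```

Messages below the threshold would keep going through the shared streaming context as before.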
Since Seastar is a library, I think we have no choice but to provide all of these options to the application. I guess some of the combinations will never make sense, but I think there isn't the "best" option that everyone will want.
There is a different choice here. Compression is actually not coupled with Seastar. Users can write custom compressors for their connections. So what we can do instead is to start by implementing and using the zstd compressor in Scylla, and graduate it to Seastar later, when we are more experienced with it.
I suspect the move to zstd by itself is a nice improvement (compression-wise) for some workloads, just because it'll add the entropy encoding which LZ4 lacks. Streaming will add another improvement, and therefore I think it could use a minimal window size, to reduce the memory overhead.
@mykaul It seems to me that our intuition about compression parameters is very limited, so we should be careful with this kind of statement. I think the best way to learn how the parameters perform in practice is to make them live-updatable, and try out various values for them in multiple practical use cases (e.g. in different Scylla clusters).
@nyh True, but note that parameter control is something that can be added later. Compression parameters are not a part of the protocol implemented here.
Yes, definitely this PR can be improved (and you even marked it as Draft). But right from the start we should be aware of the limitations of this version and of the inherent limitations of the whole approach. And I was hoping that we'll reach the conclusion that the 1.5MB storage per connection is not an inherent limitation but a configurable parameter (even if we don't configure it yet).
I remember many years ago, when I first read (and was amazed) about bzip and its (then) new block-sorting algorithm (the Burrows–Wheeler transform), it had a configurable-sized buffer, from 100KB to 900KB. The 900KB buffer seemed ridiculously high, and nobody actually used it. And today we're thinking about 1.5MB per connection as something very reasonable ;-)
However, we might want to make the parameters live-updatable, and in that case we have to extend the protocol with some way to indicate a reset of the compression state. We could add an extra flag byte to the header for that purpose.
The other option we already have is to negotiate the algorithm on each connection, so to get a different compression setting to take effect, you just need to somehow close all existing connections, eventually. Maybe there should be an operation related to "live update" to achieve that.
Zstd is storing a bunch of internal state in that memory, not just the window. The memory usage of the compressor in this setup is roughly windowSize + 6*blockSize + 90 kiB. For the decompressor it's roughly windowSize + 2*blockSize + 220 kiB. The default block size is min(windowSize, 128 kiB). So if windowSize == 64 kiB, then also blockSize == 64 kiB, and the formulas yield the values I mentioned in the first post.
I see. But I don't understand why blocksize needs to be related to windowSize, or where the 6*blocksize comes from. To be honest, I'm not even sure what the concept of "blocksize" means when you need to compress individual messages separately (using shared history, but still - separately compress and send each, without the concept of "blocks").
Since the formula above has windowSize only once and blockSize 6 times, it seems important to understand what this blockSize means. I plan to read the zstd documentation/papers, but haven't done so yet :-(
There is a different choice here. Compression is actually not coupled with Seastar. Users can write custom compressors for their connections. So what we can do instead is to start by implementing and using the zstd compressor in Scylla, and graduate it to Seastar later, when we are more experienced with it.
That's true. I was commenting about Seastar because this is where we're holding this discussion.
Adding an idea (which I do not think is a good one, but wanted to ensure we explore it) - what if we add entropy encoding (such as huff0 - https://github.com/Cyan4973/FiniteStateEntropy ) over LZ4?
I was hoping that we'll reach the conclusion that the 1.5MB storage per connection is not an inherent limitation but a configurable parameter (even if we don't configure it yet).
As far as I understand zstd sources, it isn't inherent to how zstd works. Most of this memory (the blockSize*6 thing) is for temporary buffers which don't have to persist across blocks. These buffers could be shared between connections, and then the only per-connection state would be the sliding window and an index (a relatively small hash table) over the window. (And some entropy coding tables, which are even smaller).
However, zstd exposes no API which allows for such a setup. There is only one "compression context" and it contains both the persistent and the temporary buffers. We would have to fork zstd and modify it so that the two sets of buffers are allocated separately, and that the temporary buffers can be shared between compression contexts. I think I could do it in two days or so, unless I'm misunderstanding something important. But obviously forking a library comes with its own set of problems.
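To illustrate the shape such a fork would aim for - none of these types exist in zstd today, this is purely a sketch of the idea:

```cpp
// Purely hypothetical: how state could be split if the temporary buffers
// were separated from the per-connection state. zstd offers no such API
// today; this only illustrates the intended shape.
#include <cstddef>
#include <vector>

// Per-shard scratch (roughly the 6*blockSize temporary buffers), shared by
// all connections on the shard, since only one compression runs at a time.
struct shared_compression_scratch {
    std::vector<char> workspace;
};

// Per-connection persistent state: the sliding window, its match-finder
// index (a small hash table), and the entropy coding tables.
struct per_connection_compression_state {
    std::vector<char> window;
    std::vector<char> match_index;
    std::vector<char> entropy_tables;
};

// A compression call would borrow the shared scratch only for its duration.
size_t compress_with_shared_scratch(per_connection_compression_state& conn,
                                    shared_compression_scratch& scratch,
                                    const char* src, size_t src_len,
                                    char* dst, size_t dst_capacity);
```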
The other option we already have is to negotiate the algorithm on each connection, so to get a different compression setting to take effect, you just need to somehow close all existing connections, eventually.
Yes, but restarting connections is too disruptive a technique, if the same effect can be achieved within the compression protocol.
I see. But I don't understand why blocksize needs to be related to windowSize
I don't understand this one either.
or where the 6*blocksize comes from.
This I understand from reading zstd sources, but I won't attempt to explain it here. It's not that important. The important thing is that — IIUC — this buffer is temporary in principle and could be shared between connections in the ideal setup.
To be honest, I'm not even sure what the concept of "blocksize" means when you need to compress individual messages separately (using shared history, but still - separately compress and send each, without the concept of "blocks").
The implementation needs a few bytes of "workspace" buffers per each byte of the input, so long messages have to be broken down into separately-compressed blocks of fixed size (the last one can be smaller) to keep the memory usage constant after initialization. Thus "block size" governs the size of internal temporary buffers needed by the implementation, and the effectiveness of entropy coding (ideally blocks should be big enough so that the coding tables don't have too much overhead with respect to data, but they also should be small enough so that local probability changes are exploited effectively).
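For completeness, here is a hedged sketch of the receiving side (again, not the patch's exact code); it also shows that callers never see block boundaries - the library cuts and reassembles blocks internally:

```cpp
// Sketch of the matching streaming decompressor: the caller just feeds
// compressed bytes and gets plain bytes back; splitting into blocks of at
// most blockSize happens inside the library.
#include <zstd.h>
#include <cstddef>
#include <new>
#include <stdexcept>
#include <vector>

class zstd_stream_decompressor {
    ZSTD_DCtx* _dctx = ZSTD_createDCtx();
public:
    zstd_stream_decompressor() {
        if (!_dctx) {
            throw std::bad_alloc();
        }
        // Refuse frames whose window is larger than 2^16, to bound memory.
        ZSTD_DCtx_setParameter(_dctx, ZSTD_d_windowLogMax, 16);
    }
    zstd_stream_decompressor(const zstd_stream_decompressor&) = delete;
    zstd_stream_decompressor& operator=(const zstd_stream_decompressor&) = delete;
    ~zstd_stream_decompressor() { ZSTD_freeDCtx(_dctx); }

    std::vector<char> decompress(const char* msg, size_t len) {
        std::vector<char> out;
        ZSTD_inBuffer in{msg, len, 0};
        bool maybe_more = true;  // run at least once; stays true while zstd may hold buffered output
        while (in.pos < in.size || maybe_more) {
            size_t old = out.size();
            out.resize(old + ZSTD_DStreamOutSize());
            ZSTD_outBuffer ob{out.data() + old, out.size() - old, 0};
            size_t rc = ZSTD_decompressStream(_dctx, &ob, &in);
            if (ZSTD_isError(rc)) {
                throw std::runtime_error(ZSTD_getErrorName(rc));
            }
            out.resize(old + ob.pos);
            maybe_more = (ob.pos == ZSTD_DStreamOutSize());
        }
        return out;
    }
};
```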
Since the formula above has windowSize only once and blockSize 6 times, it seems important to understand what this blockSize means. I plan to read the zstd documentation/papers, but haven't done so yet :-(
I couldn't find any documentation or papers about the internals. I had to go and read the sources to learn anything. If you find some papers, link them, please.
Adding an idea (which I do not think is a good one, but wanted to ensure we explore it) - what if we add entropy encoding (such as huff0 - https://github.com/Cyan4973/FiniteStateEntropy ) over LZ4?
We could try, but it sounds like hand-rolling an inferior version of zstd. If you just pipe the lz4 output to an entropy coder (with bytes as the alphabet), you are losing the literals+matches structure, which provides a more effective model than raw bytes. An integrated algorithm, like zstd, can (and in zstd's case — does) encode literals and matches separately, and even with different algorithms.
I couldn't find any documentation or papers about the internals. I had to go and read the sources to learn anything. If you find some papers, link them, please.
I found a paper "Asymmetric numeral systems" by Jarek Duda, about his improved entropy coding which supposedly beats the speed of arithmetic coding (used by bzip before the whole patent fiasco) and beats the compression ratio of Huffman trees. But it doesn't explain specific details about the code.
There's Facebook's not-a-paper https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/
It's surprising that nobody (?) wanted to use zstd for exactly the purpose we want - to compress a long-running stream of messages over TCP. Any such use case will want to store as little data as possible in the long term (basically, the sliding window and some sort of hash table). It might need additional temporary memory while compressing, but not permanently while the connection isn't used.
@alecco - you may want to look at this - and I'd appreciate your experience in how we can reduce the memory usage here.
ScyllaDB has O(nr_nodes * 3 * nr_service_levels) connections per shard, and the same number of incoming connections.
With 100 nodes, 100 shards, we have 60,000 connections. We can afford at most 10kB per connection here.
ScyllaDB has O(nr_nodes * 3 * nr_service_levels) connections per shard, and the same number of incoming connections.
With 100 nodes, 100 shards, we have 60,000 connections. We can afford at most 10kB per connection here.
Only if you want to make this compression the default for all possible connections in all clusters. But it's not what it's about.
Consider the following: the entire discussion about maybe using stronger compression in RPC was prompted by a user who is allegedly "bottlenecked" on networking bills, particularly cross-dc replication.
This user has 36 nodes in 6 DCs. If on every shard you only enable compression for 2 connections (the streaming connection and the data connection for a single service level) to every remote-DC node, you have 60 compressed (bidirectional) connections on each shard in the cluster.
Each shard has 8 GiB of memory. Assuming you are willing to sacrifice 1% of the node's memory for this feature, you get a budget of 1% of 8 GiB / 60 ≈ 1.33 MiB per each (bidirectional) compressed connection. Or am I missing something fundamental?
I don't think zstd is a sane default for this use case. It's not designed for that. But it might be worth trying in cases where network transfer is the bottleneck. Importantly, RPC compressor designs are "cheap" — we have compressor negotiation so there is no need to be concerned with backwards compatibility. We can stuff in as many crazy compression ideas as we want, and delete them when we please. (As long as they stay an experimental API.)
This user has 36 nodes in 6 DCs. If on every shard you only enable compression for 2 connections (the streaming connection and the data connection for a single service level) to every remote-DC node, you have 60 compressed (bidirectional) connections on each shard in the cluster.
Cross AZ traffic, while 50% of the cost of cross region traffic, is not free either (at least on AWS). I do agree this compromise of compressing only some of the connections is the most reasonable path.
@alecco - you may want to look at this - and I'd appreciate your experience in how we can reduce the memory usage here.
zstd supports dictionaries for streaming; perhaps controlling a collection of dictionaries (e.g. by rpc type) could help. The dictionary size can be defined.
But it's a bit more work and needs testing to see if it's worth it. Depends on what the bulk of the RPC data is.
Cross-AZ traffic is actually more expensive. Cross-AZ costs 1c per direction, so 2c/GB. Cross-region costs 2c/GB, but ScyllaDB sends one cross-region message for inter-region replication and (at least) two cross-zone messages for intra-region replication.
Perhaps we should be offloading this. VPNs support compression. I don't know if the compression is per-packet or streaming (in TCP mode).
https://openvpn.net/community-resources/reference-manual-for-openvpn-2-4/.
ScyllaDB has O(nr_nodes * 3 * nr_service_levels) connections per shard, and the same number of incoming connections. With 100 nodes, 100 shards, we have 60,000 connections. We can afford at most 10kB per connection here.
Only if you want to make this compression the default for all possible connections in all clusters. But it's not what it's about.
Consider the following: the entire discussion about maybe using stronger compression in RPC was prompted by a user who is allegedly "bottlenecked" on networking bills, particularly cross-dc replication.
This user has 36 nodes in 6 DCs. If on every shard you only enable compression for 2 connections (the streaming connection and the data connection for a single service level) to every remote-DC node, you have 60 compressed (bidirectional) connections on each shard in the cluster.
Each shard has 8 GiB of memory. Assuming you are willing to sacrifice 1% of the node's memory for this feature, you get a budget of 1% of 8 GiB / 60 ≈ 1.33 MiB per each (bidirectional) compressed connection. Or am I missing something fundamental?
I don't think zstd is a sane default for this use case. It's not designed for that. But it might be worth trying in cases where network transfer is the bottleneck. Importantly, RPC compressor designs are "cheap" — we have compressor negotiation so there is no need to be concerned with backwards compatibility. We can stuff in as many crazy compression ideas as we want, and delete them when we please. (As long as they stay an experimental API.)
The amount of tuning this requires is very high. It's not a good solution to have to know so much in order to get something to work.
@alecco - you may want to look at this - and I'd appreciate your experience in how we can reduce the memory usage here.
zstd supports dictionaries for streaming; perhaps controlling a collection of dictionaries (e.g. by rpc type) could help. The dictionary size can be defined.
If you are suggesting that maybe we can build a dictionary per RPC type instead of per-connection, well, it can work if there is some way we can plan this dictionary in advance (e.g., see https://en.wikipedia.org/wiki/Brotli with a dictionary designed to be good for HTML). But I don't know how easy that will be... What we can't do is for the dictionary to be dynamically learned from multiple connections, because the problem is that the sender and receiver need to agree on the same dictionary. We can't even use the same dictionary for multiple connections between the same two shards, because the receiver might not receive messages in the same order that the sender sent them.
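If we did go the pre-built-dictionary route, the loading side is straightforward - both peers load the same static blob before compressing anything; where that blob comes from (e.g. offline training per RPC verb) is the open question:

```cpp
// Sketch of loading a pre-agreed static dictionary on both sides. How the
// dictionary blob is produced and distributed is left open; the point is
// only that both peers must load exactly the same bytes, in advance.
#include <zstd.h>
#include <cstddef>
#include <stdexcept>

void load_static_dictionary(ZSTD_CCtx* cctx, ZSTD_DCtx* dctx,
                            const void* dict, size_t dict_size) {
    size_t rc = ZSTD_CCtx_loadDictionary(cctx, dict, dict_size);
    if (ZSTD_isError(rc)) {
        throw std::runtime_error(ZSTD_getErrorName(rc));
    }
    rc = ZSTD_DCtx_loadDictionary(dctx, dict, dict_size);
    if (ZSTD_isError(rc)) {
        throw std::runtime_error(ZSTD_getErrorName(rc));
    }
}
```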
So with sadness I suggest that maybe we would need to fork zstandard. Especially to get rid of the ridiculous "6*windowsize" buffers that make the dictionary size almost a non-issue compared to it. It seems it already has dozens of reimplementations in different languages, so one more in C++ won't be that strange. I doubt, however, that in the current atmosphere of the project, we can do it.
What about reducing the compression level? won't that reduce memory requirements?
This user has 36 nodes in 6 DCs. If on every shard you only enable compression for 2 connections (the streaming connection and the data connection for a single service level) to every remote-DC node, you have 60 compressed (bidirectional) connections on each shard in the cluster.
Cross AZ traffic, while 50% of the cost of cross region traffic, is not free either (at least on AWS). I do agree this compromise of compressing only some of the connections is the most reasonable path.
True.
So there are 72 compressed connections in the example, not 60. But the budget per connection is bigger than 1 MiB either way.
Cross-AZ traffic is actually more expensive. Cross-AZ costs 1c per direction, so 2c/GB. Cross-region costs 2c/GB, but ScyllaDB sends one cross-region message for inter-region replication and (at least) two cross-zone messages for intra-region replication.
@avikivity Really? I thought that if there are 3 replicas in the local DC (including the coordinator), and 15 replicas in other DCs (3 per DC), then for each write the coordinator will send 2 cross-AZ messages and 15 cross-region messages. Is that not the case?
Perhaps we should be offloading this. VPNs support compression.
https://openvpn.net/community-resources/reference-manual-for-openvpn-2-4/.
It does sound like something that could/should be implemented outside the application. But adding a VPN to the picture sounds more complicated than implementing the thing ourselves, and it gives us less control. Also note that openvpn doesn't support the strong algorithms, only lz4 and lzo.
I don't know if the compression is per-packet or streaming (in TCP mode).
I see in the source code of openvpn that it doesn't keep persistent lz4 state, so it can't be truly streaming. But maybe it at least coalesces multiple TCP packets into one compressed VPN packet or something. (LZO doesn't support cross-block compression at all, I think).
The amount of tuning this requires is very high. It's not a good solution to have to know so much in order to get something to work.
I don't understand what you mean by this. Where's the tuning required?
So with sadness I suggest that maybe we would need to fork zstandard.
Don't you think that before we spend who-knows-how-much time to squeeze out some extra kilobytes per connection, it would be nice to implement a sub-optimal, experimental version of the feature and test how useful it is in practice?
Especially to get rid of the ridiculous "6*windowsize" buffers that make the dictionary size almost a non-issue compared to it.
6*blocksize, not 6*windowsize, that's an important difference. (Even if in this example windowsize and blocksize are equal).
What about reducing the compression level? won't that reduce memory requirements?
"Compression level" is just a set of defaults for various lower level parameters. This PR already sets level 1, and further shrinks the biggest tables: window size from level's 1 default 512 kiB to 64 kiB, and block size from the default 128 kiB to 64 kiB.
Especially to get rid of the ridiculous "6*windowsize" buffers that make the dictionary size almost a non-issue compared to it.
6*blocksize, not 6*windowsize, that's an important difference. (Even if in this example windowsize and blocksize are equal).
Yes, this was a typo, and in fact the reason why I called this ridiculous - the "block size" is, if I understand correctly, just the maximum chunk size - if you need to send a 1GB message you don't compress it all at once but in chunks of this "block size". But usually you compress smaller messages and the block size shouldn't be relevant, and even more importantly - you shouldn't need to keep 6*blocksize - or even 1*blocksize - between compression calls. All you should have to keep between compression calls are 1. the windowsize, plus perhaps some sort of hash table proportional to windowsize to make it fast to search in it, and 2. the Huffman tree or whatever you use for entropy coding, which again should be proportional to windowsize.
I would have liked us to try to use a small (e.g., 8KB) window size and see how much compression ratio we gain over lz4. With windowsize=8KB, even 10*windowsize storage isn't much. But I understand that the existing zstandard library simply can't do that. You can still do, by the way, one-off compression (not saving history between different messages); I don't think it will be worse than lz4, but I don't know how much better either (we can measure...).
All that being said, we can still use your PR to estimate the potential benefit of how much could be saved by perfecting the compression algorithm, and only after we see great savings, then we can decide how much we want to invest in doing this. A stream compression with 128 KB window is more-or-less the best-case-scenario. We can check how "best" it really is and how much bandwidth could be saved in practice (and how much CPU time is lost at the same time).
I would have liked us to try to use a small (e.g., 8KB) window size and see how much compression ratio we gain over lz4. With windowsize=8KB, even 10*windowsize storage isn't much. But I understand that the existing zstandard library simply can't do that. You can still do, by the way, one-off compression (not saving history between different messages); I don't think it will be worse than lz4, but I don't know how much better either (we can measure...).
It's possible to set windowsize and blocksize to 8 kiB, but it sounds like the compression wouldn't be very effective then.
Yes, this was a typo, and in fact the reason why I called this ridiculous - the "block size" is, if I understand correctly, just the maximum chunk size
But window size is usually much greater than block size. (It hardly makes sense for the window to be smaller than block size). The maximum block size is 128 kiB, the window size is on the order of megabytes by default, and has no upper limit.
All that being said, we can still use your PR to estimate the potential benefit of how much could be saved by perfecting the compression algorithm, and only after we see great savings, then we can decide how much we want to invest in doing this. A stream compression with 128 KB window is more-or-less the best-case-scenario. We can check how "best" it really is and how much bandwidth could be saved in practice (and how much CPU time is lost at the same time).
That was my only intention from the very beginning.
Really? I thought that if there are 3 replicas in the local DC (including the coordinator), and 15 replicas in other DCs (3 per DC), then for each write the coordinator will send 2 cross-AZ messages and 15 cross-region messages. Is that not the case?
No. This is not the case. A coordinator sends to one node in each DC and this node is responsible for sending to all local replicas. MUTATION_DONE messages are sent directly to the coordinator by each replica.
Perhaps we should be offloading this (to VPN)
Need to be careful to not encrypt in Scylla then :) Also, how much memory will an external VPN require for 60,000 connections?
@michoecho - I'm not sure what the next step is here?
It's possible to set windowsize and blocksize to 8 kiB, but it sounds like the compression wouldn't be very effective then.
Again, I don't understand at all why blocksize is relevant or saved across sends. But for windowsize, yes, 8KB is expected to be less effective than 128KB or 1MB but I suspect it will still be very much effective and much better than individual-packet LZ4 like we do now. Consider small requests, e.g., 1KB - even 8KB remembers eight of those. Even one previous packet can be a helpful context, eight might be almost as good as 128. And I don't know if we really need to go as low as 8KB. Maybe we can afford a little more per connection. But not 128KB, see Avi's calculations above.
But window size is usually much greater than block size. (It hardly makes sense for the window to be smaller than block size). The maximum block size is 128 kiB, the window size is on the order of megabytes by default, and has no upper limit.
Clearly the zstd example implementation had different use cases in mind than we did. It doesn't mean the algorithm is bad for our needs, but it may mean that the implementation is bad for our needs and we'll need to fork it. It's not something we can't do. It will even be fun. But it's not something we need to do just for a proof-of-concept (for that we just need to show that the compression is better, and by how much). You said this yourself so I don't need to convince you :-)
@nyh - what zstd (IMHO) gives over LZ4 is the entropy encoding, which for some workloads (JSON) will be very beneficial. So I'm looking forward to it even with the smallest window possible, to reduce memory consumption.
@michoecho - I'm not sure what the next step is here?
As far as I can see, the PR has been rejected. https://github.com/scylladb/seastar/pull/1726#issuecomment-1629182619 murders the entire idea before we even get to testing CPU costs.
@michoecho - I'm not sure what the next step is here?
As far as I can see, the PR has been rejected. #1726 (comment) murders the entire idea before we even get to testing CPU costs.
Is this still the case even if we try out (which I don't think is a good idea...) just ZSTD per packet, instead of streaming?
@michoecho - I'm not sure what the next step is here?
As far as I can see, the PR has been rejected. #1726 (comment) murders the entire idea before we even get to testing CPU costs.
I disagree. I think that the same algorithm can be used (even if the implementation will need to be modified) even with the constraint of keeping just 10KB (maybe Avi will allow you a little bit more :-)) between packets - just the sliding window itself and hopefully a bit of hash table to speed lookups up, if it's small enough... Other memory, like "block size", will need to be allocated during the compression and immediately freed. I don't see - theoretically - why this should be impossible.
So I think if you have good, promising, estimates for improved compression and small-enough extra CPU costs, then the implementation is feasible.