Internode compression test
Because of the new, more advanced internode_compression algorithms (e.g. rebuilding the compression dictionary each hour and redistributing it), we need to turn the feature on for some longevity tests and verify how it works under load at scale for longer periods (at least 1 day).
@mykaul I couldn't find details about this new algorithm, please share a link to the relevant PR/issue
cc @ShlomiBalalis
https://docs.google.com/document/d/1bYym74RXT4e6umNvS_3I5LusAA0oh9sM3FyD_xU4Ggo/edit#heading=h.dxzbd7ow0p9e https://docs.google.com/presentation/d/1UZcB4KXIJyyP6MyIpAtCx0mjhlkRiMYOfxkM1IGd28M/edit?usp=drive_link https://github.com/scylladb/scylla-enterprise/issues/3318 https://github.com/scylladb/scylla-enterprise/pull/4050
We don't have any longevity test using it (only scale and rolling upgrade tests).
So we should enable it, and also have a run without the new features, so we can compare both.
As a sanity check for the feature we selected the [longevity-100gb-4h-test](https://jenkins.scylladb.com/job/scylla-enterprise/job/longevity/job/longevity-100gb-4h-test/) case, to be executed with the following set of dict training parameters values (+ internode_compression enabled):
rpc_dict_training_min_time_seconds: 900
rpc_dict_update_period_seconds: 150
rpc_dict_training_min_bytes: 100000000
internode_compression_zstd_max_cpu_fraction: 0.05
rpc_dict_training_when: "when_leader"
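Put together, the intended scylla.yaml overrides for this sanity check would look roughly as follows. This is a sketch only: internode_compression: all is taken from the rolling-upgrade run later in this thread rather than from the sanity-check job itself.

```yaml
# Sketch of the scylla.yaml overrides for the sanity-check run.
# internode_compression: all is an assumption (it is the value used in the
# later rolling-upgrade run); the remaining values are the ones listed above.
internode_compression: all
rpc_dict_training_min_time_seconds: 900
rpc_dict_update_period_seconds: 150
rpc_dict_training_min_bytes: 100000000
internode_compression_zstd_max_cpu_fraction: 0.05
rpc_dict_training_when: "when_leader"
```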
The longevity-100gb-4h-test + dict training build (with cpu_fraction 0.05) failed on the disrupt_multiple_hard_reboot_node disruption, but results analysis shows that the dict training feature itself didn't cause the failure. The analysis summary:
- The failure is indeed unrelated to the feature. The cause of that failure was that, after a reboot of one of the nodes, the restarted node attempts to serve client queries before it sees that other nodes are UP, so it fails due to unavailability. Since this is a single-node problem, the system should survive it, because the clients should retry with other coordinators and the other coordinators should succeed. But apparently they didn't. My main suspicion in this case is that it's actually a misconfiguration of the clients, which don't retry properly. (Because this isn't the first time I've seen such a problem, and the last time I checked, the clients weren't properly configured for high availability.) But theoretically it could also be caused by a driver problem, bad metadata sent from servers to the client, or maybe the restarted node started serving earlier than expected.
- The feature was behaving as expected. However, as far as performance effects go, the workload isn't very interesting, because it sends mostly random data, which cannot be compressed.
So, as the dict training feature itself doesn't have an adverse effect during the sanity check (or at least doesn't show one immediately), we plan to execute another cycle of the longevity test for a longer period (1+ day), with two values of internode_compression_zstd_max_cpu_fraction - 0.05 and 1.0 - to evaluate the feature's stability at larger scale.
Another cycle of the longevity test was executed against enterprise:latest with the dict_training feature enabled, with the following specifics:
- test duration was increased to 24h and the test data set size was increased to 200GB; otherwise the same longevity-100gb-4h-test test configuration was used
- two runs of the test config were done, with the internode_compression_zstd_max_cpu_fraction parameter set to 0.05 and 1.0
The test failures:
- longevity-200gb-24h + internode_compression_zstd_max_cpu_fraction 1.0 test failed after 19h, on what seems to be the same as https://github.com/scylladb/scylladb/issues/18647 - queries fail with unavailability errors during rolling restart
- longevity-200gb-24h + internode_compression_zstd_max_cpu_fraction 0.05 test failed after 13h. This one is a bit weird. It's as if a db node became unresponsive, for seemingly no reason, for about a minute - not just to other nodes, but also to clients. It happened during the disrupt_run_unique_sequence disruption, somewhere in its 'grow cluster' step (but it's not the added node that became unresponsive). For now there is no explanation for this. And while we can't think of a way this could have been caused by the compression (especially since compression has nothing to do with client-server connections), this isn't something that can just be discarded as unrelated.
@fruch
It's not quite clear what to do next. Initially the plan was to run a few longevities with compression enabled and check that it doesn't destabilize anything and doesn't result in regressions.
But the thing is that the longevities themselves are not stable on enterprise:latest, e.g. the history of https://jenkins.scylladb.com/job/scylla-enterprise/job/longevity/job/longevity-100gb-4h-test/ shows that within the last 2 months we had 16 runs of the test config, of which only 4 were successful.
Should we do another cycle (or cycles) of longevity tests and perform root cause analysis of the failures, if there are any, to ensure they were not caused by dict_training?
Keeping @michoecho in the loop
Follow-up on the remaining test cases that we planned to test the feature on:
- Execute short longevities (100gb-4h) with and without dictionary training enabled. The case was repeated for 3 configs:
  - internode_compression enabled for all, no dict. training enabled
  - internode_compression is all, with dict. training enabled + max_cpu_fraction set to 0.05
  - internode_compression is all, with dict. training enabled + max_cpu_fraction set to 1.00
The primary goal of the test is to compare basic performance indicators to see if the feature, when enabled, introduces significant performance degradation.
The main observations are that there is a small decrease in the Interface RX/TX Bps metrics for the cases where dict. training is enabled: values are roughly 10% lower compared to the case when the feature is disabled. This probably indicates better compression when the dict. training algorithm is used. The CPU used metrics do not seem to differ between the case with no dict. training and the one with the feature enabled + max_cpu_fraction set to 0.05. In the case where the feature is enabled + max_cpu_fraction set to 1.00, the values of the metric are roughly 10% higher - avg 62-65% vs 70-73%. For quick reference, links to screenshots of the Overview and OS metrics Grafana dashboards can be found in the spreadsheet. For a more thorough analysis, the monitoring stack for each configuration can be restored at any time. The command to restore the stack locally in Docker, or a link to the corresponding Jenkins job to bring the stack back up on AWS, is also provided in the spreadsheet for each configuration.
- Execute a rolling upgrade to ensure that there is no cluster outage when the cluster has both kinds of nodes active - nodes upgraded with the feature enabled and not-yet-upgraded nodes. Verification of this test case is currently blocked by the issue.
@michoecho
@dimakr can you verify with the rolling upgrade again? According to https://github.com/scylladb/scylla-enterprise/issues/4214#issuecomment-2136314563 the fix should already be in.
Tested the rolling upgrade after https://github.com/scylladb/scylla-enterprise/issues/4214 was fixed. The test passed in build https://jenkins.scylladb.com/job/scylla-staging/job/dimakr/job/enterprise-rolling-upgrade-test/7/. No errors were observed in the logs around dict_training events. The dict training and internode compression parameters used were:
internode_compression: all
rpc_dict_training_min_time_seconds: 900
rpc_dict_update_period_seconds: 150
rpc_dict_training_min_bytes: 100000000
internode_compression_zstd_max_cpu_fraction: 0.05
rpc_dict_training_when: "when_leader"
@fruch @michoecho We executed the planned test cases for the feature (as was defined in https://github.com/scylladb/scylla-cluster-tests/pull/7401#discussion_r1595477541):
- a few cycles of the basic longevity-100gb-4h-test longevity test
- a few cycles of a 1-day-long version of the longevity-100gb-4h-test test (with the data set increased to 200gb)
- rolling upgrade
Upgrade is OK.
The dict. training doesn't seem to cause a cluster outage during longevities and the disruptions within them. Though the longevity tests themselves are not stable on enterprise, the few failures that we observed were not found to be caused by the feature.
The basic performance indicators (read/write latencies, ops/s, CPU used, RX/TX bps on interfaces) were compared for tests with and without dict. training enabled. No performance anomalies were observed. There are some indications that the feature, when enabled, may have a small impact on the CPU used and RX/TX bps metrics. But the observed difference is small, and it cannot really be assessed how the advanced compression algorithm impacts performance - the workloads during longevity tests are random and not a good candidate for compression. To clearly identify the gains (or cost) of the advanced compression algorithm in terms of performance, we need to execute a separate performance test with a dedicated workload (maybe some realistic profile with the help of latte).
@dimakr let's look at sct_config.py to see if we can get append_scylla_yaml values to be merged, and use test_default.yaml for those values,
and then we can try enabling it by default.
As I understood it, anyhow, the cases we have in SCT don't fully demonstrate the behavior of the compression, because the data is totally random; one would need to create specific tests with specific data and tools for that.
#7554 converted the append_scylla_yaml attribute value from a multiline string to a dict throughout SCT, so the values of the attribute can now be merged natively if it is used in several test configs during a single test run.
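To illustrate what the dict form makes possible, here is a sketch of the default/scenario split discussed below. The file names are hypothetical, and the assumption that scenario-level keys override the defaults on conflict has not been verified here.

```yaml
# Hypothetical default config (e.g. test_default.yaml) carrying the dict-training values:
append_scylla_yaml:
  rpc_dict_training_min_time_seconds: 900
  rpc_dict_update_period_seconds: 150
  rpc_dict_training_min_bytes: 100000000
  rpc_dict_training_when: "when_leader"
---
# Hypothetical scenario-specific config that actually turns the compression on:
append_scylla_yaml:
  internode_compression: all
  internode_compression_zstd_max_cpu_fraction: 0.05
---
# Expected effective value when both configs are used in a single test run
# (assuming scenario keys win over defaults on conflict):
append_scylla_yaml:
  rpc_dict_training_min_time_seconds: 900
  rpc_dict_update_period_seconds: 150
  rpc_dict_training_min_bytes: 100000000
  rpc_dict_training_when: "when_leader"
  internode_compression: all
  internode_compression_zstd_max_cpu_fraction: 0.05
```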
@michoecho We would like to enable the feature in tests, to be executed on a regular basis in SCT. Selected values for the dict. training parameters will be set in the default test configuration, which applies to all scenario-specific configurations (and can be overwritten there). There are 2 main points to agree on:
- As per the discussion in #7401, the feature won't be enabled by default in scylla due to its performance cost, and also because it is not needed for all clients.
Could you please suggest scenarios/configurations that would be priority ones for testing the feature - multi-dc scenarios, upgrade scenarios, scale scenarios, some specific disruptions, etc.?
Then we can enable internode_compression for the selected scenarios to make use of the advanced compression (or create a new configuration that covers an existing scenario, but with advanced compression enabled).
- Is the initially suggested set of values (as below) a "standard" that can be used by default in configurations where internode compression will be enabled (or is already enabled)?
rpc_dict_training_min_time_seconds: 900
rpc_dict_update_period_seconds: 150
rpc_dict_training_min_bytes: 100000000
internode_compression_zstd_max_cpu_fraction: 0.05
rpc_dict_training_when: "when_leader"
Or maybe there are some values that we are also interested in on a regular basis, or in specific configurations/scenarios? Then we will use non-default values in selected configurations.
Could you please suggest scenarios/configurations that would be priority ones for testing the feature - multi-dc scenarios, upgrade scenarios, scale scenarios, some specific disruptions, etc.?
I guess upgrades are the most interesting, because that's the easiest place for correctness bugs to happen in the future.
Also, it would be good to check that streaming scenarios, like bootstrap and repair, aren't slowed down too much.
But ideally I think it should be enabled for all tests, except for performance tests. Is there a reason not to do that?
@fruch once said (https://github.com/scylladb/scylla-cluster-tests/pull/7401#discussion_r1595473709) that he doesn't want to enable it in all tests, because he fears it might cause regressions, which left me very confused. Wouldn't that be a good thing? Don't we want to bring out the possible regressions in the tests?
Compression isn't supposed to completely change the performance characteristics of the cluster. The cost should be on the order of 5-10%, not 50%. It's not supposed to be risky to enable. If it is, it's a problem in itself, and it would be very good to hit that in tests.
Is the initially suggested set of values (as below) a "standard" that can be used by default in configurations where internode compression will be enabled (or is already enabled)?
There's no "standard", because it's all guesswork for now, but I think this set of values should be good enough. (In practice we would probably make those _seconds intervals longer just to decrease the overall noise, but not much longer). It might be reconsidered in the future, as we become wiser.
see https://github.com/scylladb/scylla-cluster-tests/issues/7998 for more discussion
It's unclear to me why this was closed - do we have an internode compression test? Enabling it across the board or not is a different issue.
@mykaul The advanced compression is enabled in 2 SCT configurations in https://github.com/scylladb/scylla-cluster-tests/pull/7923, under the dedicated task https://github.com/scylladb/scylla-cluster-tests/issues/7998.
Thanks. Let's make sure it runs on 2024.1.8 and 2024.2.0 please.