ARM: scale tests require more resources than nodes have
i3en.xlarge has 8GB per vCPU
is4gen.4xlarge has 6GB per vCPU
Our scale tests do not pass reliably on the weaker arm nodes.
FAIL test: ManyPartitionsTest.test_many_partitions (2/3 runs)
failure at 2022-11-20T03:48:19.569Z: RpkException('command /opt/redpanda/bin/rpk topic --brokers ip-172-31-43-14:9092,ip-172-31-32-247:9092,ip-172-31-36-138:9092,ip-172-31-46-34:9092,ip-172-31-36-63:9092,ip-172-31-42-204:9092,ip-172-31-43-72:9092,ip-172-31-40-33:9092,ip-172-31-47-0:9092 describe scale_000000 -p timed out')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4319#0184917c-c179-4d57-be92-563f4eb9f9c5
failure at 2022-11-21T03:21:08.085Z: RpkException('command /opt/redpanda/bin/rpk topic --brokers ip-172-31-39-61:9092,ip-172-31-44-233:9092,ip-172-31-41-134:9092,ip-172-31-43-92:9092,ip-172-31-43-95:9092,ip-172-31-33-240:9092,ip-172-31-45-166:9092,ip-172-31-44-245:9092,ip-172-31-44-142:9092 describe scale_000000 -p timed out')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4324#018496a2-e835-4a1e-b9a1-af52a6a95f3b
FAIL test: ManyPartitionsTest.test_many_partitions_compacted (2/3 runs)
failure at 2022-11-20T03:48:19.569Z: <BadLogLines nodes=ip-172-31-36-63(1) example="ERROR 2022-11-19 20:52:54,026 [shard 0] seastar - Failed to allocate 7340032 bytes">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4319#0184917c-c179-4d57-be92-563f4eb9f9c5
failure at 2022-11-21T03:21:08.085Z: RpkException('command /opt/redpanda/bin/rpk topic --brokers ip-172-31-44-245:9092,ip-172-31-43-92:9092,ip-172-31-39-61:9092,ip-172-31-43-95:9092,ip-172-31-45-166:9092,ip-172-31-44-233:9092,ip-172-31-33-240:9092,ip-172-31-41-134:9092,ip-172-31-44-142:9092 describe scale_000000 -p timed out')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4324#018496a2-e835-4a1e-b9a1-af52a6a95f3b
FAIL test: ManyClientsTest.test_many_clients (2/3 runs)
failure at 2022-11-20T03:48:19.569Z: <BadLogLines nodes=ip-172-31-42-204(1) example="ERROR 2022-11-20 01:46:40,286 [shard 1] seastar - Failed to allocate 131072 bytes">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4319#0184917c-c179-4d57-be92-563f4eb9f9c5
failure at 2022-11-21T03:21:08.085Z: <BadLogLines nodes=ip-172-31-39-61(1) example="ERROR 2022-11-21 01:46:02,937 [shard 0] seastar - Failed to allocate 131072 bytes">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4324#018496a2-e835-4a1e-b9a1-af52a6a95f3b
I tried running ManyPartitionsTest.test_many_partitions with PARTITIONS_PER_SHARD = 100 on is4gen.4xlarge, which is what the nightly runs use. In that instance it does get past the timeout creating the topics, but then fails in _single_node_restart while waiting for a restarted node to regain leaderships. The leader balancer tries to move leaderships but gets raft::errc::not_leader errors.
Another instance:
https://buildkite.com/redpanda/vtools/builds/4730#018526d6-f604-42f0-b676-a71840fdf989
FAIL test: ManyClientsTest.test_many_clients (1/2 runs)
failure at 2022-12-19T05:20:39.021Z: <BadLogLines nodes=ip-172-31-34-1(1) example="ERROR 2022-12-19 02:32:02,945 [shard 0] seastar - Failed to allocate 66432 bytes">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4730#018526d6-f604-42f0-b676-a71840fdf989
https://buildkite.com/redpanda/vtools/builds/4788#01853123-39d8-452b-b909-9c057db48f19
FAIL test: ManyPartitionsTest.test_many_partitions (1/3 runs)
failure at 2022-12-21T05:00:23.156Z: TimeoutError('')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4788#01853123-39d8-452b-b909-9c057db48f19
FAIL test: ManyPartitionsTest.test_many_partitions_compacted (1/3 runs)
failure at 2022-12-21T05:00:23.156Z: AssertionError('Unable to determine group within set number of attempts')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4788#01853123-39d8-452b-b909-9c057db48f19
I'm tempted to group all of these in here:
FAIL test: ManyClientsTest.test_many_clients (1/14 runs)
failure at 2022-12-29T04:31:38.661Z: <BadLogLines nodes=ip-172-31-36-118(1) example="ERROR 2022-12-29 02:11:31,581 [shard 1] seastar - Failed to allocate 131072 bytes">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4923#01855a56-3e54-4997-9a52-fa50cb839888
FAIL test: ManyPartitionsTest.test_many_partitions (7/14 runs)
failure at 2022-12-30T04:28:17.969Z: TimeoutError('Redpanda service ip-172-31-47-43 failed to start within 60 sec')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4928#01855f7c-4001-4ed2-90cc-e7a1398b565a
failure at 2022-12-29T04:31:38.661Z: TimeoutError('Redpanda service ip-172-31-46-161 failed to start within 60 sec')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4923#01855a56-3e54-4997-9a52-fa50cb839888
failure at 2022-12-28T04:28:12.819Z: AssertionError('Unable to determine group within set number of attempts')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4918#0185552f-c829-43a8-9d3b-b74db0421f55
failure at 2022-12-27T04:39:59.859Z: <BadLogLines nodes=ip-172-31-41-124(1),ip-172-31-45-245(1),ip-172-31-39-189(1) example="ERROR 2022-12-26 20:33:40,577 [shard 0] seastar - Failed to allocate 4870880 bytes">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4913#0185500b-bdaa-4cb0-b640-31259ecac3be
failure at 2022-12-26T04:32:21.108Z: TimeoutError('Redpanda service ip-172-31-45-39 failed to start within 60 sec')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4908#01854ae2-7717-429d-85ac-87d26f4e421a
failure at 2022-12-24T04:13:23.294Z: <BadLogLines nodes=ip-172-31-43-39(1) example="ERROR 2022-12-23 20:21:14,100 [shard 10] rpc - server.cc:119 - Error[applying protocol] remote address: 172.31.32.149:63354 - seastar::broken_promise (broken promise)">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
failure at 2022-12-25T04:20:18.252Z: TimeoutError('Redpanda service ip-172-31-37-74 failed to start within 60 sec')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4898#018545bd-095d-4b37-b549-0173b1d272c1
FAIL test: ManyPartitionsTest.test_many_partitions_compacted (7/14 runs)
failure at 2022-12-30T04:28:17.969Z: TimeoutError('')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4928#01855f7c-4001-4ed2-90cc-e7a1398b565a
failure at 2022-12-29T04:31:38.661Z: TimeoutError('')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4923#01855a56-3e54-4997-9a52-fa50cb839888
failure at 2022-12-28T04:28:12.819Z: <BadLogLines nodes=ip-172-31-37-6(1) example="ERROR 2022-12-27 21:01:46,318 [shard 0] seastar - Failed to allocate 4870880 bytes">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4918#0185552f-c829-43a8-9d3b-b74db0421f55
failure at 2022-12-27T04:39:59.859Z: <BadLogLines nodes=ip-172-31-39-189(1) example="ERROR 2022-12-26 20:52:41,574 [shard 11] rpc - server.cc:119 - Error[applying protocol] remote address: 172.31.42.51:64603 - seastar::broken_promise (broken promise)">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4913#0185500b-bdaa-4cb0-b640-31259ecac3be
failure at 2022-12-26T04:32:21.108Z: TimeoutError('')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4908#01854ae2-7717-429d-85ac-87d26f4e421a
failure at 2022-12-24T04:13:23.294Z: TimeoutError('')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
failure at 2022-12-25T04:20:18.252Z: <NodeCrash ip-172-31-46-176: Redpanda process unexpectedly stopped>
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4898#018545bd-095d-4b37-b549-0173b1d272c1
The last one, "Redpanda process unexpectedly stopped", is due to "Failed to allocate 4870880 bytes" during shutdown. The log also contains "Semaphore timed out: raft/connected":
WARN 2022-12-24 20:44:09,721 [shard 1] seastar - Exceptional future ignored: seastar::named_semaphore_timed_out (Semaphore timed out: raft/connected), backtrace: 0x4b7abe7 0x48c08fb 0x1f24867 0x496c637 0x496f49f 0x49a7a73 0x491a017 /opt/redpanda/lib/libc.so.6+0x843b7 /opt/redpanda/lib/libc.so.6+0xef2db
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::linearizable_barrier()::$_49>(seastar::gate&, raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(seastar::broken_condition_variable const&)>(raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(raft::consensus::linearizable_barrier()::$_49&&), seastar::futurize<raft::consensus::linearizable_barrier()::$_49>::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::linearizable_barrier()::$_49>(seastar::gate&, raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(seastar::broken_condition_variable const&)>(raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(raft::consensus::linearizable_barrier()::$_49&&)>(seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::linearizable_barrier()::$_49>(seastar::gate&, raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(seastar::broken_condition_variable const&)>(raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(raft::consensus::linearizable_barrier()::$_49&&)&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::linearizable_barrier()::$_49>(seastar::gate&, raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(seastar::broken_condition_variable const&)>(raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(raft::consensus::linearizable_barrier()::$_49&&)&, seastar::future_state<seastar::internal::monostate>&&), void>
@BenPope do you think we should break out the ignored exceptional future into a separate item? Presumably that will exist independently of this generic ARM resource issue?
I grouped it here as evidence of resource starvation, but yes, it should probably be addressed separately.
Got it. Just wanted to make sure we don't lose track of the ignored future, since even if we fix some resource-issue root cause for the failures, the ignored future would still exist.
Currently we are running nightly CDT tests on a 6x i3en.xlarge cluster for x86 and a 6x is4gen.4xlarge cluster for ARM. This means the ARM cluster is roughly 4-8x the size of the x86 cluster, depending on how much of a performance impact you attribute to hyperthreads counting as vCPUs on the x86 cluster.
Tests in the ManyPartitionsTest scale according to the total number of shards in the cluster. This means the test uses ~32,000 partitions on the ARM run vs ~8,000 partitions on the x86 run. Combined with the lower memory per shard on the ARM cluster, it's pretty likely this is the issue. The solution is to ensure that the ARM cluster has the same number of shards and memory per shard as the x86 cluster so that ManyPartitionsTest behaves similarly on both. I will be putting up a PR to do this soon.
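For illustration, here is a rough sketch of that scaling behaviour (a hypothetical Python calculation, not the actual test code; the node and shard counts plugged in are assumptions used only to show the ratio):

```python
# Rough illustration of why the larger ARM cluster gets ~4x the partitions:
# the test sizes its topic off the total shard count, so more/bigger nodes
# mean proportionally more partitions. Names and numbers are illustrative.
def total_partitions(nodes: int, shards_per_node: int, partitions_per_shard: int) -> int:
    return nodes * shards_per_node * partitions_per_shard

# Hypothetical per-run sizing (shard counts assumed, not measured):
arm = total_partitions(nodes=6, shards_per_node=16, partitions_per_shard=100)  # is4gen.4xlarge-ish
x86 = total_partitions(nodes=6, shards_per_node=4, partitions_per_shard=100)   # i3en.xlarge-ish
print(arm, x86, arm / x86)  # the ARM run ends up ~4x larger
```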
For the ManyClientsTest the reason for the bad_allocs isn't as simple. The test limits Redpanda's CPU cores to 2 and memory to 768MB on each node, so the cluster size difference won't change the test like it does in the ManyPartitionsTest. One potential cause for the bad_allocs is instead how the client-swarm works. The app spawns a separate thread per producer, with each thread trying to produce a fixed number of messages as fast as possible. Hence it should be producing messages a lot quicker on the ARM is4gen.4xlarge node, which is far larger than the x86 i3en.xlarge node. This could lead to the ARM cluster having to deal with higher throughput than the x86 cluster. I'm currently getting some metrics from both tests to see if this is the case. If it is, then the solution may be to modify the client-swarm to produce at fixed throughputs or to limit the application to a fixed number of cores on the large ARM cluster.
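As a conceptual sketch of the "produce at fixed throughputs" option (Python pseudocode of the general pacing idea, not the actual client-swarm code; send_one is a placeholder for whatever actually produces a message):

```python
import time

def produce_at_fixed_rate(send_one, message_count: int, messages_per_sec: float) -> None:
    """Pace a producer thread to a fixed rate so throughput no longer depends
    on how big the client node happens to be."""
    interval = 1.0 / messages_per_sec
    next_deadline = time.monotonic()
    for _ in range(message_count):
        send_one()                           # caller supplies the actual produce call
        next_deadline += interval
        delay = next_deadline - time.monotonic()
        if delay > 0:
            time.sleep(delay)                # cap the rate instead of producing as fast as possible
```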
Nice, thanks for this Brandon!
The solution is to ensure that the ARM cluster has the same number of shards and memory per shard as the x86 cluster so that ManyPartitionsTest behaves similarly on both. I will be putting up a PR to do this soon.
This will change the cluster for all scale tests, right?
To start off with, I will just restrict the cores/memory RP can use in the ManyPartitionsTest to match what is available in the x86 version. That won't affect any of the other scale tests. We should look into using a smaller node type for the CDT runs eventually, though. I imagine an is4gen.2xlarge should suffice for our tests.
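For reference, the kind of change being described looks roughly like the sketch below (hypothetical Python; the helper and setting names, and the way limits are handed to the test's Redpanda service, are assumptions rather than the real test API):

```python
# Hypothetical sketch: give the Redpanda nodes on the ARM cluster the same
# per-node budget the x86 i3en.xlarge nodes have, so ManyPartitionsTest sizes
# itself identically on both clusters.
X86_EQUIVALENT_LIMITS = {
    "num_cpus": 4,               # i3en.xlarge exposes 4 vCPUs
    "memory_mb": 4 * 8 * 1024,   # ~8GB per vCPU, per the comparison at the top
}

def restrict_redpanda_resources(redpanda_service, limits=X86_EQUIVALENT_LIMITS):
    """Apply per-node CPU/memory limits before starting the cluster (sketch only)."""
    redpanda_service.set_resource_settings(limits)  # hypothetical setter
```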
Unfortunately, even after restricting resources on the ARM cluster, the issues in the ManyPartitionsTest weren't fixed. As @jcsp noticed earlier, the leadership balancer in the ARM cluster tries to move partition leadership off nodes that aren't currently the leader far more often than the leader balancer in the x86 cluster. This is easily seen here:
[brandonallard@fedora results]$ grep -rn "failed with error: raft::errc::not_leader" latest-amd/ManyPartitionsTest/test_many_partitions | wc -l
6
[brandonallard@fedora results]$ grep -rn "failed with error: raft::errc::not_leader" latest-arm/ManyPartitionsTest/test_many_partitions | wc -l
55
The leader balancer bases its knowledge of the cluster's leadership on the partition_leaders_table, so for some reason this table is stale more often on the ARM cluster. I'm currently looking into why.
Another interesting observation is that the node that is restarted in the test is muted on the x86 balancer as expected, but is never muted on the ARM balancer.
latest-amd/ManyPartitionsTest/test_many_partitions/1/RedpandaService-0-140243031493984/ip-172-31-4-171/redpanda.log:1464:INFO 2023-01-12 04:07:02,885 [shard 0] cluster - leader_balancer.cc:493 - Leadership rebalancer muting node 9 last heartbeat 26735 ms
[brandonallard@fedora results]$ grep -rn "muting" latest-amd/ManyPartitionsTest/test_many_partitions | wc -l
50
[brandonallard@fedora results]$ grep -rn "muting" latest-arm/ManyPartitionsTest/test_many_partitions | wc -l
0
The leader balancer mutes nodes based on heartbeat information from the raft0 follower_stats, which could imply that this is more stale on ARM than on x86 as well.
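Conceptually, the mute decision works roughly like the sketch below (Python pseudocode of the idea only; the real logic is C++ in leader_balancer.cc, and the timeout value here is an assumption):

```python
# Sketch of the muting idea: a node whose last raft0 heartbeat (as recorded in
# follower_stats) is older than some timeout gets muted, and the balancer stops
# trying to move leadership involving it. The threshold is illustrative.
def nodes_to_mute(heartbeat_age_ms: dict[int, int], mute_timeout_ms: int = 20_000) -> set[int]:
    return {node for node, age in heartbeat_age_ms.items() if age > mute_timeout_ms}

# In the x86 run above, the restarted node shows a ~26.7s heartbeat age and gets
# muted; the ARM run logs no mutings at all, which points back at the freshness
# of the heartbeat information the balancer sees.
```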
As @jcsp noticed earlier, the leadership balancer in the ARM cluster tries to move partition leadership off nodes that aren't currently the leader far more often than the leader balancer in the x86 cluster.
Agree that this is the right place to look: the MaintenanceTest failure involved leader balancer strangeness too https://github.com/redpanda-data/redpanda/issues/7428
This could lead to the ARM cluster having to deal with higher throughput than the x86 cluster. I'm currently getting some metrics from both tests to see if this is the case. If it is then the solution may be to modify the client-swarm to produce at fixed throughputs or to limit the application to a fix number of cores on the large ARM cluster.
In this situation, we need to dig in and fix redpanda to handle the load: it's okay if it can't keep up, but it shouldn't crash. Since this is a producer, the Kafka memory limit semaphore should know a priori how big a message will be, and account for it: if we're bad_alloc'ing then something is going wrong with that memory management.
(the genesis of client-swarm was to reproduce crashes that a customer saw with significant client counts: this is not a hypothetical)
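A minimal sketch of that up-front accounting idea (Python pseudocode of the general pattern; Redpanda's actual Kafka-layer memory semaphore is C++ and differs in detail):

```python
import asyncio

class MemoryLimitSemaphore:
    """Sketch: reserve the request's size (known up-front from its header)
    before buffering/parsing it, and release on completion, so overload turns
    into backpressure rather than a bad_alloc."""

    def __init__(self, limit_bytes: int):
        self._available = limit_bytes
        self._cond = asyncio.Condition()

    async def reserve(self, nbytes: int) -> None:
        # A real implementation also needs to handle requests larger than the limit.
        async with self._cond:
            await self._cond.wait_for(lambda: self._available >= nbytes)
            self._available -= nbytes

    async def release(self, nbytes: int) -> None:
        async with self._cond:
            self._available += nbytes
            self._cond.notify_all()

async def handle_produce(sem: MemoryLimitSemaphore, request_size_bytes: int, process):
    await sem.reserve(request_size_bytes)   # block (backpressure) instead of crashing
    try:
        return await process()
    finally:
        await sem.release(request_size_bytes)
```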
A quick update: @mmaslankaprv came up with an explanation as to why the controller in the ARM tests seems to have stale information. The ARM tests had a larger number of in-flight requests than the x86 tests, about 1,500 on ARM vs ~50 on x86. This could explain why the controller is slow to update.
As to why there are so many more in-flight requests: I've noticed that the background traffic that runs during leadership balancing is 10x higher on the ARM cluster than on the x86 cluster, even after resources are properly limited on the RP nodes. About 420MB/s on ARM and 42MB/s on x86.
[brandonallard@fedora results]$ grep -rn "approx bandwidth" .
./latest-arm/ManyPartitionsTest/test_many_partitions/1/test_log.info:153:[INFO - 2023-01-12 03:34:25,333 - many_partitions_test - progress_check - lineno:876]: Wait complete, approx bandwidth 423.66427656465044MB/s
./latest-arm/ManyPartitionsTest/test_many_partitions/1/test_log.debug:25670:[INFO - 2023-01-12 03:34:25,333 - many_partitions_test - progress_check - lineno:876]: Wait complete, approx bandwidth 423.66427656465044MB/s
./latest-amd/ManyPartitionsTest/test_many_partitions/1/test_log.debug:14992:[INFO - 2023-01-12 04:06:36,354 - many_partitions_test - progress_check - lineno:876]: Wait complete, approx bandwidth 46.84763494224079MB/s
./latest-amd/ManyPartitionsTest/test_many_partitions/1/test_log.debug:187564:[INFO - 2023-01-12 04:09:20,430 - many_partitions_test - progress_check - lineno:876]: Wait complete, approx bandwidth 42.801016203127205MB/s
This is most likely due to the fact that the kgo-repeater is running on a much larger node in the ARM tests than in the x86 tests.
I see multiple issues that need chasing/fixing here:
the leadership balancer in the ARM cluster tries to move partition leadership off nodes that aren't currently the leader far more often than the leader balancer in the x86 cluster.
and:
In this situation, we need to dig in and fix redpanda to handle the load: it's okay if it can't keep up, but it shouldn't crash.
and finally to fix the test itself:
As to why there are so many more in-flight requests: I've noticed that the background traffic that runs during leadership balancing is 10x higher on the ARM cluster than on the x86 cluster, even after resources are properly limited on the RP nodes. About 420MB/s on ARM and 42MB/s on x86.
The failures for both the ManyPartitionsTest and ManyClientsTest don't appear to be arm64-specific, but rather issues with the arm64 clusters being 4-8x the size of the amd64 clusters we run the tests on. In both cases I allocated an amd64 cluster sized similarly to the arm64 one, and in both cases the tests failed in the same way on the amd64 cluster as they did on the arm64 cluster.
Running the ManyPartitionsTest on an i4i.4xlarge cluster with the memory restricted to match what's available on the is4gen.4xlarge cluster results in identical bad_allocs. So this is not an arm-specific issue; it's just occurring on the arm cluster since it has half the memory an amd64 cluster would have. However, with ~5GB per core we don't expect this issue to occur. I will be opening a separate issue for the investigation into why these bad_allocs are occurring.
Running the ManyClientsTest on an i4i.4xlarge cluster fails in the same way it does on an is4gen.4xlarge cluster, so this isn't an ARM-specific issue either. Rather, it appears that the RP cluster (3 nodes with 2 CPUs and 768MB of memory in both cases) can't handle the increased traffic client-swarm produces as a result of being allocated on a larger node. I will be opening a separate issue for this as well.
but rather issues with the arm64 clusters being 4-8x the size of the amd64
to clarify, @ballard26, you mean 4-8x smaller?
The arm64 clusters are 4-8x larger than the amd64 clusters. The amd64 cluster is 4x smaller than the arm64 cluster in terms of pure core count. However, since the core count on amd64 clusters includes hyperthreads, it could be up to 8x smaller depending on how much you consider two hyperthreads on the same core to perform like two distinct cores.
@ballard26 ok so is it fair to then say that the Mem/Core ratio on ARM is 4-8x smaller compared to x86?
So on AWS, among the storage-optimized instance types, the is4gen ARM instances we use always have 2GB less memory per core than the i4i and i3en instance types. E.g., i4i.xlarge has 4 vCPUs and 32GiB of memory, i3en.xlarge has 4 vCPUs and 32GiB, and the ARM instance is4gen.xlarge has 4 vCPUs and 24GiB. Basically 8GB per core on x86 and 6GB per core on ARM for any instance larger than large.
The 4-8x was in reference to the CPU count of the clusters we run the CDT nightly on: a 12x i3en.xlarge cluster for the x86 CDT nightly and a 12x is4gen.4xlarge cluster for the ARM CDT nightly. One piece of follow-up work should be to reduce the size of the ARM cluster we use. Sorry for the ambiguity.
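A quick sanity check of those per-core ratios, using the instance numbers quoted above:

```python
# Memory per vCPU for the instance types quoted above: (vCPUs, GiB of memory).
instances = {
    "i4i.xlarge":    (4, 32),
    "i3en.xlarge":   (4, 32),
    "is4gen.xlarge": (4, 24),
}
for name, (vcpus, mem_gib) in instances.items():
    print(f"{name}: {mem_gib / vcpus:.0f} GiB per vCPU")
# i4i.xlarge: 8, i3en.xlarge: 8, is4gen.xlarge: 6
```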
Could this failure be in the same family? https://buildkite.com/redpanda/vtools/builds/5396#0185db17-163b-49ec-ace4-b4238606be02
FAIL test: ManyPartitionsTest.test_many_partitions_compacted (1/3 runs)
failure at 2023-01-23T04:13:45.139Z: TimeoutError('')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5396#0185db17-163b-49ec-ace4-b4238606be02
https://buildkite.com/redpanda/vtools/builds/5484#0185ea8f-6d91-4f53-94ca-348ecb773302
FAIL test: ManyClientsTest.test_many_clients (1/2 runs)
failure at 2023-01-26T03:42:54.398Z: <BadLogLines nodes=ip-172-31-14-215(1) example="ERROR 2023-01-26 01:01:45,070 [shard 0] seastar - Failed to allocate 131072 bytes">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5484#0185ea8f-6d91-4f53-94ca-348ecb773302
https://buildkite.com/redpanda/vtools/builds/5500#0185efb8-35af-4ec9-8e11-2d6d69c3d310
FAIL test: ManyClientsTest.test_many_clients (1/2 runs)
failure at 2023-01-27T03:54:39.906Z: <BadLogLines nodes=ip-172-31-7-211(1) example="ERROR 2023-01-27 01:20:03,830 [shard 1] seastar - Failed to allocate 66432 bytes">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5500#0185efb8-35af-4ec9-8e11-2d6d69c3d310
I've also seen this fail in my Azure CDT runs fairly reliably (same failure mode). I'm using Standard_L8s_v3 nodes for Redpanda and Standard_D4ds_v4 for the client.
The AssertionError('Unable to determine group within set number of attempts') is happening in both amd64 and arm64 in CDT:
FAIL test: ManyPartitionsTest.test_many_partitions (6/9 runs)
- on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/6553#0186b5f4-545f-4223-8333-86f48d099e26
- on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/6546#0186b0cf-697b-4d10-8ed5-9d7e5388f2a6
- on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/6548#0186b35e-b995-4a2c-94bd-44127dc7b7e9
- on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/6543#0186ae3a-4642-40c5-be9c-72d77ecbdfed
- on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/6539#0186aba8-f9da-4c97-b205-4ba3a68f3c9f
- on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/6522#0186a917-0e61-42b6-b4cd-a847cf4d283a
FAIL test: ManyPartitionsTest.test_many_partitions_compacted (7/9 runs)
- on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/6553#0186b5f4-545f-4223-8333-86f48d099e26
- on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/6546#0186b0cf-697b-4d10-8ed5-9d7e5388f2a6
- on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/6548#0186b35e-b995-4a2c-94bd-44127dc7b7e9
- on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/6543#0186ae3a-4642-40c5-be9c-72d77ecbdfed
- on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/6539#0186aba8-f9da-4c97-b205-4ba3a68f3c9f
- on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/6522#0186a917-0e61-42b6-b4cd-a847cf4d283a
Some more:
FAIL test: ManyPartitionsTest.test_many_partitions (1/2 runs)
failure at 2023-03-07T02:19:15.819Z: AssertionError('Unable to determine group within set number of attempts')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/6564#0186b888-c751-425d-b740-66b8de3fee24
FAIL test: ManyPartitionsTest.test_many_partitions_compacted (1/2 runs)
failure at 2023-03-07T02:19:15.819Z: AssertionError('Unable to determine group within set number of attempts')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/6564#0186b888-c751-425d-b740-66b8de3fee24
Those most recent reports were caused by the issue fixed by https://github.com/redpanda-data/redpanda/pull/9257
ARM tests are okay now - green run from last night here https://buildkite.com/redpanda/vtools/builds/6732#0186e6ee-35c5-4ca7-a43b-6a2c8eb474ce