Crash in `FranzGoVerifiableWithSiTest.test_si_without_timeboxed.segment_size=104857600`
FAIL test: FranzGoVerifiableWithSiTest.test_si_without_timeboxed.segment_size=104857600 (1/1 runs)
failure at 2022-07-31T11:04:20.402Z: NodeCrash([(<ducktape.cluster.cluster.ClusterNode object at 0x7f40db68a110>, "ERROR 2022-07-31 08:16:28,783 [shard 3] assert - Assert failure: (../../../src/v/storage/segment_reader.cc:215) '_parent == nullptr' Must close before destroying\n"), (<ducktape.cluster.cluster.ClusterNode object at 0x7f40db68b490>, "redpanda: /v/build/v_deps_build/seastar-prefix/src/seastar/include/seastar/core/future.hh:641: void seastar::future_state<seastar::internal::monostate>::set(A &&...) [T = seastar::internal::monostate, A = <>]: Assertion `_u.st == state::future' failed.\n")])
https://buildkite.com/redpanda/vtools/builds/3073#018252da-dd82-4dac-bad9-ea37369ded6e
Checked if this was related to #5613 and no: this one is a storage assertion, that one is a segfault.
If I had to guess, my suspicion would be that this is something in the SI cache changes, tearing down readers uncleanly on eviction perhaps.
There are a couple of issues I can see (17 GB log file so still analysing):
- there are bad allocs right before the assertion failure
DEBUG 2022-07-31 08:16:28,668 [shard 3] cloud_storage - [fiber4134~1~0|1|29999ms] - remote.cc:393 - Uploading segment to path {"992bbf98/kafka/topic-xucrhcigfy/65_16/105-1-v1.log.2"}
DEBUG 2022-07-31 08:16:28,668 [shard 3] cloud_storage - [fiber4134~3~0|1|29999ms] - remote.cc:393 - Uploading segment to path {"668a0054/kafka/topic-xucrhcigfy/65_16/110-1-v1.log.2"}
INFO 2022-07-31 08:16:28,669 [shard 3] cluster - ntp: {kafka/topic-xucrhcigfy/93} - archival_metadata_stm.cc:405 - creating snapshot at offset: 188, remote start_offset: 0, last_offset: 106
ERROR 2022-07-31 08:16:28,671 [shard 3] cloud_storage - [fiber4134~1~0|1|29996ms] - remote.cc:140 - Unexpected error std::bad_alloc (std::bad_alloc)
WARN 2022-07-31 08:16:28,671 [shard 3] cloud_storage - [fiber4134~1~0|1|29996ms] - remote.cc:456 - Uploading segment {"992bbf98/kafka/topic-xucrhcigfy/65_16/105-1-v1.log.2"} to {panda-bucket-c7ecfdc4-109d-11ed-8c86-57a9d452750b}, {failed}, segment not uploaded
ERROR 2022-07-31 08:16:28,674 [shard 1] cloud_storage - [fiber3360~49~0|1|29980ms] - remote.cc:140 - Unexpected error std::bad_alloc (std::bad_alloc)
WARN 2022-07-31 08:16:28,674 [shard 1] cloud_storage - [fiber3360~49~0|1|29980ms] - remote.cc:456 - Uploading segment {"d2dcc8b9/kafka/topic-xucrhcigfy/23_16/212-1-v1.log.2"} to {panda-bucket-c7ecfdc4-109d-11ed-8c86-57a9d452750b}, {failed}, segment not uploaded
ERROR 2022-07-31 08:16:28,678 [shard 3] cloud_storage - [fiber4134~3~0|1|29990ms] - remote.cc:140 - Unexpected error std::bad_alloc (std::bad_alloc)
WARN 2022-07-31 08:16:28,678 [shard 3] cloud_storage - [fiber4134~3~0|1|29990ms] - remote.cc:456 - Uploading segment {"668a0054/kafka/topic-xucrhcigfy/65_16/110-1-v1.log.2"} to {panda-bucket-c7ecfdc4-109d-11ed-8c86-57a9d452750b}, {failed}, segment not uploaded
ERROR 2022-07-31 08:16:28,678 [shard 3] archival - [fiber4134 kafka/topic-xucrhcigfy/65] - ntp_archiver_service.cc:148 - Failed to upload 2 segments out of 2
WARN 2022-07-31 08:16:28,695 [shard 2] kafka - connection_context.cc:365 - Error processing request: std::bad_alloc (std::bad_alloc)
DEBUG 2022-07-31 08:16:28,712 [shard 2] cloud_storage - [fiber19 kafka/topic-xucrhcigfy/40] - remote_partition.cc:394 - collecting stale materialized segments, 1 segments materialized, 1 segments total
DEBUG 2022-07-31 08:16:28,712 [shard 2] cloud_storage - [fiber19 kafka/topic-xucrhcigfy/40] - remote_partition.cc:432 - found 0 eviction candidates
ERROR 2022-07-31 08:16:28,739 [shard 1] cloud_storage - [fiber3360~47~0|1|29915ms] - remote.cc:140 - Unexpected error std::bad_alloc (std::bad_alloc)
WARN 2022-07-31 08:16:28,739 [shard 1] cloud_storage - [fiber3360~47~0|1|29915ms] - remote.cc:456 - Uploading segment {"404c6d6c/kafka/topic-xucrhcigfy/23_16/109-1-v1.log.2"} to {panda-bucket-c7ecfdc4-109d-11ed-8c86-57a9d452750b}, {failed}, segment not uploaded
- there is most likely a manifestation of another bug related to leadership query during segment upload: #5755
INFO 2022-07-31 08:16:28,343 [shard 0] raft - [group_id:75, {kafka/topic-xucrhcigfy/74}] vote_stm.cc:255 - becoming the leader term:2
WARN 2022-07-31 08:16:28,764 [shard 0] cloud_storage - [fiber4323~1~0|1|30000ms] - remote.cc:386 - lost leadership or term changed during upload, current leadership status: false, current term: 2, original term: 2: cancelled uploading {"ba5200e3/kafka/topic-xucrhcigfy/74_16/108-1-v1.log.2"} to {panda-bucket-c7ecfdc4-109d-11ed-8c86-57a9d452750b}
Here the node has won election leadership for partition kafka/topic-xucrhcigfy/74, which enables it to start ntp archival, but it does not have strong leadership, which causes the segment upload to be cancelled. This may not be the reason for the crash, however; a rough sketch of the leadership/term check is included below.
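As a rough illustration of the distinction above, here is a minimal sketch with illustrative names only (this is not redpanda's actual raft/archival API): the upload path records the term when the upload starts and re-checks both leadership and term before completing, which is what produces the "lost leadership or term changed during upload" warning.

```cpp
#include <cstdint>

// Illustrative stand-in for the partition/raft interface; not redpanda's API.
struct partition_stub {
    bool is_leader = false; // do we currently hold raft leadership?
    int64_t term = 0;       // current raft term
};

// Holding election leadership is enough to *start* archival, but completing an
// upload additionally requires that leadership is still held and that the term
// has not changed since the upload began; otherwise the upload is cancelled.
bool may_complete_upload(const partition_stub& p, int64_t original_term) {
    return p.is_leader && p.term == original_term;
}
```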
On further review, bad allocs appear throughout the log during segment uploads.
Regarding the error string in the OP (newlines mine):
failure at 2022-07-31T11:04:20.402Z:
NodeCrash([
(<ducktape.cluster.cluster.ClusterNode object at 0x7f40db68a110>,
"ERROR 2022-07-31 08:16:28,783 [shard 3] assert - Assert failure: (../../../src/v/storage/segment_reader.cc:215) '_parent == nullptr' Must close before destroying\n"),
(<ducktape.cluster.cluster.ClusterNode object at 0x7f40db68b490>,
"redpanda: /v/build/v_deps_build/seastar-prefix/src/seastar/include/seastar/core/future.hh:641: void seastar::future_state<seastar::internal::monostate>::set(A &&...) [T = seastar::internal::monostate, A = <>]: Assertion `_u.st == state::future' failed.\n")
])
Am I reading this right that two nodes failed, one with the segment_reader.cc:215 assert but then one with a future.hh:641 assert (in the seastar future internals)?
> Am I reading this right that two nodes failed, one with the segment_reader.cc:215 assert but then one with a future.hh:641 assert (in the seastar future internals)?
Yes. I didn't look closely at the second one because we're in the "all bets are off" mode of bad_alloc exception paths: I've seen that kind of future error before when we're not cleanly shutting down things like input streams.
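For context, here is a minimal sketch of the clean-shutdown pattern referred to above, assuming seastar's input_stream API; both the segment_reader "Must close before destroying" assert and the future-state assert point at objects being torn down with work still outstanding.

```cpp
#include <exception>

#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/iostream.hh>

// Streams must be close()d before destruction, even on the exception path;
// otherwise buffered I/O continuations can still be in flight when the
// object's state is destroyed.
seastar::future<> read_and_close(seastar::input_stream<char> in) {
    std::exception_ptr err;
    try {
        auto buf = co_await in.read(); // ... consume the stream ...
        (void)buf;
    } catch (...) {
        err = std::current_exception();
    }
    co_await in.close(); // always close before the stream goes away
    if (err) {
        std::rethrow_exception(err);
    }
}
```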
Probably a recurrence:
FAIL test: FranzGoVerifiableWithSiTest.test_si_without_timeboxed.segment_size=104857600 (1/1 runs) failure at 2022-08-04T08:40:43.589Z: NodeCrash([(<ducktape.cluster.cluster.ClusterNode object at 0x7f7ec932aa70>, "ERROR 2022-08-04 06:34:41,875 [shard 0] assert - Assert failure: (../../../src/v/storage/segment_reader.cc:215) '_parent == nullptr' Must close before destroying\n")]) in job https://buildkite.com/redpanda/vtools/builds/3104#018266cf-8592-4261-83db-4645611ed820
Seastar's abort on bad alloc showed a very large number of 128K entries in its report (this core had 7 GB of memory; the 128K row alone accounts for 51k spans and 6384M of used memory):
Page spans:
index size free used spans
0 4K 252K 26M 7k
1 8K 72K 3M 394
2 16K 32K 6M 365
3 32K 96K 13M 421
4 64K 32M 747M 12k
5 128K 0B 6384M 51k
6 256K 0B 512K 2
7 512K 0B 0B 0
8 1M 0B 0B 0
9 2M 0B 0B 0
10 4M 0B 0B 0
11 8M 0B 0B 0
12 16M 0B 16M 1
13 32M 0B 0B 0
14 64M 0B 0B 0
15 128M 0B 128M 1
16 256M 0B 0B 0
17 512M 0B 0B 0
18 1G 0B 0B 0
19 2G 0B 0B 0
20 4G 0B 0B 0
21 8G 0B 0B 0
22 16G 0B 0B 0
23 32G 0B 0B 0
24 64G 0B 0B 0
25 128G 0B 0B 0
26 256G 0B 0B 0
27 512G 0B 0B 0
28 1T 0B 0B 0
29 2T 0B 0B 0
30 4T 0B 0B 0
31 8T 0B 0B 0
Changing the write_behind value here https://github.com/redpanda-data/redpanda/blob/a31150a885c6b4598d6bfcead939b4162f20ab3a/src/v/cloud_storage/cache_service.h#L70 to use default_writebehind, and setting that value to 1 instead of 10 (so 1 instead of the current 128k), brings the CPU and memory usage way down; it represents the number of buffers to write in parallel.
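For reference, a minimal sketch of what that knob controls, assuming seastar's file_output_stream_options (the actual cache_service call site may differ): write_behind is the number of buffers kept in flight per output stream, so worst-case buffered memory is roughly buffer_size * write_behind - about 1.25 MiB per stream at a count of 10, versus ~16 GiB at 131072.

```cpp
#include <seastar/core/file.hh>
#include <seastar/core/fstream.hh>
#include <seastar/core/future.hh>

// Sketch only, not the actual cache_service code.
seastar::future<seastar::output_stream<char>>
open_cache_file_for_write(seastar::file f) {
    seastar::file_output_stream_options options;
    options.buffer_size = 128 * 1024; // 128 KiB per write buffer
    options.write_behind = 10;        // buffers written in parallel per stream
    return seastar::make_file_output_stream(std::move(f), options);
}
```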
I did not see bad allocs with that change, but saw a (perhaps 503-related) error for which I need to raise another ticket - these are bad log lines, not a crash:
WARN 2022-08-05 14:05:03,185 [shard 0] s3 - client.cc:513 - S3 replied with error: HTTP/1.1 503 Service Unavailable
Date: Fri, 05 Aug 22 14:05:03 GMT
Connection: close
x-amz-id-2: VExG3RA6cxTPs8bNKGzyTJyEqmI+N2mt3kCbLUWWmkObzwJqIyhROH98g38ZjeReJFsaenBZJ0d6OGJvTYKRQ5qtJYOqqQZE+VjRK/8QSIdvJ5vTn2gbaw==
x-amz-request-id: B1C5405E3AF8F0BB
Content-Type: application/xml
Server: AmazonS3
Content-Length: 0
ERROR 2022-08-05 14:05:03,185 [shard 0] s3 - client.cc:417 - !!error parse error boost::wrapexcept<boost::bad_lexical_cast> (bad lexical cast: source type value could not be interpreted as target)
ERROR 2022-08-05 14:05:03,186 [shard 0] cloud_storage - [fiber9~2~74~0|1|59992ms] - remote.cc:140 - Unexpected error boost::wrapexcept<boost::bad_lexical_cast> (bad lexical cast: source type value could not be interpreted as target)
WARN 2022-08-05 14:05:03,186 [shard 0] cloud_storage - [fiber9~2~74~0|1|59992ms] - remote.cc:250 - Failed downloading manifest from {panda-bucket-4f765872-14c6-11ed-986b-ff21a2d1a6da} {failed}, manifest at {"6501bc4c/kafka/topic-sryjfwfzkt/82_16/0-1-v1.log.1.tx"}
Still testing with more runs on cdt.
I made some other changes to the build (reduce size of chunk encoder for http etc) which are probably not related to this issue. Need to also revert those changes and do a clean test on CDT.
Tested with just the write_behind count fixed, and still got a bad alloc. The write-behind count is 10 and it uses a buffer size of 128 KB. Will try with a smaller count like 1 and the same buffer size.
Page spans:
index size free used spans
0 4K 764K 17M 5k
1 8K 112K 2M 221
2 16K 752K 3M 254
3 32K 512K 7M 232
4 64K 17M 743M 12k
5 128K 0B 6421M 51k
6 256K 0B 512K 2
7 512K 0B 0B 0
8 1M 0B 0B 0
9 2M 0B 0B 0
10 4M 0B 0B 0
11 8M 0B 0B 0
12 16M 0B 16M 1
13 32M 0B 0B 0
14 64M 0B 0B 0
15 128M 0B 128M 1
16 256M 0B 0B 0
17 512M 0B 0B 0
18 1G 0B 0B 0
19 2G 0B 0B 0
20 4G 0B 0B 0
21 8G 0B 0B 0
22 16G 0B 0B 0
23 32G 0B 0B 0
24 64G 0B 0B 0
25 128G 0B 0B 0
26 256G 0B 0B 0
27 512G 0B 0B 0
28 1T 0B 0B 0
29 2T 0B 0B 0
30 4T 0B 0B 0
31 8T 0B 0B 0
ERROR 2022-08-05 15:55:14,101 [shard 3] seastar - Failed to allocate 131072 bytes
Aborting on shard 3.
Backtrace:
0x475b3b6
0x47be282
0x294b4040f41f
/opt/redpanda/lib/libc.so.6+0x4300a
/opt/redpanda/lib/libc.so.6+0x22858
0x46bf50d
0x46cd6a5
0x465655d
0x47787bf
0x477c497
0x47bf965
0x471934f
/opt/redpanda/lib/libpthread.so.0+0x8608
/opt/redpanda/lib/libc.so.6+0x11f132
Not seeing any memory/allocation related errors on write behind (parallel writes) set to 1. Memory usage is stable throughout the test at around 65-70% out of 32 GB on the node.
However with this set of values the test run is really slow and even after running for hours the sequential consumer was not able to consume the full 20k messages (although it made constant, very slow progress). Need to investigate the slowness.
It looks like the write-behind value for the cache service needs to be configurable, as @Lazin suggested.
Excellent investigation @abhijat
Is this issue something that needs to be resolved for the release? or is this strictly for scale testing?
> Is this issue something that needs to be resolved for the release? or is this strictly for scale testing?
I think we should fix the writebehind value for the cache service from 131072 to the original intended value of 10 before release. I will raise a PR for it soon.
Making it part of adjustable config can be done in a future PR perhaps.
I ran the test overnight and it failed because the random consumer ran into timeout after consuming around 15k of the 20k expected messages.
The broker logs did not show any bad allocs, only some errors related to 503 from amazon which were recovered, no crashes.
Will try once again with a consumer timeout at a very high value to check if it finishes.
Tried several more tests with varying parameters; any value >= 5 for the default_writebehind parameter results in bad allocs in my testing, but there are a few things not clear yet:
- the writebehind value bug is very old, at least from 2021, so it should have consistently affected the test, but it did not until recently. Why this new set of failures?
- the segment size in the failing test is 100 * 2**20 bytes and the cloud storage cache size is 5 * 2**20 bytes. This seems to result in the cache continuously evicting segments as the random consumer triggers segment downloads. Is the cache-size-to-segment-size ratio realistic? What is a good cache size for testing? Assume it would be some multiple of the segment size. Even so, why did this test pass earlier even when the cache size is unrealistically small? pandaresults shows a few failures in the last 4 months but a majority pass, which hints at some recent regression.
- the 128 KiB cache service write buffer doesn't seem to cause the bulk of the 51k objects when the shard runs out of memory; I switched it to 256 KiB and when the shard aborted on memory exhaustion the 51k objects were still 128 KiB in size, not 256 KiB.
With increased cache size (around 500MB) the test passes very quickly.
Still have not been able to capture a flame graph of memory usage from the CDT node; gdb complains about missing thread information / a libthread-db mismatch.
Currently trying to get the redpanda process to dump core on abort, as well as attaching a local gdb to a gdbserver running on the remote node, since the flamegraph script works locally and the libpthread on the remote is stripped (which apparently can prevent gdb from working with multithreaded applications).
Maybe this is caused by the index encoder. SI index files are compressed and are below 1 KiB most of the time, but the encoder/decoder are both using iobuf to store data under the hood. And it looks like the max allocation size is 128 KiB https://github.com/redpanda-data/redpanda/blob/9949ee880eeb5814ad01fd667d1269303d82ccc6/src/v/bytes/iobuf.h#L217 so maybe, when the segment index is materialized from disk, it somehow allocates a 128K chunk.
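A generic illustration of that suspicion (this is not iobuf's actual implementation): if the buffer backing the index always reserves the maximum fragment size, a sub-1 KiB compressed index still pins a full 128 KiB allocation, which would line up with the 51k x 128 KiB spans in the allocator reports above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical buffer policy, for illustration only.
constexpr std::size_t max_fragment_size = 128 * 1024; // 128 KiB

std::vector<char> materialize_index(std::size_t payload_bytes) {
    std::vector<char> buf;
    buf.reserve(max_fragment_size); // full 128 KiB reserved even for ~1 KiB of data
    buf.resize(payload_bytes);
    return buf;
}
```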
Maybe we are not able to decrement the cache size here https://github.com/redpanda-data/redpanda/blob/dev/src/v/cloud_storage/cache_service.cc#L161 due to shadowing - is the cache size variable that gets updated a local variable? It seems to be the only place where the cache size is decremented.
So possibly we are always trying to clean up? But if this variable is not shadowing the instance field then this should be okay.
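A minimal sketch of the suspected shadowing pattern, with hypothetical names (the actual cache_service code is not reproduced here): if the cleanup path decrements a local copy instead of the member, the tracked size never shrinks and every put keeps triggering cleanup.

```cpp
#include <cstdint>

struct cache_sketch {
    uint64_t _current_cache_size = 0;

    // Buggy variant: the decrement lands on a local copy, so the member never
    // shrinks after files are deleted from the cache.
    void on_file_deleted_buggy(uint64_t file_size) {
        uint64_t current_cache_size = _current_cache_size; // local copy
        current_cache_size -= file_size;                    // member unchanged
        (void)current_cache_size;
    }

    // Fixed variant: decrement the member itself.
    void on_file_deleted_fixed(uint64_t file_size) {
        _current_cache_size -= file_size;
    }
};
```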
Ran a build on CDT nodes which replaces the new per-cache-put eviction with the old 30-second-based eviction logic, and it consistently passes the test within 4 minutes.
The cache size was set to 5 MB with the segment size at 100 MB for this test. Also saw the cache grow to much more than the allowed limit several times during the test:
ubuntu@ip-172-31-20-3:~/tests/results/2022-08-08--016/FranzGoVerifiableWithSiTest/test_si_without_timeboxed/segment_size=104857600/1/RedpandaService-0-140178880656048$ grep -ERi _cache_size.*,.*max * | cut -d, -f2 | awk -F: '{print $NF}' | sort -nu
0
776
808
880
902
910
966
22937600
23987086
108011804
179020539
274333696
368802304
456113390
526057196
1751682050
1794195622
2546280062
2859795735
4098213719
ip-172-31-41-175/redpanda.log:DEBUG 2022-08-08 12:30:47,128 [shard 0] cloud_storage - cache_service.cc:167 - cache::clean_up_cache: _current_cache_size: 4098213719, _max_cache_size: 5242880, candidates_for_deletion: 84
So it seems that with the periodic cleanup the consumers in the test were able to read during the windows between cleanups, because the cache was allowed to grow to much more than 5 MB.
With the stricter eviction we have now, this no longer happens, which causes the test to fail: it creates a cycle of download -> cache::put -> evict the only segment in the cache.
This still doesn't explain what causes the constantly increasing memory during the cache churn. We should adjust the test to the new eviction system and continue to investigate memory usage during the pathological case (small cache + larger segment size).
Additionally we can add validation to avoid cache size smaller than segment size https://github.com/redpanda-data/redpanda/issues/5896
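A sketch of the kind of validation meant here, using hypothetical names (the actual change is tracked in the issue above): reject configurations where the cache cannot hold even one segment, since that guarantees the download -> put -> evict churn described above.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

void validate_cloud_cache_config(
  uint64_t cache_size_bytes, uint64_t segment_size_bytes) {
    if (cache_size_bytes < segment_size_bytes) {
        throw std::invalid_argument(
          "cloud storage cache size (" + std::to_string(cache_size_bytes)
          + " bytes) must be at least one segment ("
          + std::to_string(segment_size_bytes) + " bytes)");
    }
}
```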
Probably won't help at this point, but here's another recent occurrence: https://buildkite.com/redpanda/vtools/builds/3133#01827642-8678-41b8-a243-c532aa5f80ce.
I can reproduce this locally by setting a small cache size, e.g. 5 MB, and simply running kgo-verifier like this:
(main) $ ./kgo-verifier --brokers localhost:9092 --topic $TOPIC --msg_size 128000 --produce_msgs 0 --rand_read_msgs 50 --seq_read=0 --parallel 64
I suspected that the number of read requests was piling up as more and more requests came in, so I added a counter in kafka/server/handlers/fetch.cc (requests_in_progress is an atomic uint):
```cpp
cluster_pm.requests_in_progress++;
vlog(
  klog.warn, "READ QUEUED:: reads in progress are: {}", cluster_pm.requests_in_progress);
auto read_f = co_await read_from_partition(
  std::move(*kafka_partition), ntp_config.cfg, foreign_read, deadline);
cluster_pm.requests_in_progress--;
vlog(
  klog.warn, "READ FIN:: reads in progress are: {}", cluster_pm.requests_in_progress);
co_return read_f;
```
When running kgo-verifier, the number of in-flight read requests keeps steadily increasing, until at one point we start running out of FDs:
WARN 2022-08-11 22:27:22,227 [shard 1] seastar - Exceptional future ignored: std::__1::__fs::filesystem::filesystem_error (error system:24, filesystem error: open failed: Too many open files ["/home/abhijat/redpanda-data/cloud_storage_cache"]), backtrace: 0x3438e99a 0x4655dc54 0x4
This doesn't show up on CDT though (the local ulimit is just low, at 1024), but it does show that we open up a lot of files.
abhijat@terminus-est ~/dev/redpanda
(dev) $ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 126331
max locked memory (kbytes, -l) 65536
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 126331
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
When queue depth control is enabled via kafka_qdc_enable: true in the cluster config, the number of in-flight requests stays close to 120.
With the number of parallel reads set to 8, 50 reads in total, and QDC enabled, redpanda is able to finish the requests within a short period of time even with a small cache like 5 MB. For something like 200 reads / 8 parallel it becomes really slow and takes nearly 10 minutes to finish; it does, however, finish without running into bad allocs with QDC enabled.
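To illustrate what a depth limit buys (this is not the actual QDC implementation, which adapts the depth dynamically; the names here are illustrative): bounding concurrent fetches keeps the number of in-flight reads, and therefore open readers and cache downloads, bounded regardless of client parallelism.

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>

// Roughly the steady-state depth observed with QDC enabled in the test above.
seastar::semaphore read_depth{120};

// Wrap a read so that at most read_depth of them run concurrently.
template <typename ReadFunc>
seastar::future<> run_bounded_read(ReadFunc do_read) {
    auto units = co_await seastar::get_units(read_depth, 1);
    co_await do_read(); // units are released when they go out of scope
}
```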
This hasn't reoccurred in the last 30 days.
We have tracking elsewhere for limiting reader concurrency:
- https://github.com/redpanda-data/redpanda/issues/6024
- https://github.com/redpanda-data/redpanda/issues/6023
- https://github.com/redpanda-data/redpanda/pull/6112
So I think we can safely drop this ticket.