Crash in `FranzGoVerifiableWithSiTest.test_si_without_timeboxed.segment_size=104857600`
FAIL test: FranzGoVerifiableWithSiTest.test_si_without_timeboxed.segment_size=104857600 (1/1 runs)
failure at 2022-07-31T11:04:20.402Z: NodeCrash([(<ducktape.cluster.cluster.ClusterNode object at 0x7f40db68a110>, "ERROR 2022-07-31 08:16:28,783 [shard 3] assert - Assert failure: (../../../src/v/storage/segment_reader.cc:215) '_parent == nullptr' Must close before destroying\n"), (<ducktape.cluster.cluster.ClusterNode object at 0x7f40db68b490>, "redpanda: /v/build/v_deps_build/seastar-prefix/src/seastar/include/seastar/core/future.hh:641: void seastar::future_state<seastar::internal::monostate>::set(A &&...) [T = seastar::internal::monostate, A = <>]: Assertion `_u.st == state::future' failed.\n")])
https://buildkite.com/redpanda/vtools/builds/3073#018252da-dd82-4dac-bad9-ea37369ded6e
Checked if this was related to #5613 and no: this one is a storage assertion, that one is a segfault.
If I had to guess, my suspicion would be that this is something in the SI cache changes, tearing down readers uncleanly on eviction perhaps.
There are a couple of issues I can see (17 GB log file so still analysing):
- there are bad allocs right before the assertion failure
DEBUG 2022-07-31 08:16:28,668 [shard 3] cloud_storage - [fiber4134~1~0|1|29999ms] - remote.cc:393 - Uploading segment to path {"992bbf98/kafka/topic-xucrhcigfy/65_16/105-1-v1.log.2"}
DEBUG 2022-07-31 08:16:28,668 [shard 3] cloud_storage - [fiber4134~3~0|1|29999ms] - remote.cc:393 - Uploading segment to path {"668a0054/kafka/topic-xucrhcigfy/65_16/110-1-v1.log.2"}
INFO 2022-07-31 08:16:28,669 [shard 3] cluster - ntp: {kafka/topic-xucrhcigfy/93} - archival_metadata_stm.cc:405 - creating snapshot at offset: 188, remote start_offset: 0, last_offset: 106
ERROR 2022-07-31 08:16:28,671 [shard 3] cloud_storage - [fiber4134~1~0|1|29996ms] - remote.cc:140 - Unexpected error std::bad_alloc (std::bad_alloc)
WARN 2022-07-31 08:16:28,671 [shard 3] cloud_storage - [fiber4134~1~0|1|29996ms] - remote.cc:456 - Uploading segment {"992bbf98/kafka/topic-xucrhcigfy/65_16/105-1-v1.log.2"} to {panda-bucket-c7ecfdc4-109d-11ed-8c86-57a9d452750b}, {failed}, segment not uploaded
ERROR 2022-07-31 08:16:28,674 [shard 1] cloud_storage - [fiber3360~49~0|1|29980ms] - remote.cc:140 - Unexpected error std::bad_alloc (std::bad_alloc)
WARN 2022-07-31 08:16:28,674 [shard 1] cloud_storage - [fiber3360~49~0|1|29980ms] - remote.cc:456 - Uploading segment {"d2dcc8b9/kafka/topic-xucrhcigfy/23_16/212-1-v1.log.2"} to {panda-bucket-c7ecfdc4-109d-11ed-8c86-57a9d452750b}, {failed}, segment not uploaded
ERROR 2022-07-31 08:16:28,678 [shard 3] cloud_storage - [fiber4134~3~0|1|29990ms] - remote.cc:140 - Unexpected error std::bad_alloc (std::bad_alloc)
WARN 2022-07-31 08:16:28,678 [shard 3] cloud_storage - [fiber4134~3~0|1|29990ms] - remote.cc:456 - Uploading segment {"668a0054/kafka/topic-xucrhcigfy/65_16/110-1-v1.log.2"} to {panda-bucket-c7ecfdc4-109d-11ed-8c86-57a9d452750b}, {failed}, segment not uploaded
ERROR 2022-07-31 08:16:28,678 [shard 3] archival - [fiber4134 kafka/topic-xucrhcigfy/65] - ntp_archiver_service.cc:148 - Failed to upload 2 segments out of 2
WARN 2022-07-31 08:16:28,695 [shard 2] kafka - connection_context.cc:365 - Error processing request: std::bad_alloc (std::bad_alloc)
DEBUG 2022-07-31 08:16:28,712 [shard 2] cloud_storage - [fiber19 kafka/topic-xucrhcigfy/40] - remote_partition.cc:394 - collecting stale materialized segments, 1 segments materialized, 1 segments total
DEBUG 2022-07-31 08:16:28,712 [shard 2] cloud_storage - [fiber19 kafka/topic-xucrhcigfy/40] - remote_partition.cc:432 - found 0 eviction candidates
ERROR 2022-07-31 08:16:28,739 [shard 1] cloud_storage - [fiber3360~47~0|1|29915ms] - remote.cc:140 - Unexpected error std::bad_alloc (std::bad_alloc)
WARN 2022-07-31 08:16:28,739 [shard 1] cloud_storage - [fiber3360~47~0|1|29915ms] - remote.cc:456 - Uploading segment {"404c6d6c/kafka/topic-xucrhcigfy/23_16/109-1-v1.log.2"} to {panda-bucket-c7ecfdc4-109d-11ed-8c86-57a9d452750b}, {failed}, segment not uploaded
- there is most likely a manifestation of another bug related to leadership query during segment upload: #5755
INFO 2022-07-31 08:16:28,343 [shard 0] raft - [group_id:75, {kafka/topic-xucrhcigfy/74}] vote_stm.cc:255 - becoming the leader term:2
WARN 2022-07-31 08:16:28,764 [shard 0] cloud_storage - [fiber4323~1~0|1|30000ms] - remote.cc:386 - lost leadership or term changed during upload, current leadership status: false, current term: 2, original term: 2: cancelled uploading {"ba5200e3/kafka/topic-xucrhcigfy/74_16/108-1-v1.log.2"} to {panda-bucket-c7ecfdc4-109d-11ed-8c86-57a9d452750b}
Here the node has won election leadership for partition kafka/topic-xucrhcigfy/74, which enables it to start ntp archival, but it does not have strong leadership, which causes the segment upload to be cancelled. This may not be the reason for the crash, however; a rough sketch of the leadership/term check is included below.
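As a rough illustration of the distinction above, here is a minimal sketch with illustrative names only (this is not redpanda's actual raft/archival API): the upload path records the term when the upload starts and re-checks both leadership and term before completing, which is what produces the "lost leadership or term changed during upload" warning.

```cpp
#include <cstdint>

// Illustrative stand-in for the partition/raft interface; not redpanda's API.
struct partition_stub {
    bool is_leader = false; // do we currently hold raft leadership?
    int64_t term = 0;       // current raft term
};

// Holding election leadership is enough to *start* archival, but completing an
// upload additionally requires that leadership is still held and that the term
// has not changed since the upload began; otherwise the upload is cancelled.
bool may_complete_upload(const partition_stub& p, int64_t original_term) {
    return p.is_leader && p.term == original_term;
}
```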
On further review, bad allocs appear throughout the log during segment uploads.
Regarding the error string in the OP (newlines mine):
failure at 2022-07-31T11:04:20.402Z:
NodeCrash([
(<ducktape.cluster.cluster.ClusterNode object at 0x7f40db68a110>,
"ERROR 2022-07-31 08:16:28,783 [shard 3] assert - Assert failure: (../../../src/v/storage/segment_reader.cc:215) '_parent == nullptr' Must close before destroying\n"),
(<ducktape.cluster.cluster.ClusterNode object at 0x7f40db68b490>,
"redpanda: /v/build/v_deps_build/seastar-prefix/src/seastar/include/seastar/core/future.hh:641: void seastar::future_state<seastar::internal::monostate>::set(A &&...) [T = seastar::internal::monostate, A = <>]: Assertion `_u.st == state::future' failed.\n")
])
Am I reading this right that two nodes failed, one with the segment_reader.cc:215 assert but then one with a future.hh:641 assert (in the seastar future internals)?
> Am I reading this right that two nodes failed, one with the segment_reader.cc:215 assert but then one with a future.hh:641 assert (in the seastar future internals)?
Yes. I didn't look closely at the second one because we're in the "all bets are off" mode of bad_alloc exception paths: I've seen that kind of future error before when we're not cleanly shutting down things like input streams.
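For context, here is a minimal sketch of the clean-shutdown pattern referred to above, assuming seastar's input_stream API; both the segment_reader "Must close before destroying" assert and the future-state assert point at objects being torn down with work still outstanding.

```cpp
#include <exception>

#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/iostream.hh>

// Streams must be close()d before destruction, even on the exception path;
// otherwise buffered I/O continuations can still be in flight when the
// object's state is destroyed.
seastar::future<> read_and_close(seastar::input_stream<char> in) {
    std::exception_ptr err;
    try {
        auto buf = co_await in.read(); // ... consume the stream ...
        (void)buf;
    } catch (...) {
        err = std::current_exception();
    }
    co_await in.close(); // always close before the stream goes away
    if (err) {
        std::rethrow_exception(err);
    }
}
```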
Probably a recurrence:
FAIL test: FranzGoVerifiableWithSiTest.test_si_without_timeboxed.segment_size=104857600 (1/1 runs) failure at 2022-08-04T08:40:43.589Z: NodeCrash([(<ducktape.cluster.cluster.ClusterNode object at 0x7f7ec932aa70>, "ERROR 2022-08-04 06:34:41,875 [shard 0] assert - Assert failure: (../../../src/v/storage/segment_reader.cc:215) '_parent == nullptr' Must close before destroying\n")]) in job https://buildkite.com/redpanda/vtools/builds/3104#018266cf-8592-4261-83db-4645611ed820
Seastar's abort on bad alloc showed a very large number of 128K entries in its report (this core had 7 GB of memory; the 128K row alone accounts for 51k spans and 6384M of used memory):
Page spans:
index size free used spans
0 4K 252K 26M 7k
1 8K 72K 3M 394
2 16K 32K 6M 365
3 32K 96K 13M 421
4 64K 32M 747M 12k
5 128K 0B 6384M 51k
6 256K 0B 512K 2
7 512K 0B 0B 0
8 1M 0B 0B 0
9 2M 0B 0B 0
10 4M 0B 0B 0
11 8M 0B 0B 0
12 16M 0B 16M 1
13 32M 0B 0B 0
14 64M 0B 0B 0
15 128M 0B 128M 1
16 256M 0B 0B 0
17 512M 0B 0B 0
18 1G 0B 0B 0
19 2G 0B 0B 0
20 4G 0B 0B 0
21 8G 0B 0B 0
22 16G 0B 0B 0
23 32G 0B 0B 0
24 64G 0B 0B 0
25 128G 0B 0B 0
26 256G 0B 0B 0
27 512G 0B 0B 0
28 1T 0B 0B 0
29 2T 0B 0B 0
30 4T 0B 0B 0
31 8T 0B 0B 0
Changing the write_behind value here https://github.com/redpanda-data/redpanda/blob/a31150a885c6b4598d6bfcead939b4162f20ab3a/src/v/cloud_storage/cache_service.h#L70 to use default_writebehind, and setting that value to 1 instead of 10 (so 1 instead of the current 128k), brings the CPU and memory usage way down; it represents the number of buffers to write in parallel.
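For reference, a minimal sketch of what that knob controls, assuming seastar's file_output_stream_options (the actual cache_service call site may differ): write_behind is the number of buffers kept in flight per output stream, so worst-case buffered memory is roughly buffer_size * write_behind - about 1.25 MiB per stream at a count of 10, versus ~16 GiB at 131072.

```cpp
#include <seastar/core/file.hh>
#include <seastar/core/fstream.hh>
#include <seastar/core/future.hh>

// Sketch only, not the actual cache_service code.
seastar::future<seastar::output_stream<char>>
open_cache_file_for_write(seastar::file f) {
    seastar::file_output_stream_options options;
    options.buffer_size = 128 * 1024; // 128 KiB per write buffer
    options.write_behind = 10;        // buffers written in parallel per stream
    return seastar::make_file_output_stream(std::move(f), options);
}
```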
I did not see bad allocs with that change, but saw a (perhaps 503-related) error for which I need to raise another ticket - these are bad log lines, not a crash:
WARN 2022-08-05 14:05:03,185 [shard 0] s3 - client.cc:513 - S3 replied with error: HTTP/1.1 503 Service Unavailable
Date: Fri, 05 Aug 22 14:05:03 GMT
Connection: close
x-amz-id-2: VExG3RA6cxTPs8bNKGzyTJyEqmI+N2mt3kCbLUWWmkObzwJqIyhROH98g38ZjeReJFsaenBZJ0d6OGJvTYKRQ5qtJYOqqQZE+VjRK/8QSIdvJ5vTn2gbaw==
x-amz-request-id: B1C5405E3AF8F0BB
Content-Type: application/xml
Server: AmazonS3
Content-Length: 0
ERROR 2022-08-05 14:05:03,185 [shard 0] s3 - client.cc:417 - !!error parse error boost::wrapexcept<boost::bad_lexical_cast> (bad lexical cast: source type value could not be interpreted as target)
ERROR 2022-08-05 14:05:03,186 [shard 0] cloud_storage - [fiber9~2~74~0|1|59992ms] - remote.cc:140 - Unexpected error boost::wrapexcept<boost::bad_lexical_cast> (bad lexical cast: source type value could not be interpreted as target)
WARN 2022-08-05 14:05:03,186 [shard 0] cloud_storage - [fiber9~2~74~0|1|59992ms] - remote.cc:250 - Failed downloading manifest from {panda-bucket-4f765872-14c6-11ed-986b-ff21a2d1a6da} {failed}, manifest at {"6501bc4c/kafka/topic-sryjfwfzkt/82_16/0-1-v1.log.1.tx"}
Still testing with more runs on cdt.
I made some other changes to the build (reduce size of chunk encoder for http etc) which are probably not related to this issue. Need to also revert those changes and do a clean test on CDT.
Tested with just the write_behind count fixed, and still got a bad alloc. The write-behind count is 10 and it uses a buffer size of 128 KB. Will try with a smaller count like 1 and the same buffer size.
Page spans:
index size free used spans
0 4K 764K 17M 5k
1 8K 112K 2M 221
2 16K 752K 3M 254
3 32K 512K 7M 232
4 64K 17M 743M 12k
5 128K 0B 6421M 51k
6 256K 0B 512K 2
7 512K 0B 0B 0
8 1M 0B 0B 0
9 2M 0B 0B 0
10 4M 0B 0B 0
11 8M 0B 0B 0
12 16M 0B 16M 1
13 32M 0B 0B 0
14 64M 0B 0B 0
15 128M 0B 128M 1
16 256M 0B 0B 0
17 512M 0B 0B 0
18 1G 0B 0B 0
19 2G 0B 0B 0
20 4G 0B 0B 0
21 8G 0B 0B 0
22 16G 0B 0B 0
23 32G 0B 0B 0
24 64G 0B 0B 0
25 128G 0B 0B 0
26 256G 0B 0B 0
27 512G 0B 0B 0
28 1T 0B 0B 0
29 2T 0B 0B 0
30 4T 0B 0B 0
31 8T 0B 0B 0
ERROR 2022-08-05 15:55:14,101 [shard 3] seastar - Failed to allocate 131072 bytes
Aborting on shard 3.
Backtrace:
0x475b3b6
0x47be282
0x294b4040f41f
/opt/redpanda/lib/libc.so.6+0x4300a
/opt/redpanda/lib/libc.so.6+0x22858
0x46bf50d
0x46cd6a5
0x465655d
0x47787bf
0x477c497
0x47bf965
0x471934f
/opt/redpanda/lib/libpthread.so.0+0x8608
/opt/redpanda/lib/libc.so.6+0x11f132
Not seeing any memory/allocation related errors on write behind (parallel writes) set to 1. Memory usage is stable throughout the test at around 65-70% out of 32 GB on the node.
However with this set of values the test run is really slow and even after running for hours the sequential consumer was not able to consume the full 20k messages (although it made constant, very slow progress). Need to investigate the slowness.
It looks like the write-behind value for the cache service needs to be configurable, as @Lazin suggested.
Excellent investigation @abhijat
Is this issue something that needs to be resolved for the release? or is this strictly for scale testing?
> Is this issue something that needs to be resolved for the release? or is this strictly for scale testing?
I think we should fix the writebehind value for the cache service from 131072 to the original intended value of 10 before release. I will raise a PR for it soon.
Making it part of adjustable config can be done in a future PR perhaps.
I ran the test overnight and it failed because the random consumer ran into timeout after consuming around 15k of the 20k expected messages.
The broker logs did not show any bad allocs, only some errors related to 503 from amazon which were recovered, no crashes.
Will try once again with a consumer timeout at a very high value to check if it finishes.
Tried several more tests with varying parameters; any value >= 5 for the default_writebehind parameter results in bad allocs in my testing, but there are a few things not clear yet:
- the writebehind value bug is very old, at least from 2021, so it should have consistently affected the test, but it did not until recently. Why this new set of failures?
- the segment size in the failing test is 100 * 2**20 bytes and the cloud storage cache size is 5 * 2**20 bytes. This seems to result in the cache continuously evicting segments as the random consumer triggers segment downloads. Is the cache-size-to-segment-size ratio realistic? What is a good cache size for testing? Assume it would be some multiple of the segment size. Even so, why did this test pass earlier even when the cache size is unrealistically small? pandaresults shows a few failures in the last 4 months but a majority pass, which hints at some recent regression.
- the 128 KiB cache service write buffer doesn't seem to cause the bulk of the 51k objects when the shard runs out of memory; I switched it to 256 KiB and when the shard aborted on memory exhaustion the 51k objects were still 128 KiB in size, not 256 KiB.
With increased cache size (around 500MB) the test passes very quickly.
Still have not been able to capture a flame graph of memory usage from the CDT node; gdb complains about missing thread information / a libthread-db mismatch.
Currently trying to get the redpanda process to dump core on abort, as well as attaching a local gdb to a gdbserver running on the remote node, since the flamegraph script works locally and the libpthread on the remote is stripped (which apparently can prevent gdb from working with multithreaded applications).
Maybe this is caused by the index encoder. SI index files are compressed and are below 1 KiB most of the time, but the encoder/decoder are both using iobuf to store data under the hood. And it looks like the max allocation size is 128 KiB https://github.com/redpanda-data/redpanda/blob/9949ee880eeb5814ad01fd667d1269303d82ccc6/src/v/bytes/iobuf.h#L217 so maybe, when the segment index is materialized from disk, it somehow allocates a 128K chunk.
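A generic illustration of that suspicion (this is not iobuf's actual implementation): if the buffer backing the index always reserves the maximum fragment size, a sub-1 KiB compressed index still pins a full 128 KiB allocation, which would line up with the 51k x 128 KiB spans in the allocator reports above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical buffer policy, for illustration only.
constexpr std::size_t max_fragment_size = 128 * 1024; // 128 KiB

std::vector<char> materialize_index(std::size_t payload_bytes) {
    std::vector<char> buf;
    buf.reserve(max_fragment_size); // full 128 KiB reserved even for ~1 KiB of data
    buf.resize(payload_bytes);
    return buf;
}
```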
Maybe we are not able to decrement the cache size here https://github.com/redpanda-data/redpanda/blob/dev/src/v/cloud_storage/cache_service.cc#L161 due to shadowing - is the cache size variable that gets updated a local variable? It seems to be the only place where the cache size is decremented.
So possibly we are always trying to clean up? But if this variable is not shadowing the instance field then this should be okay.
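A minimal sketch of the suspected shadowing pattern, with hypothetical names (the actual cache_service code is not reproduced here): if the cleanup path decrements a local copy instead of the member, the tracked size never shrinks and every put keeps triggering cleanup.

```cpp
#include <cstdint>

struct cache_sketch {
    uint64_t _current_cache_size = 0;

    // Buggy variant: the decrement lands on a local copy, so the member never
    // shrinks after files are deleted from the cache.
    void on_file_deleted_buggy(uint64_t file_size) {
        uint64_t current_cache_size = _current_cache_size; // local copy
        current_cache_size -= file_size;                    // member unchanged
        (void)current_cache_size;
    }

    // Fixed variant: decrement the member itself.
    void on_file_deleted_fixed(uint64_t file_size) {
        _current_cache_size -= file_size;
    }
};
```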
Ran a build on CDT nodes which replaces the new per-cache-put eviction with the old 30-second-based eviction logic, and it consistently passes the test within 4 minutes.
The cache size was set to 5 MB with the segment size at 100 MB for this test. Also saw the cache grow to much more than the allowed limit several times during the test:
ubuntu@ip-172-31-20-3:~/tests/results/2022-08-08--016/FranzGoVerifiableWithSiTest/test_si_without_timeboxed/segment_size=104857600/1/RedpandaService-0-140178880656048$ grep -ERi _cache_size.*,.*max * | cut -d, -f2 | awk -F: '{print $NF}' | sort -nu
0
776
808
880
902
910
966
22937600
23987086
108011804
179020539
274333696
368802304
456113390
526057196
1751682050
1794195622
2546280062
2859795735
4098213719
ip-172-31-41-175/redpanda.log:DEBUG 2022-08-08 12:30:47,128 [shard 0] cloud_storage - cache_service.cc:167 - cache::clean_up_cache: _current_cache_size: 4098213719, _max_cache_size: 5242880, candidates_for_deletion: 84
So it seems that with the periodic cleanup the consumers in the test were able to read during the windows between cleanups, because the cache was allowed to grow to much more than 5 MB.
With the stricter eviction we have now, this no longer happens, which causes the test to fail: it creates a cycle of download -> cache::put -> evict the only segment in the cache.
This still doesn't explain what causes the constantly increasing memory during the cache churn. We should adjust the test to the new eviction system and continue to investigate memory usage during the pathological case (small cache + larger segment size).
Additionally we can add validation to avoid cache size smaller than segment size https://github.com/redpanda-data/redpanda/issues/5896
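A sketch of the kind of validation meant here, using hypothetical names (the actual change is tracked in the issue above): reject configurations where the cache cannot hold even one segment, since that guarantees the download -> put -> evict churn described above.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

void validate_cloud_cache_config(
  uint64_t cache_size_bytes, uint64_t segment_size_bytes) {
    if (cache_size_bytes < segment_size_bytes) {
        throw std::invalid_argument(
          "cloud storage cache size (" + std::to_string(cache_size_bytes)
          + " bytes) must be at least one segment ("
          + std::to_string(segment_size_bytes) + " bytes)");
    }
}
```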
Probably won't help at this point, but here's another recent occurrence: https://buildkite.com/redpanda/vtools/builds/3133#01827642-8678-41b8-a243-c532aa5f80ce.
I can reproduce this locally by setting a small cache size, e.g. 5 MB, and simply running kgo-verifier like this:
(main) $ ./kgo-verifier --brokers localhost:9092 --topic $TOPIC --msg_size 128000 --produce_msgs 0 --rand_read_msgs 50 --seq_read=0 --parallel 64
I suspected that the number of read requests was piling up as more and more requests came in, so I added a counter in kafka/server/handlers/fetch.cc (requests_in_progress is an atomic uint):
```cpp
cluster_pm.requests_in_progress++;
vlog(
  klog.warn, "READ QUEUED:: reads in progress are: {}", cluster_pm.requests_in_progress);
auto read_f = co_await read_from_partition(
  std::move(*kafka_partition), ntp_config.cfg, foreign_read, deadline);
cluster_pm.requests_in_progress--;
vlog(
  klog.warn, "READ FIN:: reads in progress are: {}", cluster_pm.requests_in_progress);
co_return read_f;
```
When running kgo-verifier, the number of in-flight read requests keeps steadily increasing, until at one point we start running out of FDs:
WARN 2022-08-11 22:27:22,227 [shard 1] seastar - Exceptional future ignored: std::__1::__fs::filesystem::filesystem_error (error system:24, filesystem error: open failed: Too many open files ["/home/abhijat/redpanda-data/cloud_storage_cache"]), backtrace: 0x3438e99a 0x4655dc54 0x4
This doesn't show up on CDT though (the local ulimit is just low, at 1024), but it does show that we open up a lot of files.
abhijat@terminus-est ~/dev/redpanda
(dev) $ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 126331
max locked memory (kbytes, -l) 65536
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 126331
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
When queue depth control is enabled via kafka_qdc_enable: true in the cluster config, the number of in-flight requests stays close to 120.
With the number of parallel reads set to 8, 50 reads in total, and QDC enabled, redpanda is able to finish the requests within a short period of time even with a small cache like 5 MB. For something like 200 reads / 8 parallel it becomes really slow and takes nearly 10 minutes to finish; it does, however, finish without running into bad allocs with QDC enabled.
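To illustrate what a depth limit buys (this is not the actual QDC implementation, which adapts the depth dynamically; the names here are illustrative): bounding concurrent fetches keeps the number of in-flight reads, and therefore open readers and cache downloads, bounded regardless of client parallelism.

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>

// Roughly the steady-state depth observed with QDC enabled in the test above.
seastar::semaphore read_depth{120};

// Wrap a read so that at most read_depth of them run concurrently.
template <typename ReadFunc>
seastar::future<> run_bounded_read(ReadFunc do_read) {
    auto units = co_await seastar::get_units(read_depth, 1);
    co_await do_read(); // units are released when they go out of scope
}
```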
This hasn't reoccurred in the last 30 days.
We have tracking elsewhere for limiting reader concurrency:
- https://github.com/redpanda-data/redpanda/issues/6024
- https://github.com/redpanda-data/redpanda/issues/6023
- https://github.com/redpanda-data/redpanda/pull/6112
So I think we can safely drop this ticket.