risingwave
Meta node hangs DDL processing when SASL connection setup times out
Describe the bug
The Meta node hung again, which blocked all DDLs. There are also many WARN log lines from librdkafka reporting connection timeouts:
{"timestamp":"2024-02-24T06:09:17.35828774Z","level":"WARN","fields":{"message":"librdkafka: FAIL [thrd:sasl_ssl://b0-xxx.aws.confluent.cloud:9092/boot]: sasl_ssl://b0-xxx.aws.confluent.cloud:9092/0: Connection setup timed out in state CONNECT (after 30034ms in state CONNECT, 1 identical error(s) suppressed)","log.target":"librdkafka","log.module_path":"madsim_rdkafka::std_::client","log.file":"/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/madsim-rdkafka-0.3.0+0.34.0/src/std/client.rs","log.line":78},"target":"librdkafka"}
The Meta node cannot process DDL commands, apparently because of the librdkafka connection timeouts (output of show processlist).
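To illustrate the failure mode (a caller that waits indefinitely on a stalled connection attempt), here is a minimal, generic sketch of bounding a blocking call with a deadline. It uses only the Rust standard library. `run_with_deadline` and `slow_connect` are hypothetical names made up for this example; this is not how RisingWave, madsim-rdkafka, or librdkafka handle the problem, and it is not the fix applied for this issue.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `work` on a helper thread and waits at most `limit` for its result.
/// Returns `None` if the deadline passes first; the helper thread is left
/// to finish (or stay blocked) in the background.
fn run_with_deadline<T, F>(limit: Duration, work: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore the send error: the receiver is gone if the caller timed out.
        let _ = tx.send(work());
    });
    rx.recv_timeout(limit).ok()
}

/// Hypothetical stand-in for a broker connection attempt that stalls.
fn slow_connect(stall: Duration) -> &'static str {
    thread::sleep(stall);
    "connected"
}

fn main() {
    // A connect that stalls longer than the deadline is abandoned.
    let stalled = run_with_deadline(Duration::from_millis(50), || {
        slow_connect(Duration::from_millis(500))
    });
    println!("stalled attempt: {:?}", stalled);

    // A connect that returns promptly is passed through.
    let quick = run_with_deadline(Duration::from_millis(500), || {
        slow_connect(Duration::from_millis(10))
    });
    println!("quick attempt: {:?}", quick);
}
```

The caller regains control once the deadline passes, so other requests can proceed. The trade-off is that the abandoned helper thread keeps running until the underlying call returns.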
Error message/log
"message":"librdkafka: FAIL [thrd:sasl_ssl://b0-xxx.aws.confluent.cloud:9092/boot]: sasl_ssl://b0-xxx.aws.confluent.cloud:9092/0: Connection setup timed out in state CONNECT (after 30032ms in state CONNECT, 1 identical error(s) suppressed
https://grafana.prod.risingwave.cloud/explore?panes=%7B%22rF-%22:%7B%22datasource%22:%22P5EC303186A5DB006%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bapp%3D%5C%22risingwave-meta-default-0%5C%22,%20namespace%3D%5C%22rwc-g1hmdvc3u9f88otor7j1kbpin2-thumbtack-prod-poc%5C%22%7D%20%7C~%20%60%28WARN%7CERROR%29%60%20%7C%20json%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22P5EC303186A5DB006%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221708750800000%22,%22to%22:%221708756259000%22%7D%7D%7D&schemaVersion=1&orgId=1
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
tenant: https://grafana.prod.risingwave.cloud/d/AdminDashboard_Tenant/tenant?var-datasource=PE662C12516FAE815&var-id=3&orgId=1
The version of RisingWave
PostgreSQL 9.5-RisingWave-1.6.1 (02ee186211e44001c645027bf5aca3db5f076d29)
Additional context
No response
cc @tabVersion @yezizp2012
Caused by https://github.com/confluentinc/librdkafka/pull/4460. @wangrunji0408, please help update madsim-rdkafka and patch the v1.6 and v1.7 release branches, thanks.
#15313 for main, #15314 for release-1.7, #15315 for release-1.6