[Bug]: mo cannot be connected to, failing with "internal error: no available CN server", during chaos test (one dn pod killed continuously at 10-minute intervals)

Open aressu1985 opened this issue 1 year ago • 10 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Branch Name

2.0-dev

Commit ID

410a540

Other Environment Information

- Hardware parameters:
3*CN: 7C 28G
1*DN: 7C 28G
3*PROXY: 2C 5G
3*LOG: 1C 7G
- OS type:
- Others:

Actual Behavior

[test load] run tpcc 10-10 and insert data into a table with 2 threads; during the test, the chaos tool continuously killed one tn pod at 10-minute intervals

[issue] after about 3 hours, mo could not be connected to, failing with "internal error: no available CN server":

[github@mo-srv-128 stability-test]$ mysql -h 10.222.6.253 -utpcc_test:admin -p111 -P6001
mysql: [Warning] Using a password on the command line interface can be insecure.
ERROR 20101 (HY000): internal error: no available CN server

mo-log: https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22NUU%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-chaos-bba26ea-202501081143%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221736314556557%22,%22to%22:%221736336101600%22%7D%7D%7D&schemaVersion=1&orgId=1

TN goroutine: goroutine.log

cn goroutine: CN_31306230-3661-6263-3235-613739373131_goroutine_0194455d-dca2-780f-b049-5cc5f4c33365.gz CN_31306230-3661-6263-3235-613739373131_goroutine_0194455d-6774-7dc0-b12b-4c8b8cd46ec5.gz

Expected Behavior

No response

Steps to Reproduce

[test load]
run tpcc 10-10
insert data into a table with 2 threads
during the test, the chaos tool continuously killed one log pod at 10-minute intervals
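
The kill schedule described above could be approximated with a loop like the one below. This is only a minimal sketch, not the chaos tool actually used in this test (the report does not name it): it assumes `kubectl` is available and simply deletes one pod per interval. The function names, the pod name, and the namespace are all hypothetical placeholders.

```python
import subprocess
import time
from typing import Callable, List


def kill_command(pod: str, namespace: str) -> List[str]:
    """Build the kubectl command that deletes a single pod."""
    return ["kubectl", "delete", "pod", pod, "-n", namespace]


def chaos_loop(
    pod: str,
    namespace: str,
    interval_s: float = 600.0,  # 10 minutes, as in the report
    rounds: int = 18,           # roughly 3 hours, when the error appeared
    run: Callable[[List[str]], object] = lambda cmd: subprocess.run(cmd, check=False),
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Delete the pod once per interval and return the number of kills issued.

    `run` and `sleep` are injectable so the schedule can be dry-run
    without touching a real cluster.
    """
    kills = 0
    for _ in range(rounds):
        sleep(interval_s)                  # wait one interval before each kill
        run(kill_command(pod, namespace))  # the pod's controller is expected to recreate it
        kills += 1
    return kills
```

For example, `chaos_loop("my-tn-pod-0", "my-chaos-namespace")` would issue 18 deletions ten minutes apart; both names are made up and must be replaced with the real pod and namespace of the cluster under test.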

Additional information

No response

aressu1985 avatar Jan 08 '25 11:01 aressu1985

@badboynt1

XuPeng-SH avatar Jan 08 '25 11:01 XuPeng-SH

@ouyuanning

XuPeng-SH avatar Jan 08 '25 11:01 XuPeng-SH

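Ran one 20-minute test today, but killing tn every 10 seconds. It hit a cn panic; a separate issue has been created to track that.

Also ran one 1-hour test, killing tn every 80 seconds and restarting tn 5 seconds later. The problem did not reproduce.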

ouyuanning avatar Jan 09 '25 10:01 ouyuanning

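Could not reproduce it today with the latest 2.0-dev. The problem that did show up when running the commit mentioned in the issue has been confirmed to be one that is already fixed.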

ouyuanning avatar Jan 10 '25 13:01 ouyuanning

Not reproduced.

ouyuanning avatar Jan 13 '25 11:01 ouyuanning

Not a REGRESSION issue; DELAY to a later version.

aressu1985 avatar Jan 13 '25 12:01 aressu1985

repro on 2025.01.15

mo-log: https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22aEr%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-chaos-2f6f3d7-202501150026%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221736887874693%22,%22to%22:%221736909437995%22%7D%7D%7D&schemaVersion=1&orgId=1

goroutine: CN_32343662-3437-3332-6363-316364383262_goroutine_019467da-e89b-742b-8cef-859a22319cdf.gz CN_32343662-3437-3332-6363-316364383262_goroutine_019467dc-48c1-7c0c-a364-11583bbaa27d.gz CN_32343662-3437-3332-6363-316364383262_goroutine_019467db-d19f-7b58-988f-8e9e31f2bd4e.gz CN_32343662-3437-3332-6363-316364383262_goroutine_019467db-5c49-7497-a4d6-d6c3ae2c75a8.gz

aressu1985 avatar Jan 15 '25 02:01 aressu1985

After https://github.com/matrixorigin/matrixone/pull/21279 was merged, the problem still exists. It is probably not the same issue as 4771.

ouyuanning avatar Jan 20 '25 08:01 ouyuanning

On annual leave.

ouyuanning avatar Jan 23 '25 11:01 ouyuanning

On annual leave.

ouyuanning avatar Jan 28 '25 11:01 ouyuanning

should be fixed by https://github.com/matrixorigin/matrixone/pull/21692

XuPeng-SH avatar Jul 02 '25 02:07 XuPeng-SH

The issue no longer appears in fault testing on the latest 2.2 version; closing for now.

heni02 avatar Aug 06 '25 03:08 heni02