[Bug]: sysbench test reported error FATAL: Worker threads failed to initialize within 30 seconds!
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Branch Name
main
Commit ID
6a43a6e7a
Other Environment Information
- Hardware parameters:
- OS type:
- Others:
Actual Behavior
job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9567035983/job/26400260798
mo log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22kbI%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20240618%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221718774634000%22,%22to%22:%221718774664000%22%7D%7D%7D&schemaVersion=1&orgId=1
Expected Behavior
No response
Steps to Reproduce
tke regression sysbench1000w nopk-load and nopk-point-select test
Additional information
No response
https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22aaX%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bmatrixorigin_io_namespace%3D%5C%22mo-error-case--nightly-57a9e8776-20240624%5C%22%7D%20%7C%3D%20%60slow%20event%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%22now-12h%22,%22to%22:%22now%22%7D%7D%7D&schemaVersion=1&orgId=1
没有记录到慢的 I/O,不是 io 开销过大的问题。
原始错误信息
在 2024-06-24 04:00 左右的错误信息:https://grafana.ci.matrixorigin.cn/goto/KtEhzSQIR?orgId=1
可以看到有大量的 failed to init stats info for table 以及 connect: cannot assign requested address。后者似乎是因为fd耗尽,无法bind导致,可能是有网络连接未关闭,或者同时发起太多。
而且只有 04:00 这个时间有 connect: cannot assign requested address 的错误信息:https://grafana.ci.matrixorigin.cn/goto/Agb4iSwIR?orgId=1
Is there an existing issue for the same bug?
* [x] I have checked the existing issues.Branch Name
main
Commit ID
Other Environment Information
- Hardware parameters: - OS type: - Others:Actual Behavior
job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9567035983/job/26400260798
mo log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22kbI%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20240618%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221718774634000%22,%22to%22:%221718774664000%22%7D%7D%7D&schemaVersion=1&orgId=1
Expected Behavior
No response
Steps to Reproduce
tke regression sysbench1000w nopk-load and nopk-point-select testAdditional information
No response
这里又是另一个问题:https://grafana.ci.matrixorigin.cn/goto/nyU3iIQIR?orgId=1
看上去 cannot get table by ID 是主因,也是 disttae 里的 stats 相关。
两次重现,都和 disttae 包内的 stats 相关,先转给 @volgariver6 做进一步分析
用改过之后的分支跑,还是会有问题,目前没有定位到具体原因,只是观察到在出问题的时候,创建了大量的goroutines。
尚无进展
正在测试
目前测试还是会报错,麻烦 @reusee 继续跟踪一下
https://github.com/matrixorigin/mo-auto-test/actions/runs/9890430540 测试两次没有问题了
confirm,closed commit:88e486e11 https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9976227425
0816: commit:081becbc9158278b3bd974075a40e3539aafa1db
job url:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/10405469110/job/28842654264
第一次出现 FATAL的log:https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22F03%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-main-nightly-081becbc9-20240815%5C%22%7D%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221723786939000%22,%22to%22:%221723786969000%22%7D%7D%7D&schemaVersion=1&orgId=1
@Ariznawlll 这个麻烦跑一下二分吧,估计是最近的哪个pr引起
等待二分结果
二分查到是18100pr,cc @ouyuanning
job:https://github.com/matrixorigin/mo-auto-test/actions/runs/10508265514/job/29112655376
二分查到是18100pr,cc @ouyuanning
job:https://github.com/matrixorigin/mo-auto-test/actions/runs/10508265514/job/29112655376
@heni02 确定是 18100 这个pr 么? 这个pr 只是调用了reader.Close , reader.Close 里面啥都没干,就是把一个迭代器复位了下.
报错跟#17793一样, 目前1.2-dev也发现类似问题。应该不是二分到的这2个PR引入的
更新下1.2报错信息:
1.2-dev commit:59d156b46
job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/10634487640/job/29503486397
mo log:
https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22qFM%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-branch-nightly-59d156b46-20240830%5C%22%7D%20%7C%3D%20%60operation%20error%20S3:%20GetObject,%20exceeded%20maximum%20number%20of%20attempts,%203%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221725083373948%22,%22to%22:%221725083702716%22%7D%7D%7D&schemaVersion=1&orgId=1
需要采集进程的socket信息,才能确认泄漏源头
无进展
无进展
无进展
1.2-dev commit:b0f3a5481复现了,环境已配置相关参数
job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/11142247340/job/30999607062
yaml:
https://github.com/matrixorigin/mo-nightly-regression/blob/main/mo-bench-tke.yaml
threads初始时间:09:36:09
报错时间:09:36:39
mo log:
https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22Wjp%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-branch-nightly-b0f3a5481-20241002%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221727919369000%22,%22to%22:%221727919399000%22%7D%7D%7D&schemaVersion=1&orgId=1
main commit:94cbcf83d
复现了该问题
job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/11276243006/job/31388222596
日志: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22Q0w%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-main-nightly-94cbcf83d-20241010%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221728623580000%22,%22to%22:%221728623640000%22%7D%7D%7D&schemaVersion=1&orgId=1
https://grafana.ci.matrixorigin.cn/goto/PtIcrekNg?orgId=1
从日志看,dest port 为 443 的连接数量达到两万八。
但 fileservice 内部已经限制了并发的 reader 的数量,暂时不清楚哪里导致了这个问题。
working on other issues.
最近2两天的REGRESSION都没有在出现过改问题,改为testing状态


