matrixone icon indicating copy to clipboard operation
matrixone copied to clipboard

[Bug]: sysbench test reported error FATAL: Worker threads failed to initialize within 30 seconds!

Open heni02 opened this issue 1 year ago • 7 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Branch Name

main

Commit ID

6a43a6e7a

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9567035983/job/26400260798 企业微信截图_d7a65ef7-a6e7-42e3-b87e-ae6716c720ba

mo log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22kbI%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20240618%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221718774634000%22,%22to%22:%221718774664000%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

tke regression sysbench1000w nopk-load and nopk-point-select test

Additional information

No response

heni02 avatar Jun 19 '24 09:06 heni02

https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22aaX%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bmatrixorigin_io_namespace%3D%5C%22mo-error-case--nightly-57a9e8776-20240624%5C%22%7D%20%7C%3D%20%60slow%20event%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%22now-12h%22,%22to%22:%22now%22%7D%7D%7D&schemaVersion=1&orgId=1

没有记录到慢的 I/O,不是 io 开销过大的问题。

reusee avatar Jun 24 '24 04:06 reusee

原始错误信息 企业微信截图_a339aa93-a068-4a68-9016-dc1be1b8adfc

reusee avatar Jun 24 '24 04:06 reusee

在 2024-06-24 04:00 左右的错误信息:https://grafana.ci.matrixorigin.cn/goto/KtEhzSQIR?orgId=1

可以看到有大量的 failed to init stats info for table 以及 connect: cannot assign requested address。后者似乎是因为fd耗尽,无法bind导致,可能是有网络连接未关闭,或者同时发起太多。

而且只有 04:00 这个时间有 connect: cannot assign requested address 的错误信息:https://grafana.ci.matrixorigin.cn/goto/Agb4iSwIR?orgId=1

reusee avatar Jun 24 '24 04:06 reusee

Is there an existing issue for the same bug?

* [x]  I have checked the existing issues.

Branch Name

main

Commit ID

6a43a6e

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9567035983/job/26400260798 企业微信截图_d7a65ef7-a6e7-42e3-b87e-ae6716c720ba

mo log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22kbI%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20240618%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221718774634000%22,%22to%22:%221718774664000%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

tke regression sysbench1000w nopk-load and nopk-point-select test

Additional information

No response

这里又是另一个问题:https://grafana.ci.matrixorigin.cn/goto/nyU3iIQIR?orgId=1

看上去 cannot get table by ID 是主因,也是 disttae 里的 stats 相关。

reusee avatar Jun 24 '24 05:06 reusee

两次重现,都和 disttae 包内的 stats 相关,先转给 @volgariver6 做进一步分析

reusee avatar Jun 24 '24 05:06 reusee

用改过之后的分支跑,还是会有问题,目前没有定位到具体原因,只是观察到在出问题的时候,创建了大量的goroutines。

volgariver6 avatar Jun 27 '24 12:06 volgariver6

尚无进展

volgariver6 avatar Jul 03 '24 15:07 volgariver6

正在测试

volgariver6 avatar Jul 08 '24 11:07 volgariver6

目前测试还是会报错,麻烦 @reusee 继续跟踪一下

volgariver6 avatar Jul 09 '24 07:07 volgariver6

https://github.com/matrixorigin/mo-auto-test/actions/runs/9890430540 测试两次没有问题了

XuPeng-SH avatar Jul 12 '24 02:07 XuPeng-SH

confirm,closed commit:88e486e11 https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9976227425

heni02 avatar Jul 19 '24 04:07 heni02

0816: commit:081becbc9158278b3bd974075a40e3539aafa1db

job url:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/10405469110/job/28842654264

image

第一次出现 FATAL的log:https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22F03%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-main-nightly-081becbc9-20240815%5C%22%7D%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221723786939000%22,%22to%22:%221723786969000%22%7D%7D%7D&schemaVersion=1&orgId=1

Ariznawlll avatar Aug 16 '24 07:08 Ariznawlll

@Ariznawlll 这个麻烦跑一下二分吧,估计是最近的哪个pr引起

volgariver6 avatar Aug 16 '24 08:08 volgariver6

等待二分结果

volgariver6 avatar Aug 21 '24 10:08 volgariver6

二分查到是18100pr,cc @ouyuanning 企业微信截图_95491f00-ac90-4023-b160-ddc6e6b464a3

job:https://github.com/matrixorigin/mo-auto-test/actions/runs/10508265514/job/29112655376 企业微信截图_94e9defb-bbe9-4b55-b7b7-fe958119c386

heni02 avatar Aug 23 '24 09:08 heni02

二分查到是18100pr,cc @ouyuanning 企业微信截图_95491f00-ac90-4023-b160-ddc6e6b464a3

job:https://github.com/matrixorigin/mo-auto-test/actions/runs/10508265514/job/29112655376 企业微信截图_94e9defb-bbe9-4b55-b7b7-fe958119c386

@heni02 确定是 18100 这个pr 么? 这个pr 只是调用了reader.Close , reader.Close 里面啥都没干,就是把一个迭代器复位了下.

triump2020 avatar Aug 27 '24 08:08 triump2020

报错跟#17793一样, 目前1.2-dev也发现类似问题。应该不是二分到的这2个PR引入的

ouyuanning avatar Aug 31 '24 08:08 ouyuanning

更新下1.2报错信息: 1.2-dev commit:59d156b46 job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/10634487640/job/29503486397 企业微信截图_921e8c75-c715-416e-90b4-903e79e0d9db mo log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22qFM%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-branch-nightly-59d156b46-20240830%5C%22%7D%20%7C%3D%20%60operation%20error%20S3:%20GetObject,%20exceeded%20maximum%20number%20of%20attempts,%203%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221725083373948%22,%22to%22:%221725083702716%22%7D%7D%7D&schemaVersion=1&orgId=1

heni02 avatar Aug 31 '24 10:08 heni02

需要采集进程的socket信息,才能确认泄漏源头

reusee avatar Sep 11 '24 15:09 reusee

无进展

reusee avatar Sep 18 '24 12:09 reusee

无进展

reusee avatar Sep 24 '24 05:09 reusee

无进展

reusee avatar Sep 27 '24 14:09 reusee

1.2-dev commit:b0f3a5481复现了,环境已配置相关参数 job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/11142247340/job/30999607062 企业微信截图_65967e6e-195a-4ffe-bd53-b4d547c756c8 yaml: https://github.com/matrixorigin/mo-nightly-regression/blob/main/mo-bench-tke.yaml

threads初始时间:09:36:09 报错时间:09:36:39 mo log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22Wjp%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-branch-nightly-b0f3a5481-20241002%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221727919369000%22,%22to%22:%221727919399000%22%7D%7D%7D&schemaVersion=1&orgId=1 企业微信截图_0ee34a95-f430-4bd3-90db-fea1dddebe1f

heni02 avatar Oct 03 '24 02:10 heni02

main commit:94cbcf83d 复现了该问题 job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/11276243006/job/31388222596 企业微信截图_95843a59-6a3b-4689-a815-a5dfdd2c3fdf

日志: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22Q0w%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-main-nightly-94cbcf83d-20241010%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221728623580000%22,%22to%22:%221728623640000%22%7D%7D%7D&schemaVersion=1&orgId=1

heni02 avatar Oct 11 '24 06:10 heni02

https://grafana.ci.matrixorigin.cn/goto/PtIcrekNg?orgId=1

从日志看,dest port 为 443 的连接数量达到两万八。

但 fileservice 内部已经限制了并发的 reader 的数量,暂时不清楚哪里导致了这个问题。

reusee avatar Oct 11 '24 06:10 reusee

working on other issues.

reusee avatar Oct 16 '24 13:10 reusee

最近2两天的REGRESSION都没有在出现过改问题,改为testing状态

aressu1985 avatar Oct 24 '24 13:10 aressu1985