
[Bug]: [date 3.10]tke regression: sysbench 1000w delete/update auto_increment index reported stream closed

Open heni02 opened this issue 11 months ago • 3 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Branch Name

main

Commit ID

15af2cf1a

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

Job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/8222697705/job/22484639356
[screenshot: 企业微信截图_36444307-0e9e-4b2e-b24b-0db45008d977]

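The schema for the sysbench 1000w delete/update test now includes an auto-increment column and an index (first tested on date 3.10). The previous workflow included neither an auto-increment column nor an index.
[screenshot: 企业微信截图_3d466732-f61f-45b5-bfec-9c836e91b760]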

Log: http://175.178.192.213:30088/explore?panes=%7B%22IeX%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20240310%5C%22%7D%20%7C%3D%20%60stream%20closed%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221710118800000%22,%22to%22:%221710125999000%22%7D%7D%7D&schemaVersion=1&orgId=1
The log contains a large number of "use of closed network connection" errors. Please investigate whether this is what causes the stream closed error, and whether this behavior is expected.
[screenshot: 企业微信截图_7f7bbb22-a6c9-46d6-a17f-4892cb0f1e8b]

Expected Behavior

No response

Steps to Reproduce

tke regression sysbench 1000w delete/update test

Additional information

No response

Mar 11 '24 03:03 heni02

The date 3.11 regression delete/update test also hit this issue. Job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/8234567942/job/22539827858
[screenshot: 企业微信截图_5576761d-490e-4841-96b8-11a57336058e]
[screenshot: 企业微信截图_4813e78d-2561-4184-8dae-56da0368805d]
Other sysbench scenarios hit it as well.
[screenshot: 企业微信截图_f75f475d-6b79-4338-af87-265a28c388db]

Log: again a large number of "use of closed network connection" errors.
[screenshot: 企业微信截图_78394457-416a-4429-9306-e5afdc13f632]
http://175.178.192.213:30088/explore?panes=%7B%227iC%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20240311%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221710209640000%22,%22to%22:%221710209700000%22%7D%7D%7D&schemaVersion=1&orgId=1

Mar 12 '24 03:03 heni02

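1. The direct cause is that the heartbeat check timed out.
2. The heartbeat check timed out because messages could not be inserted into the rpc consumer queue and waited too long (more than 40 seconds, while the timeout configured here is 10 seconds).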

After discussing with 莫尘 and 张旭, we will apply a temporary fix first: extend the rpc wait timeout to 120.

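The eventual solution will probably be:
1. Find out why it blocks (slow execution, saturated IO, ...) and apply the corresponding fix.
2. Separate the heartbeat check from the rest.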

Mar 13 '24 06:03 ouyuanning

date 3.13: the issue still occurs after PR #14947 was merged. https://github.com/matrixorigin/mo-nightly-regression/actions/runs/8266410811/job/22634292651
[screenshot: 企业微信截图_b3146675-05ee-4174-97a2-e320948b1e3c]

Mar 14 '24 02:03 heni02

The probability has gone down, but the problem is still there. I have not had time to dig further yet.

Mar 20 '24 01:03 ouyuanning

tke sysbench 1000w delete: tps is only 5.
[screenshot: 企业微信截图_4feef7b7-d796-4dbd-9cdc-25895d3186ae]

Mar 25 '24 08:03 heni02

Currently working on the prepare refactoring.

Apr 24 '24 12:04 ouyuanning

张旭's PR https://github.com/matrixorigin/matrixone/pull/15181 may have fixed this issue.

Apr 28 '24 07:04 ouyuanning

Recent regression runs have not shown stream closed, but the rpc blocking problem still exists. Downgrading to s1 for now.

Apr 28 '24 07:04 heni02

张旭, could you please help take care of this? Thanks.

Apr 28 '24 07:04 ouyuanning

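The PRs from 明松 and nitao have already fixed the premature-termination problem. Let's see whether it still occurs. No effort has been put into looking at it yet.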

Jul 03 '24 06:07 zhangxu19830126

Closing this issue due to inactivity. Feel free to reopen or create a new issue if needed. Thanks!

Jul 03 '24 08:07 sukki37