matrixone [Bug]: Occasional Load Timeout in Merge Run TPCH Testing

Is there an existing issue for the same bug?

[X] I have checked the existing issues.

Environment

- Version or commit-id (e.g. v0.1.0 or 8b23a93): 85d55d5
- Hardware parameters:
- OS type:
- Others:

Actual Behavior

Occasional load timeouts occur in the TPCH test during a merge run. https://github.com/matrixorigin/matrixone/actions/runs/7015043809/job/19084522575

Abnormal situation

2023-11-28 07:20:13 load data url s3option {'endpoint'='cos.ap-chengdu.myqcloud.com','access_key_id'='***','secret_access_key'='***','bucket'='mo-load-data-1308875761', 'filepath'='tpch_10/part.tbl'} into table tpch_10g.part fields terminated by '|' lines terminated by '\n' parallel 'true';
2023-11-28 07:21:03 The data for table tpch_10g.part has been loaded successfully, and cost: 49643
2023-11-28 07:21:04 The table tpch_10g.part has been flushed
2023-11-28 07:21:04 This test for [s3_part_tpch_10g] has been executed successfully

2023-11-28 07:21:04 [s3_partsupp_tpch_10g]
2023-11-28 07:21:04 Start to create database[tpch_10g]
2023-11-28 07:21:04 Succeed to create database[tpch_10g]
2023-11-28 07:21:04 Start to drop table [tpch_10g.partsupp].
2023-11-28 07:21:04 Succeed to drop table [tpch_10g.partsupp]
2023-11-28 07:21:04 Start to create table [tpch_10g.partsupp]
2023-11-28 07:21:04 Succeed to create table [tpch_10g.partsupp]
2023-11-28 07:21:10 Start to load data from file tpch_10/partsupp.tbl into table tpch_10g.partsupp,please wait.....
2023-11-28 07:21:10 load data url s3option {'endpoint'='cos.ap-chengdu.myqcloud.com','access_key_id'='***','secret_access_key'='***','bucket'='mo-load-data-1308875761', 'filepath'='tpch_10/partsupp.tbl'} into table tpch_10g.partsupp fields terminated by '|' lines terminated by '\n' parallel 'true';
2023-11-28 07:21:47 The data for table tpch_10g.partsupp has been loaded successfully, and cost: 37139
2023-11-28 07:21:48 The table tpch_10g.partsupp has been flushed

Expected Behavior

Tue, 28 Nov 2023 08:36:24 GMT
2023-11-28 08:36:24 The data for table tpch_10g.nation has been loaded successfully, and cost: 9686
Tue, 28 Nov 2023 08:36:25 GMT
2023-11-28 08:36:25 The table tpch_10g.nation has been flushed
Tue, 28 Nov 2023 08:36:25 GMT
2023-11-28 08:36:25 This test for [s3_nation_tpch_10g] has been executed successfully
Tue, 28 Nov 2023 08:36:25 GMT

Tue, 28 Nov 2023 08:36:25 GMT
2023-11-28 08:36:25 [s3_orders_tpch_10g]
Tue, 28 Nov 2023 08:36:25 GMT
2023-11-28 08:36:25 Start to create database[tpch_10g]
Tue, 28 Nov 2023 08:36:26 GMT
2023-11-28 08:36:26 Succeed to create database[tpch_10g]
Tue, 28 Nov 2023 08:36:26 GMT
2023-11-28 08:36:26 Start to drop table [tpch_10g.orders].
Tue, 28 Nov 2023 08:36:26 GMT
2023-11-28 08:36:26 Succeed to drop table [tpch_10g.orders]
Tue, 28 Nov 2023 08:36:26 GMT
2023-11-28 08:36:26 Start to create table [tpch_10g.orders]
Tue, 28 Nov 2023 08:36:27 GMT
2023-11-28 08:36:27 Succeed to create table [tpch_10g.orders]
Tue, 28 Nov 2023 08:36:33 GMT
2023-11-28 08:36:33 Start to load data from file tpch_10/orders.tbl into table tpch_10g.orders,please wait.....
Tue, 28 Nov 2023 08:36:33 GMT
2023-11-28 08:36:33 load data url s3option {'endpoint'='cos.ap-chengdu.myqcloud.com','access_key_id'='***','secret_access_key'='***','bucket'='mo-load-data-13088[75](https://github.com/matrixorigin/matrixone/actions/runs/7015043809/job/19086507767#step:9:76)[76](https://github.com/matrixorigin/matrixone/actions/runs/7015043809/job/19086507767#step:9:77)1', 'filepath'='tpch_10/orders.tbl'} into table tpch_10g.orders fields terminated by '|' lines terminated by '\n' parallel 'true';
Tue, 28 Nov 2023 08:36:55 GMT
2023-11-28 08:36:55 The data for table tpch_10g.orders has been loaded successfully, and cost: 22389
Tue, 28 Nov 2023 08:36:56 GMT
2023-11-28 08:36:56 The table tpch_10g.orders has been flushed
Tue, 28 Nov 2023 08:36:56 GMT
2023-11-28 08:36:56 This test for [s3_orders_tpch_10g] has been executed successfully
Tue, 28 Nov 2023 08:36:56 GMT

Tue, 28 Nov 2023 08:36:56 GMT
2023-11-28 08:36:56 [s3_part_tpch_10g]
Tue, 28 Nov 2023 08:36:56 GMT
2023-11-28 08:36:56 Start to create database[tpch_10g]
Tue, 28 Nov 2023 08:36:57 GMT
2023-11-28 08:36:57 Succeed to create database[tpch_10g]

Steps to Reproduce

No response

Additional information

No response

Nov 28 '23 09:11 sukki37

等待repo中。

Nov 29 '23 03:11 nnsgmsone

目前分析下来的第一个导致变慢的地方是s3 read导致的，s3 read的耗时差距如下: 2023-11-30 14-50-57 的屏幕截图 2023-11-30 14-51-02 的屏幕截图通过获取explain analyze可以达到读的次数为48次，大致的耗时差距为: 48 * (100-10) = 4.32s。实际的差距为(83951 - 59299) =24.6s，继续分析其他的耗时差距。(此次对比的是lineorder的导入)

Nov 30 '23 06:11 nnsgmsone

通过监控发现，出现问题的时候的两台机器的stw时间特别夸张，具体的监控信息如下:

2023-11-30 15-45-07 的屏幕截图 2023-11-30 15-47-05 的屏幕截图具体的原因是慢的情况下，对象数目以及携程数目爆增。。这是慢的case的监控信息: 2023-11-30 15-59-18 的屏幕截图

2023-11-30 16-01-08 的屏幕截图这是快的时候的监控:

2023-11-30 16-01-34 的屏幕截图 2023-11-30 16-01-48 的屏幕截图

上面一个comment的s3 read变慢的原因可能本质也和这个有关系(可能不是s3抖动导致的)

Nov 30 '23 07:11 nnsgmsone

repro：https://github.com/matrixorigin/matrixone/actions/runs/7055204369/job/19205716138

Dec 01 '23 04:12 sukki37

进一步的分析后，发现load超时并不是某个job特有的现象，就算成功的job，也有个别load时间是不符合预期的。比如https://github.com/matrixorigin/matrixone/actions/runs/7081980921/job/19272638333。失败的job只是抽风的时间比较久，成功的job也有抽风的时间。。。目前的现象是这样，抽风的原因还未找到，不过抽风的时候现象是统一的，现象如下:

commit请求比较多
事务创建时间比较久(s级，甚至可以到5s这个级别)
内存分配和stw耗时更为夸张
dist-read数据暴增(几百激增到几千)

具体导致这个现象的root case还未找到，还在分析，下面是一些监控信息的截图: 超时load监控信息:

截图 2023-12-04 17-21-36

截图 2023-12-04 17-22-27

正常的load的监控信息:

截图 2023-12-04 17-28-51 截图 2023-12-04 17-29-05

Dec 04 '23 09:12 nnsgmsone

发现监控的信息存在一些误差，需要修改metrics的一些逻辑来得到真实的信息。。。

Dec 04 '23 10:12 nnsgmsone

截图 2023-12-05 14-14-16

截图 2023-12-05 14-14-38 截图 2023-12-05 14-07-42 修正metric的bug后，现在已经可以看到具体的耗时，主要的原因是因为build plan耗时过久导致的，为什么buid plan耗时这么久。还在分析。

Dec 05 '23 06:12 nnsgmsone

截图 2023-12-08 17-45-49 截图 2023-12-08 17-46-02 通过metric可以看到耗时都在buildload中，而且这个和stats没关系，stats的耗时是很短的

Dec 08 '23 09:12 nnsgmsone

Based on the COS logs provided by Tencent Cloud, cases of LOAD timeout have a data file object read time that is almost dozens of times longer than normal cases. However, the reasons for such significant fluctuations still need to be further identified.

As load customer an example:

Failed case: 企业微信截图_7cbd0325-2cd7-44d2-a990-54fdd29e35e7

Normal case: 企业微信截图_dde21237-763c-479d-aa7d-8b38a2ae741a

Dec 12 '23 01:12 sukki37

通过日志已经定位到是compileExternScan的readdir耗时过久导致的:

2023-12-14 07:36:42.696	
{"level":"INFO","time":"2023/12/14 07:36:42.696005 +0000","caller":"compile/compile.go:1728","msg":"read dir cost 1.211973043s"}



2023-12-14 07:37:58.008	
{"level":"INFO","time":"2023/12/14 07:37:58.008873 +0000","caller":"compile/compile.go:1728","msg":"read dir cost 10.089926363s"}
2023-12-14 07:40:03.791	
{"level":"INFO","time":"2023/12/14 07:40:03.791345 +0000","caller":"compile/compile.go:1728","msg":"read dir cost 1.308013093s"}
2023-12-14 07:43:04.237	
{"level":"INFO","time":"2023/12/14 07:43:04.237864 +0000","caller":"compile/compile.go:1728","msg":"read dir cost 1m17.615419651s"}
2023-12-14 07:43:55.060	
{"level":"INFO","time":"2023/12/14 07:43:55.059980 +0000","caller":"compile/compile.go:1728","msg":"read dir cost 4.198588925s"}

https://github.com/matrixorigin/matrixone/actions/runs/7205520534/job/19629426333

Dec 14 '23 09:12 nnsgmsone

看着像是cos的抖动

Dec 14 '23 09:12 nnsgmsone

继续排查中，在和腾讯云的一起排查

Dec 20 '23 10:12 nnsgmsone

继续排查中，在和腾讯云的一起排查

Dec 25 '23 10:12 nnsgmsone

继续排查中，在和腾讯云的一起排查

Dec 28 '23 10:12 nnsgmsone

继续排查中，在和腾讯云的一起排查

Jan 03 '24 10:01 nnsgmsone

no process

Jan 12 '24 10:01 nnsgmsone

load 的逻辑存在一些问题，后续修改load的逻辑

Jan 17 '24 10:01 nnsgmsone

load小文件s3请求文件数目过多的问题已经修正。

Jan 19 '24 06:01 nnsgmsone

跟莫尘确认了下，small file的问题修复后，应该可以测试验证一下了

Jan 23 '24 09:01 ouyuanning

问题已经存在

Jan 25 '24 15:01 aressu1985

目前load的s3逻辑存在割裂的情况，具体原因是因为每次操作s3的数据都会调用 GetForETL，可是该函数每次都会生成一个新的fs，该fs没有任何cache，因为mo的load存在大量的探测(获取文件大小，探测'\n')等操作，而且读取也是并行读取，会导致的大量冗余的s3操作，而且因为每次都使用的是不同的fs，所以也无法通过cache来减少对s3的操作。甚至因为每次都需要一个新的s3 fs，每次都还需要额外的证书验证等。理想情况可能是一批load的所有语句共享一个s3 fs，这样可能是更加好的。具体的修改需要和岳声讨论后在决定

Jan 26 '24 09:01 nnsgmsone

repro: https://github.com/matrixorigin/matrixone/actions/runs/7686506796/job/20946283428

http://175.178.192.213:30088/explore?panes=%7B%22Fu0%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-checkin-regression-14356%5C%22%7D%20%7C%3D%20%60no%20such%20table%20%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221706487573106%22,%22to%22:%221706488375230%22%7D%7D%7D&schemaVersion=1&orgId=1

Jan 29 '24 02:01 sukki37

通过日志分析，已经得出耗时的函数为ReadFileOffset(2024-01-28 16:25:07.930 {"level":"INFO","time":"2024/01/28 16:25:07.930616 +0000","caller":"compile/compile.go:1901","msg":"read file offset cost 15m33.571241633s"})，这个函数仅仅是获取一个fs，然后读一些数据，通过cos日志以及metric都可以发现s3 read的耗时并不久。然后s3 connect的metric特别奇怪，metric存在超过最大桶的情况，所以这个耗时应该是s3 connect导致的，具体的原因和解决办法还需要分析

915754480

Jan 29 '24 08:01 nnsgmsone

在帮赛赛看fuzzy filter的问题中

Feb 01 '24 10:02 nnsgmsone

no process

Feb 21 '24 13:02 nnsgmsone

no process

Feb 26 '24 10:02 nnsgmsone

处理数据正确性问题中

Feb 29 '24 10:02 nnsgmsone

no process

Mar 05 '24 10:03 nnsgmsone

no process

Mar 08 '24 10:03 nnsgmsone

no process

Mar 13 '24 10:03 nnsgmsone

matrixone matrixone copied to clipboard

[Bug]: Occasional Load Timeout in Merge Run TPCH Testing

Is there an existing issue for the same bug?

Environment

Actual Behavior

Expected Behavior

Steps to Reproduce

Additional information

matrixone
matrixone copied to clipboard