matrixone
[Bug]: Occasional Load Timeout in Merge Run TPCH Testing
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Environment
- Version or commit-id (e.g. v0.1.0 or 8b23a93): 85d55d5
- Hardware parameters:
- OS type:
- Others:
Actual Behavior
Occasional load timeouts occur in the TPCH test during a merge run. https://github.com/matrixorigin/matrixone/actions/runs/7015043809/job/19084522575
Abnormal situation
2023-11-28 07:20:13 load data url s3option {'endpoint'='cos.ap-chengdu.myqcloud.com','access_key_id'='***','secret_access_key'='***','bucket'='mo-load-data-1308875761', 'filepath'='tpch_10/part.tbl'} into table tpch_10g.part fields terminated by '|' lines terminated by '\n' parallel 'true';
2023-11-28 07:21:03 The data for table tpch_10g.part has been loaded successfully, and cost: 49643
2023-11-28 07:21:04 The table tpch_10g.part has been flushed
2023-11-28 07:21:04 This test for [s3_part_tpch_10g] has been executed successfully
2023-11-28 07:21:04 [s3_partsupp_tpch_10g]
2023-11-28 07:21:04 Start to create database[tpch_10g]
2023-11-28 07:21:04 Succeed to create database[tpch_10g]
2023-11-28 07:21:04 Start to drop table [tpch_10g.partsupp].
2023-11-28 07:21:04 Succeed to drop table [tpch_10g.partsupp]
2023-11-28 07:21:04 Start to create table [tpch_10g.partsupp]
2023-11-28 07:21:04 Succeed to create table [tpch_10g.partsupp]
2023-11-28 07:21:10 Start to load data from file tpch_10/partsupp.tbl into table tpch_10g.partsupp,please wait.....
2023-11-28 07:21:10 load data url s3option {'endpoint'='cos.ap-chengdu.myqcloud.com','access_key_id'='***','secret_access_key'='***','bucket'='mo-load-data-1308875761', 'filepath'='tpch_10/partsupp.tbl'} into table tpch_10g.partsupp fields terminated by '|' lines terminated by '\n' parallel 'true';
2023-11-28 07:21:47 The data for table tpch_10g.partsupp has been loaded successfully, and cost: 37139
2023-11-28 07:21:48 The table tpch_10g.partsupp has been flushed
Expected Behavior
2023-11-28 08:36:24 The data for table tpch_10g.nation has been loaded successfully, and cost: 9686
2023-11-28 08:36:25 The table tpch_10g.nation has been flushed
2023-11-28 08:36:25 This test for [s3_nation_tpch_10g] has been executed successfully
2023-11-28 08:36:25 [s3_orders_tpch_10g]
2023-11-28 08:36:25 Start to create database[tpch_10g]
2023-11-28 08:36:26 Succeed to create database[tpch_10g]
2023-11-28 08:36:26 Start to drop table [tpch_10g.orders].
2023-11-28 08:36:26 Succeed to drop table [tpch_10g.orders]
2023-11-28 08:36:26 Start to create table [tpch_10g.orders]
2023-11-28 08:36:27 Succeed to create table [tpch_10g.orders]
2023-11-28 08:36:33 Start to load data from file tpch_10/orders.tbl into table tpch_10g.orders,please wait.....
2023-11-28 08:36:33 load data url s3option {'endpoint'='cos.ap-chengdu.myqcloud.com','access_key_id'='***','secret_access_key'='***','bucket'='mo-load-data-1308875761', 'filepath'='tpch_10/orders.tbl'} into table tpch_10g.orders fields terminated by '|' lines terminated by '\n' parallel 'true';
2023-11-28 08:36:55 The data for table tpch_10g.orders has been loaded successfully, and cost: 22389
2023-11-28 08:36:56 The table tpch_10g.orders has been flushed
2023-11-28 08:36:56 This test for [s3_orders_tpch_10g] has been executed successfully
2023-11-28 08:36:56 [s3_part_tpch_10g]
2023-11-28 08:36:56 Start to create database[tpch_10g]
2023-11-28 08:36:57 Succeed to create database[tpch_10g]
Steps to Reproduce
No response
Additional information
No response
Waiting for a repro.
The first source of slowdown identified so far is s3 read; the gap in s3 read latency is as follows:
From explain analyze, the number of reads is 48, so the expected latency gap is roughly 48 * (100 - 10)ms = 4.32s. The actual gap is (83951 - 59299)ms ≈ 24.6s, so other sources of latency are still being analyzed. (This comparison is for the lineorder load.)
Monitoring shows that when the problem occurs, the STW time on the two machines involved is extreme; the monitoring details are as follows:
Specifically, in the slow case the object count and the goroutine count spike.
Monitoring for the slow case:
Monitoring for the fast case:
The s3 read slowdown in the previous comment may fundamentally be related to this as well (it may not be caused by s3 jitter).
repro: https://github.com/matrixorigin/matrixone/actions/runs/7055204369/job/19205716138
After further analysis, the load timeout turns out not to be specific to any single job; even successful jobs have individual loads whose times are not as expected, for example https://github.com/matrixorigin/matrixone/actions/runs/7081980921/job/19272638333. Failed jobs just have longer spikes; successful jobs have spikes too. The cause of the spikes has not been found yet, but the symptoms during a spike are consistent:
- commit requests spike
- transaction creation takes a long time (seconds, even up to the 5s level)
- memory allocation and STW times are even more extreme
- dist-read counts surge (from a few hundred to a few thousand)
The root cause of this behavior has not been found yet and is still being analyzed. Below are some monitoring screenshots. Monitoring for a timed-out load:
Monitoring for a normal load:
The monitoring data turns out to contain some inaccuracies; some of the metrics logic needs to be changed to get the real numbers.
After fixing the metrics bug, the actual time breakdown is now visible. The main cost is that build plan takes too long; why build plan takes so long is still under analysis.
The metrics show that the time is all spent in buildload, and this has nothing to do with stats; the stats step is very fast.
Based on the COS logs provided by Tencent Cloud, in LOAD-timeout cases the read time for the data file object is dozens of times longer than in normal cases. However, the reasons for such significant fluctuations still need to be identified.
Take loading the customer table as an example:
Failed case:
Normal case:
The logs have pinpointed the cause: the readdir in compileExternScan takes too long:
2023-12-14 07:36:42.696
{"level":"INFO","time":"2023/12/14 07:36:42.696005 +0000","caller":"compile/compile.go:1728","msg":"read dir cost 1.211973043s"}
2023-12-14 07:37:58.008
{"level":"INFO","time":"2023/12/14 07:37:58.008873 +0000","caller":"compile/compile.go:1728","msg":"read dir cost 10.089926363s"}
2023-12-14 07:40:03.791
{"level":"INFO","time":"2023/12/14 07:40:03.791345 +0000","caller":"compile/compile.go:1728","msg":"read dir cost 1.308013093s"}
2023-12-14 07:43:04.237
{"level":"INFO","time":"2023/12/14 07:43:04.237864 +0000","caller":"compile/compile.go:1728","msg":"read dir cost 1m17.615419651s"}
2023-12-14 07:43:55.060
{"level":"INFO","time":"2023/12/14 07:43:55.059980 +0000","caller":"compile/compile.go:1728","msg":"read dir cost 4.198588925s"}
https://github.com/matrixorigin/matrixone/actions/runs/7205520534/job/19629426333
This looks like COS jitter.
Still investigating, together with Tencent Cloud.
No progress.
There are some problems in the load logic; it will be revised as a follow-up.
The problem of issuing too many s3 requests when loading small files has been fixed.
Confirmed with Mochen: with the small-file fix in, this should be ready for test verification.
The problem still exists.
The current s3 logic in load is fragmented. The root cause is that every s3 operation calls GetForETL, and that function creates a brand-new fs each time, with no cache at all. Because mo's load does a lot of probing (getting the file size, probing for '\n', etc.) and also reads in parallel, this produces a large number of redundant s3 operations; and since a different fs is used every time, caching cannot reduce the number of s3 calls either. Worse, because a new s3 fs is needed each time, every call also incurs extra credential verification and similar overhead. Ideally, all statements in one load batch would share a single s3 fs, which would likely be much better. The concrete change needs to be discussed with Yuesheng before deciding.
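A minimal sketch of the shared-fs idea (the `FileService` interface, `s3FS`, and the batch key here are hypothetical stand-ins, not matrixone's actual API): instead of constructing a fresh, uncached fs in every GetForETL call, a batch-scoped cache hands out one shared instance so probes and parallel reads reuse it:

```go
package main

import (
	"fmt"
	"sync"
)

// FileService is a stand-in for matrixone's file-service interface.
type FileService interface {
	Name() string
}

type s3FS struct{ name string }

func (f *s3FS) Name() string { return f.name }

// fsCache hands out one shared FileService per load batch, so that
// repeated probes (file size, '\n' detection) and parallel reads
// reuse the same client and its cache instead of re-creating an
// uncached fs (and re-verifying credentials) on every call.
type fsCache struct {
	mu  sync.Mutex
	fss map[string]FileService
}

func newFSCache() *fsCache {
	return &fsCache{fss: make(map[string]FileService)}
}

// GetForETL returns the cached fs for batchKey, creating it once.
func (c *fsCache) GetForETL(batchKey string) FileService {
	c.mu.Lock()
	defer c.mu.Unlock()
	if fs, ok := c.fss[batchKey]; ok {
		return fs
	}
	fs := &s3FS{name: batchKey} // expensive setup happens once per batch
	c.fss[batchKey] = fs
	return fs
}

func main() {
	cache := newFSCache()
	a := cache.GetForETL("load-batch-1")
	b := cache.GetForETL("load-batch-1")
	fmt.Println(a == b) // true: all statements in the batch share one fs
}
```

The real fix would of course have to thread the batch identity through the load path and decide the fs lifetime, which is the part that needs the discussion mentioned above.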
repro: https://github.com/matrixorigin/matrixone/actions/runs/7686506796/job/20946283428
http://175.178.192.213:30088/explore?panes=%7B%22Fu0%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-checkin-regression-14356%5C%22%7D%20%7C%3D%20%60no%20such%20table%20%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221706487573106%22,%22to%22:%221706488375230%22%7D%7D%7D&schemaVersion=1&orgId=1
Log analysis shows the slow function is ReadFileOffset (2024-01-28 16:25:07.930 {"level":"INFO","time":"2024/01/28 16:25:07.930616 +0000","caller":"compile/compile.go:1901","msg":"read file offset cost 15m33.571241633s"}). This function merely obtains an fs and reads some data, and both the COS logs and the metrics show that the s3 read itself is not slow. The s3 connect metric, however, is very odd: it has observations overflowing the largest histogram bucket, so the time is most likely spent in s3 connect. The exact cause and the fix still need analysis.
Helping Saisai look into the fuzzy filter issue.
No progress.
No progress.
Working on a data correctness issue.
No progress.
No progress.
No progress.