matrixone icon indicating copy to clipboard operation
matrixone copied to clipboard

[Bug]: select median(column) from 10y table report 'Lost connection to MySQL server during query'.

Open Ariznawlll opened this issue 1 year ago • 7 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Branch Name

main

Commit ID

4170547615910e61bc5a4ae8e950fe4097703256

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

job url:(Quries: Aggr 1B test) https://github.com/matrixorigin/mo-nightly-regression/actions/runs/7690791843/job/20968567852

image

oom: image

log:http://175.178.192.213:30088/explore?panes=%7B%22G3A%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22branch-big-data-nightly-4170547%5C%22%7D%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221706529000000%22,%22to%22:%221706529600000%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

table ddl:
create table  if not exists big_data_test.table_basic_for_load_1B(
col1 tinyint,
col2 smallint,
col3 int,
col4 bigint,
col5 tinyint unsigned,
col6 smallint unsigned,
col7 int unsigned,
col8 bigint unsigned,
col9 float,
col10 double,
col11 varchar(255),
col12 Date,
col13 DateTime,
col14 timestamp,
col15 bool,
col16 decimal(16,6),
col17 text,
col18 json,
col19 blob,
col20 binary(255),
col21 varbinary(255),
col22 vecf32(3),
col23 vecf32(3),
col24 vecf64(3),
col25 vecf64(3)
);

select median ddl:
select median(col3) from big_data_test.table_basic_for_load_1B;
select median(col9) from big_data_test.table_basic_for_load_1B;

Additional information

No response

Ariznawlll avatar Jan 30 '24 03:01 Ariznawlll

seems like https://github.com/matrixorigin/matrixone/issues/14441

ouyuanning avatar Jan 30 '24 12:01 ouyuanning

@m-schen 麻烦先看一下目前median的实现

10亿个数求中位数,优先级是否可以降一降?

aunjgr avatar Feb 01 '24 10:02 aunjgr

无进展

m-schen avatar Feb 26 '24 11:02 m-schen

无进展。预计1.2无法解决这个问题,需要有比较完善的spill策略。

m-schen avatar Feb 29 '24 11:02 m-schen

该版本内如果能够将 median 每个group的 private structure变成[]byte由一个bytes vector维护大概可以解决该问题,会变成主动抛出out of memory 或者 memory not enough的错误。

但是不确定能否在该版本完成,目前聚合只能确定完成将result和empty等全由vector维护。

m-schen avatar Mar 05 '24 11:03 m-schen

同上一个评论,目前还没有做这部分。关于agg重构的部分当前只完成了result由内存池维护。

m-schen avatar Mar 08 '24 11:03 m-schen

同上一个评论,仍在处理agg重构中且目前没有做private structure的部分也由mpool维护的工作。

m-schen avatar Mar 13 '24 10:03 m-schen

目前我的分支上median函数的中间内容已由mpool维护,正常设置mpool的情况下将会对这种情况主动抛出内村不足的错误。

相关分支:https://github.com/m-schen/matrixone/tree/main-agg

m-schen avatar Mar 18 '24 11:03 m-schen

同上,该分支还在改bug中。具体进度可以看该分支最近的commit。

m-schen avatar Mar 26 '24 11:03 m-schen

同上,目前分支可以正常跑过bvt,不过还需要修改部分静态代码检查的问题,以及多节点下agg结构序列化的问题。

m-schen avatar Mar 29 '24 11:03 m-schen

目前agg分支测试已经可以通过,不过性能上还有点问题,正在调优。

m-schen avatar Apr 08 '24 12:04 m-schen

该问题在agg分支上已经得到了解决,会变成报错申请过大内存,不会crash. 不过分支还在做性能的优化。

m-schen avatar Apr 11 '24 10:04 m-schen

同上

m-schen avatar Apr 16 '24 12:04 m-schen

相关的pr已经合并到main分支和1.1分支,可以在这两个分支任意一个进行验证。 pr: https://github.com/matrixorigin/matrixone/pull/15227

预期会转换成内存过大的报错,类似Invalid alloc size 1242144768.

可以关掉,后续可以用类似的issue进行追踪,如https://github.com/matrixorigin/matrixone/issues/12021等。

m-schen avatar Apr 19 '24 10:04 m-schen

testing

Ariznawlll avatar Apr 24 '24 10:04 Ariznawlll

testing

Ariznawlll avatar Apr 30 '24 06:04 Ariznawlll

testing

Ariznawlll avatar May 06 '24 04:05 Ariznawlll

如研发上述所说,现在的问题由 'Lost connection to MySQL server during query'变成了''internal error: Invalid alloc size 1147437056')',后续由别的issue跟踪. commit: 6b1a10d62ec53a54394120d8c6327c7886c1ce15 image image

Ariznawlll avatar May 08 '24 10:05 Ariznawlll