[Bug]: Insert in transaction reports error `20101 (HY000): internal error: Invalid alloc size 1147486208`
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Environment
- Version or commit-id (e.g. v0.1.0 or 8b23a93): nightly-898ac286
- Hardware parameters:
- OS type:
- Others:
Actual Behavior
- Open a transaction
- In the transaction, insert records in batches of 1000; when the second batch is inserted, the error
20101 (HY000): internal error: Invalid alloc size 1147486208
is raised. The schema is as follows:
MySQL [mo_c]> desc mo_catalog.statement_cx;
+--------------+------------------+------+------+---------+-------+--------------------------------+
| Field        | Type             | Null | Key  | Default | Extra | Comment                        |
+--------------+------------------+------+------+---------+-------+--------------------------------+
| statement_id | VARCHAR(36)      | NO   | PRI  | NULL    |       |                                |
| account      | VARCHAR(300)     | NO   |      | NULL    |       |                                |
| response_at  | DATETIME(3)      | YES  |      | NULL    |       |                                |
| cx           | DECIMAL128(23)   | NO   |      | NULL    |       |                                |
| account_id   | INT UNSIGNED(32) | NO   | PRI  | NULL    |       | the account_id added by the mo |
+--------------+------------------+------+------+---------+-------+--------------------------------+
5 rows in set (0.30 sec)
Here is an example record:
('37cb68b3-61f4-11ee-ac82-eeab67a4d8c2','b6011b4d_75b2_4458_875c_42f6dd16d927','2023-10-03 13:53:23.105','0',9000001)
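The failing workload (repeated 1000-row batches inside one transaction) can be modeled with the sketch below. It only builds the multi-row INSERT text for each batch; the batch size, table, and column names come from the report, while the synthetic rows and helper names are hypothetical.

```python
from itertools import islice

BATCH_SIZE = 1000  # batch size reported in the issue

def batches(rows, size=BATCH_SIZE):
    """Yield successive fixed-size chunks of rows."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

def insert_stmt(batch):
    """Build one multi-row INSERT for mo_catalog.statement_cx."""
    values = ",".join("('%s','%s','%s','%s',%d)" % row for row in batch)
    return ("insert into mo_catalog.statement_cx"
            " (statement_id, account, response_at, cx, account_id)"
            " values " + values)

# 2500 synthetic rows -> batches of 1000, 1000, 500; the reported error
# fired while the second batch was executing inside one transaction.
rows = [("id-%07d" % i, "acct", "2023-10-03 13:53:23.105", "0", 9000001)
        for i in range(2500)]
stmts = [insert_stmt(b) for b in batches(rows)]
```

Each statement would then be executed in order on the same connection between begin and commit.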
Expected Behavior
Records are inserted successfully.
Steps to Reproduce
NA
Additional information
Log: http://47.97.80.230/explore?orgId=1&left=%7B%22datasource%22:%22P3C839627617F91E0%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22freetier-02%5C%22,%20app%3D%5C%22default-cn%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22P3C839627617F91E0%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D
Will try to reproduce tomorrow.
Could not reproduce using the following script: 12021.sql.zip
Just check the existing log here: http://47.97.80.230/explore?orgId=1&left=%7B%22datasource%22:%22P3C839627617F91E0%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22P3C839627617F91E0%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22%7Bapp%3D%5C%22default-cn%5C%22%7D%20%7C%3D%20%60Invalid%20alloc%20size%60%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%221696644000000%22,%22to%22:%221696672799000%22%7D%7D
Also could not reproduce it. Moreover, DAILY REGRESSION runs many Sysbench load data jobs every day; they also do batch inserts, each batch contains 10000 rows, and each record is larger than here, yet this problem has never shown up.
The logs do not show which statement triggers it, but every error in the log is Invalid alloc size 1147486208,
so it should be caused by the same SQL with the same batch size.
I am on leave today; will add some logging tomorrow and look again.
Upgraded to nightly-f5405008 and tried inserting several hundred thousand rows. The earlier failure, where invalid alloc size was reported at around 2000 rows, no longer occurs, but the insert eventually fails and the CN crashes:
lastState:
terminated:
containerID: containerd://a516cee87e3d2c58e0ddb03d63db09f56fb5950659c4f208e715c750abe89713
exitCode: 2
finishedAt: "2023-10-13T04:01:02Z"
reason: Error
startedAt: "2023-10-13T04:00:31Z"
The likely causes are:
- The transaction holds too much data
- update on duplicate uses too much memory
So for this bug:
- On the MO side, lower the priority for now and fix it together with https://github.com/matrixorigin/matrixone/issues/11793
- On the application side, stop inserting this much data in a single transaction, and switch to on duplicate ignore
Currently this error is usually caused by two things:
1. The data volume is large and the hash build blows up, similar to https://github.com/matrixorigin/MO-Cloud/issues/1414
2. An operator does not split the batch, similar to https://github.com/matrixorigin/matrixone/issues/11793
For part 1, I will confirm with 龙冉 next week whether it should be moved to the spill issue.
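As a rough illustration of cause 2 (an operator not splitting its batch): if one output batch is materialized as a single allocation, a large enough batch exceeds the allocator's size limit. The cap and per-row footprint below are assumed numbers, not MO's actual limits; the point is only that splitting keeps each allocation under the cap.

```python
MAX_ALLOC = 1 << 30   # assumed allocator cap (~1 GiB); 1147486208 bytes would exceed it
ROW_SIZE = 400        # assumed fixed per-row footprint in bytes

def split_batch(n_rows, row_size=ROW_SIZE, cap=MAX_ALLOC):
    """Split n_rows into sub-batches whose buffers each fit under cap."""
    per_batch = max(1, cap // row_size)
    sizes = []
    while n_rows > 0:
        take = min(n_rows, per_batch)
        sizes.append(take)
        n_rows -= take
    return sizes

# One 1147486208-byte allocation would fail; split into pieces that fit.
sizes = split_batch(1147486208 // ROW_SIZE)
assert all(s * ROW_SIZE <= MAX_ALLOC for s in sizes)
```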
The application was changed to drop the transaction and insert 1000 rows at a time. It ran fine for a while, but then the same error appeared again.
The fix has been merged.
The application temporarily stopped using on duplicate key update statements, which works around the problem.
It is caused by the left join before on duplicate key. Changing it to a right join, however, introduces correctness problems.
Once this issue https://github.com/matrixorigin/matrixone/issues/12176 is fixed, the following can also be used as a substitute for now:
drop database if exists db1;
create database db1;
use db1;
create table t1(a int primary key, b int);
insert into t1 values (1,1),(3,3);
insert into t1 select b.* from t1 right join (values row(1,1),row(2,2)) b(a,b) on t1.a=b.a where t1.a is null;
-- this statement only inserts rows whose pk does not already exist; semantically it is the same as insert ignore
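In miniature, the statement above is an anti-join: from the candidate rows, keep only those whose primary key is absent from t1, which is exactly insert ignore semantics. A small model of that filter:

```python
# Existing rows of t1, keyed by primary key a.
t1 = {1: 1, 3: 3}
# Candidate rows from values row(1,1), row(2,2).
candidates = [(1, 1), (2, 2)]

# The "where t1.a is null" anti-join: keep only unseen primary keys.
to_insert = [(a, b) for a, b in candidates if a not in t1]
t1.update(to_insert)  # only (2,2) is inserted; (1,1) is silently skipped
```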
#12176 turned out to be a false alarm; you can try that approach.
not working on it
on duplicate key is a blocking operator; it has to wait until spill is done so its state can be persisted.
This can only be handled after spill support is done.