[Bug]: use load_file() function to load stage data, mo report 'ERROR 20101 (HY000): internal error: Invalid alloc size 7775731712'.
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Branch Name
main
Commit ID
c678c5a6908fd9132443478123acc74ddc1653cb
Other Environment Information
- Hardware parameters:
- OS type:
- Others:
Actual Behavior
数据量:7.2G
log:https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22GYP%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-717cb94-202409191131%5C%22%7D%20%7C%3D%20%60Invalid%20alloc%20size%207775731712%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221726738859000%22,%22to%22:%221726739579000%22%7D%7D%7D&schemaVersion=1&orgId=1
Expected Behavior
load_file success
Steps to Reproduce
1. 创建基于minio的stage:
create stage stage01 url = 's3://stagetest/tpch_10' credentials = {"aws_key_id"='***',"aws_secret_key"='***',"AWS_REGION"='us-east-2','PROVIDER'='minio', 'ENDPOINT'='http://10.222.6.2'};
2. 启动mo分布式环境(128服务器上)
3. 通过load_file()函数加载数据
select load_file(cast('stage://stage01/lineitem.tbl' as datalink)); -->出错的sql
Additional information
No response
创建基于file system的stage也会出现这个问题:
在128服务器上: create stage stage01 url='file:///data/mo-load-data/data/tpch_10'; create database tpch_10g; CREATE TABLE LINEITEM( L_ORDERKEY BIGINT NOT NULL, L_PARTKEY INTEGER NOT NULL, L_SUPPKEY INTEGER NOT NULL, L_LINENUMBER INTEGER NOT NULL, L_QUANTITY DECIMAL(15,2) NOT NULL, L_EXTENDEDPRICE DECIMAL(15,2) NOT NULL, L_DISCOUNT DECIMAL(15,2) NOT NULL, L_TAX DECIMAL(15,2) NOT NULL, L_RETURNFLAG VARCHAR(1) NOT NULL, L_LINESTATUS VARCHAR(1) NOT NULL, L_SHIPDATE DATE NOT NULL, L_COMMITDATE DATE NOT NULL, L_RECEIPTDATE DATE NOT NULL, L_SHIPINSTRUCT CHAR(25) NOT NULL, L_SHIPMODE CHAR(10) NOT NULL, L_COMMENT VARCHAR(44) NOT NULL, PRIMARY KEY (L_ORDERKEY, L_LINENUMBER) );
select load_file(cast('stage://stage01/lineitem.tbl' as datalink)); -->报错ERROR 20101 (HY000): internal error: Invalid alloc size 7775731712
main commit: 2f86fead2
第一次执行报错OOM
第二次执行报错'internal error: Invalid alloc size 7775731712'
还没合入main,main需要等合进去测试
对于这个问题,建议首先在文档中建议客户不要使用load_file 一次性的导入过大的文件,例如:5G。
另外,对于这个issue 建议在2.1.0 版本修复。
波洋看是否可以分批导入
file data is actually loaded into the memory but memory pool inside the vector returns error.
The highlighted code raised error. Seems nothing we can do if limit of memory pool is 2G.
I think the right thing to read whole file content is to read by multiple chunks via datalink(offset, size) just like HNSW did.
select 0, cast('stage_url?offset=0&size=64000' as datalink); select 1, cast('stage_url?offset=64000&size=64000' as datalink);
just like multi-part in S3.