
[Bug]: lost connect when run SSB 100G Q4.3 no filter query

Open aressu1985 opened this issue 2 years ago • 51 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Environment

- Version or commit-id (e.g. v0.1.0 or 8b23a93): b8d83a2d1b04d2533242cf80a0a0fe671ed23fbb
- Hardware parameters:
- OS type:
- Others:

Actual Behavior

The SSB 100G Q4.3 no-filter query failed with a lost connection from MO; the suspicious error is:

2023/01/07 23:09:04.589954 +0800 INFO frontend/util.go:499 query trace status {"connection_id": 1097, "statement": "select task_id, task_metadata_id, task_metadata_executor, task_metadata_context, task_metadata_option, task_parent_id, task_status, task_runner, task_epoch, last_heartbeat, result_code, error_msg, create_at, end_at from mo_task.sys_async_task where task_status = 0 order by task_id", "status": "success", "span": {"trace_id": "399be8dd-8e9d-11ed-902e-b07b25f8b524", "kind": "statement"}, "session_info": "connectionId 1097"}
2023/01/07 23:09:05.131382 +0800 ERROR fileservice/local_etl_fs.go:73 error: file sys/logs/2023/01/07/rawlog/1673104144_7c4dccb4-4d3c-41f8-b482-5251dc7a41bf_ALL.csv already exists
github.com/matrixorigin/matrixone/pkg/fileservice.(*LocalETLFS).Write /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/fileservice/local_etl_fs.go:73
github.com/matrixorigin/matrixone/pkg/fileservice.(*FileServices).Write /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/fileservice/file_services.go:116
github.com/matrixorigin/matrixone/pkg/util/export.(*FSWriter).Write /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/util/export/fs_writer.go:134
github.com/matrixorigin/matrixone/pkg/util/export.(*FSWriter).WriteString /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/util/export/fs_writer.go:158
github.com/matrixorigin/matrixone/pkg/util/trace.batchCSVHandler.NewItemBatchHandler.func1 /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/util/trace/buffer_pipe.go:111
github.com/matrixorigin/matrixone/pkg/util/trace.batchCSVHandler.NewItemBatchHandler.func2 /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/util/trace/buffer_pipe.go:124
github.com/matrixorigin/matrixone/pkg/util/export.(*bufferExportReq).handle /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/util/export/batch_processor.go:124
github.com/matrixorigin/matrixone/pkg/util/export.(*MOCollector).doExport /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/util/export/batch_processor.go:359
2023/01/07 23:09:05.131477 +0800 ERROR fileservice/local_etl_fs.go:73 error: file sys/logs/2023/01/07/rawlog/1673104144_7c4dccb4-4d3c-41f8-b482-5251dc7a41bf_ALL.csv already exists
github.com/matrixorigin/matrixone/pkg/fileservice.(*LocalETLFS).Write /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/fileservice/local_etl_fs.go:73
github.com/matrixorigin/matrixone/pkg/fileservice.(*FileServices).Write /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/fileservice/file_services.go:116
github.com/matrixorigin/matrixone/pkg/util/export.(*FSWriter).Write /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/util/export/fs_writer.go:134
github.com/matrixorigin/matrixone/pkg/util/export.(*FSWriter).WriteString /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/util/export/fs_writer.go:158
github.com/matrixorigin/matrixone/pkg/util/trace.batchCSVHandler.NewItemBatchHandler.func1

2023/01/07 23:10:08.331652 +0800 ERROR hakeeper-client-backend morpc/backend.go:472 read from backend failed {"remote": "127.0.0.1:32001", "backend-id": "529afce7-b795-435c-ba76-8268aa8756a3", "error": "read tcp4 127.0.0.1:38894->127.0.0.1:32001: use of closed network connection"}
2023/01/07 23:10:08.331855 +0800 ERROR hakeeper-client-backend morpc/backend.go:477 read loop stopped {"remote": "127.0.0.1:32001", "backend-id": "529afce7-b795-435c-ba76-8268aa8756a3"}
2023/01/07 23:10:08.331914 +0800 ERROR hakeeper-client-backend [email protected]/session.go:496 close connection failed {"remote": "127.0.0.1:32001", "backend-id": "529afce7-b795-435c-ba76-8268aa8756a3", "session-id": 0, "error": "close tcp4 127.0.0.1:38894->127.0.0.1:32001: use of closed network connection"}
2023/01/07 23:10:08.331945 +0800 INFO hakeeper-client-backend morpc/backend.go:364 write loop stopped {"remote": "127.0.0.1:32001", "backend-id": "529afce7-b795-435c-ba76-8268aa8756a3"}
2023/01/07 23:10:08.396192 +0800 ERROR rpc-client[hakeeper-client([connectToHAKeeper])] morpc/client.go:334 gc inactive backends task stopped
2023/01/07 23:10:08.396249 +0800 ERROR morpc/message.go:41 error: invalid input: timeout has invalid deadline
github.com/matrixorigin/matrixone/pkg/common/morpc.RPCMessage.GetTimeoutFromContext /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/common/morpc/message.go:41
github.com/matrixorigin/matrixone/pkg/common/morpc.(*server).startWriteLoop.func1 /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/common/morpc/server.go:286
github.com/matrixorigin/matrixone/pkg/common/stopper.(*Stopper).doRunCancelableTask.func1 /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/common/stopper/stopper.go:259
2023/01/07 23:10:08.437851 +0800 ERROR hakeeper-client-backend morpc/backend.go:472 read from backend failed {"remote": "127.0.0.1:32001", "backend-id": "7d1ad83c-48e7-4675-92e5-9d3a37fc8838", "error": "read tcp4 127.0.0.1:51014->127.0.0.1:32001: use of closed network connection"}
2023/01/07 23:10:08.437999 +0800 ERROR hakeeper-client-backend morpc/backend.go:477 read loop stopped {"remote": "127.0.0.1:32001", "backend-id": "7d1ad83c-48e7-4675-92e5-9d3a37fc8838"}

mo-ssb-lostcon.tar.gz

Expected Behavior

No response

Steps to Reproduce

Run the SSB 100G Q4.3 no-filter test:
select year(d_datekey) as year, s_city, p_brand, sum(lo_revenue) - sum(lo_supplycost) as profit, c_region, s_nation, p_category
from lineorder
join date on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
join part on lo_partkey = p_partkey
group by year(d_datekey), s_city, p_brand, c_region, s_nation, p_category;

Additional information

No response

aressu1985 avatar Jan 08 '23 09:01 aressu1985

Tried this query again and it ran successfully. The memory cost is too high; probably the same issue as https://github.com/matrixorigin/matrixone/issues/7292

badboynt1 avatar Jan 10 '23 11:01 badboynt1

can not reproduce now. keep working

badboynt1 avatar Jan 13 '23 07:01 badboynt1

can not reproduce now. keep working

badboynt1 avatar Jan 17 '23 09:01 badboynt1

can not reproduce now. keep working

badboynt1 avatar Jan 20 '23 07:01 badboynt1

can not reproduce now. keep working

badboynt1 avatar Jan 28 '23 09:01 badboynt1

select year(d_datekey) as year, s_city, p_brand, sum(lo_revenue) - sum(lo_supplycost) as profit, 
c_region, s_nation, p_category 
from lineorder join dates on lo_orderdate = d_datekey join customer on lo_custkey = c_custkey join supplier on lo_suppkey = s_suppkey join part on lo_partkey = p_partkey 
group by year(d_datekey), s_city, p_brand, c_region, s_nation, p_category order by year, s_city, p_brand;

plan

This is Q4.3, and the plan is attached above.

fengttt avatar Jan 30 '23 04:01 fengttt

According to @badboynt1, for 1G data the query produces 5 million rows. The main memory consumption is the hash agg, which should use roughly 500MB of memory. For 100G, then, we would use 50GB. The following discussion is all based on this 50GB assumption.

There are many problems with current execution engine.

  1. For this query, the execution should be pipelined and the agg should be executed in parallel. So suppose we have 2 pipelines running the hash agg: each should consume 25GB, so the total is still 50GB.
  2. If we are doing a 2-stage agg, the first stage should not use more than 1GB (I would do 64MB or something) by streaming at the bottom (see the Go sketch after this comment).
  3. We should not make numerous small allocations.
  4. We do not have memory accounting for the hash agg.
  5. If we run out of memory budget, we should spill.

I am pretty sure all these apply to hash join as well. So we need to systematically take out all these problems.

fengttt avatar Jan 30 '23 04:01 fengttt
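To make point 2 in the list above concrete, here is a minimal, hypothetical Go sketch (not MatrixOne's actual operator code; all names such as partialAgg are invented) of a first-stage group-by that flushes its partial hash table downstream whenever a small memory budget is exceeded, so the bottom stage stays bounded no matter how many distinct groups the data has:

```go
package main

import "fmt"

// partialAgg is a first-stage (local) aggregator: it keeps a bounded
// hash table of group -> partial sums and flushes it downstream
// whenever the estimated memory use exceeds its budget.
type partialAgg struct {
	budgetBytes   int
	perGroupBytes int                     // rough estimate of bytes per hash-table entry
	groups        map[string]int64        // group key -> partial SUM
	emit          func(map[string]int64)  // downstream (second-stage) consumer
}

func newPartialAgg(budgetBytes, perGroupBytes int, emit func(map[string]int64)) *partialAgg {
	return &partialAgg{
		budgetBytes:   budgetBytes,
		perGroupBytes: perGroupBytes,
		groups:        make(map[string]int64),
		emit:          emit,
	}
}

// add accumulates one row; if adding a new group would exceed the memory
// budget, the current partial results are flushed and the table reset.
func (p *partialAgg) add(key string, value int64) {
	if _, ok := p.groups[key]; !ok &&
		(len(p.groups)+1)*p.perGroupBytes > p.budgetBytes {
		p.flush()
	}
	p.groups[key] += value
}

// flush hands the partial results to the second stage and resets state,
// keeping first-stage memory bounded regardless of the number of groups.
func (p *partialAgg) flush() {
	if len(p.groups) == 0 {
		return
	}
	p.emit(p.groups)
	p.groups = make(map[string]int64)
}

func main() {
	// Second stage: merges partial results into the final aggregation.
	final := make(map[string]int64)
	merge := func(part map[string]int64) {
		for k, v := range part {
			final[k] += v
		}
	}

	// Tiny budget so the flush path is exercised in this toy example.
	agg := newPartialAgg(64, 32, merge) // pretend 64-byte budget, ~32 bytes/group

	rows := []struct {
		key string
		val int64
	}{
		{"a", 1}, {"b", 2}, {"c", 3}, {"a", 4}, {"d", 5}, {"b", 6},
	}
	for _, r := range rows {
		agg.add(r.key, r.val)
	}
	agg.flush()

	fmt.Println(final) // map[a:5 b:8 c:3 d:5]
}
```

The second stage merges whatever partials the first stage emits, so correctness does not depend on when the first stage flushes; the budget only trades off how much pre-aggregation happens before the merge.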

[cn.frontend] GuestMmuLimitation = 1000000000

This means the mpool can allocate at most 1GB of memory in one session; otherwise the SQL statement will report an error.

badboynt1 avatar Feb 01 '23 06:02 badboynt1
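For illustration only (this is not the real mpool API; sessionPool and its methods are invented), a per-session cap like GuestMmuLimitation behaves roughly like the following Go sketch: every allocation is charged against the session's quota, and the statement fails with an error once the quota would be exceeded:

```go
package main

import (
	"errors"
	"fmt"
)

// sessionPool is a hypothetical per-session allocator guard: it does not
// manage memory itself, it only accounts for it against a fixed limit,
// mimicking the behaviour described for GuestMmuLimitation.
type sessionPool struct {
	limit int64 // e.g. 1_000_000_000 (~1GB) per session
	used  int64
}

var errOutOfQuota = errors.New("session memory quota exceeded")

// alloc charges sz bytes to the session; the query is expected to fail
// with an error once the quota is exhausted (no spilling here).
func (p *sessionPool) alloc(sz int64) error {
	if p.used+sz > p.limit {
		return fmt.Errorf("%w: used=%d, requested=%d, limit=%d",
			errOutOfQuota, p.used, sz, p.limit)
	}
	p.used += sz
	return nil
}

// free returns sz bytes to the session quota.
func (p *sessionPool) free(sz int64) { p.used -= sz }

func main() {
	p := &sessionPool{limit: 1_000_000_000} // 1GB, as in the config above

	// A hash agg that keeps growing eventually hits the limit and the
	// statement errors out instead of exhausting the machine's memory.
	for i := 0; ; i++ {
		if err := p.alloc(64 << 20); err != nil { // 64MB chunks
			fmt.Println("query fails after", i, "chunks:", err)
			return
		}
	}
}
```

This also shows why the cap alone is not a fix: the statement still fails once it hits the quota, which is the behaviour fengttt objects to below.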

testing GuestMmuLimitation config

badboynt1 avatar Feb 07 '23 02:02 badboynt1

This cannot be fixed/tested by config tweaks. Must fix the planner and executor.

fengttt avatar Feb 09 '23 20:02 fengttt

I'll fix this in 0.8.

nnsgmsone avatar Feb 20 '23 08:02 nnsgmsone

Let's convert this to feature #388, @aressu1985

nnsgmsone avatar Mar 08 '23 09:03 nnsgmsone

I will deal with the memory issue first

nnsgmsone avatar Mar 23 '23 09:03 nnsgmsone

Same as https://github.com/matrixorigin/matrixone/issues/7891

nnsgmsone avatar Mar 24 '23 10:03 nnsgmsone

Same as https://github.com/matrixorigin/matrixone/issues/7891

nnsgmsone avatar Mar 27 '23 10:03 nnsgmsone

Same as https://github.com/matrixorigin/matrixone/issues/7891

nnsgmsone avatar Mar 29 '23 10:03 nnsgmsone

Same as https://github.com/matrixorigin/matrixone/issues/7891

nnsgmsone avatar Apr 03 '23 10:04 nnsgmsone

Same as https://github.com/matrixorigin/matrixone/issues/7891

nnsgmsone avatar Apr 06 '23 09:04 nnsgmsone

Same as https://github.com/matrixorigin/matrixone/issues/7891

nnsgmsone avatar Apr 12 '23 10:04 nnsgmsone

Will handle this issue after the plan refactoring.

nnsgmsone avatar May 19 '23 13:05 nnsgmsone

#7891 is done. So this one should be done as well?

fengttt avatar Jun 06 '23 19:06 fengttt

Spill is needed for this to run well; on a 64GB machine, the memory usage is still too high... The root cause is that, for all of the data, we have to keep an in-memory copy in the form of a hash table plus key vectors, which is fairly redundant resource usage.

nnsgmsone avatar Aug 30 '23 02:08 nnsgmsone
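The comment above points at the classic grace-hash style fix: partition the groups by hash, keep only one partition's hash table and key vectors in memory at a time, and spill the rest to disk. A minimal, hypothetical Go sketch of that idea (file layout, encoding, and names such as spillAggregate are invented for illustration):

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
	"io"
	"os"
	"path/filepath"
)

type row struct {
	Key string
	Val int64
}

// partition assigns a group key to one of n spill partitions, so every row
// of a given group ends up in the same partition file.
func partition(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(n))
}

// spillAggregate computes SUM per key while holding only one partition's
// hash table in memory at a time: rows are scattered to per-partition files
// first, then each file is aggregated on its own and merged into the result.
func spillAggregate(rows []row, parts int, dir string) (map[string]int64, error) {
	// Phase 1: scatter rows to spill files (no hash table built yet).
	paths := make([]string, parts)
	encs := make([]*json.Encoder, parts)
	files := make([]*os.File, parts)
	for i := range paths {
		paths[i] = filepath.Join(dir, fmt.Sprintf("part-%d.spill", i))
		f, err := os.Create(paths[i])
		if err != nil {
			return nil, err
		}
		files[i], encs[i] = f, json.NewEncoder(f)
	}
	for _, r := range rows {
		if err := encs[partition(r.Key, parts)].Encode(r); err != nil {
			return nil, err
		}
	}
	for _, f := range files {
		f.Close()
	}

	// Phase 2: aggregate one partition at a time; each partition's hash
	// table only covers its own share of the groups.
	result := make(map[string]int64)
	for _, p := range paths {
		f, err := os.Open(p)
		if err != nil {
			return nil, err
		}
		dec := json.NewDecoder(f)
		groups := make(map[string]int64)
		for {
			var r row
			if err := dec.Decode(&r); err == io.EOF {
				break
			} else if err != nil {
				f.Close()
				return nil, err
			}
			groups[r.Key] += r.Val
		}
		f.Close()
		for k, v := range groups {
			result[k] += v
		}
	}
	return result, nil
}

func main() {
	rows := []row{{"a", 1}, {"b", 2}, {"a", 3}, {"c", 4}, {"b", 5}}
	res, err := spillAggregate(rows, 2, os.TempDir())
	if err != nil {
		panic(err)
	}
	fmt.Println(res) // map[a:4 b:7 c:4]
}
```

Because every row of a group lands in the same partition, each partition can be aggregated independently, so peak memory is roughly the full hash-table size divided by the number of partitions.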

@aunjgr Once spill is ready, this should be fine. If other operator optimizations are still needed at that point, transfer it back.

ouyuanning avatar Jan 02 '24 06:01 ouyuanning

not working on it today

aunjgr avatar Jan 25 '24 11:01 aunjgr

not working on it today

aunjgr avatar Jan 31 '24 13:01 aunjgr

not working on it today

aunjgr avatar Feb 05 '24 12:02 aunjgr

Working on the China Mobile IoT (中移物联) issue.

aunjgr avatar Feb 22 '24 12:02 aunjgr

not working on it today

aunjgr avatar Feb 28 '24 10:02 aunjgr

not working on it today

aunjgr avatar Mar 04 '24 12:03 aunjgr

not working on it today

aunjgr avatar Mar 08 '24 11:03 aunjgr