matrixone icon indicating copy to clipboard operation
matrixone copied to clipboard

[Bug]: restore from pitr report 'FATAL' error.

Open Ariznawlll opened this issue 1 year ago • 6 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Branch Name

main

Commit ID

b6cb831d345a76c55d851f613384b49e9509148b

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

tke测试环境,里面有数据大约2T 集群配置: https://github.com/matrixorigin/mo-nightly-regression/blob/main/.github/workflows/big-data-test.yml

pitr的创建信息(时间是UTC时间,UTC+8时间为2024-10-13 20:23:28): image

恢复执行的sql:(执行这条sql的时间大约在UTC+8时间2024-10-15 13:06:07)
restore from pitr p01 "2024-10-14 05:22:29";(UTC时间,对应utc+8时间为"2024-10-14 13:22:29")

大约执行了5h+后报错 image

Lost connection时间点对应的日志里面有FATAL错误: 企业微信截图_e3304c9e-1ad0-4f88-a3d0-2cd91df60aa0

log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22iwM%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-big-data-20241013%5C%22%7D%20%7C%3D%20%60FATAL%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221728988680000%22,%22to%22:%221728988800000%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

create pitr p01 range 1 'd';

restore from pitr p01 "2024-10-14 05:22:29";

Additional information

No response

Ariznawlll avatar Oct 15 '24 10:10 Ariznawlll

similar to #19288, the result of the query based on the primary key returns 0 rows.

{"level":"FATAL","time":"2024/10/15 10:39:28.348482 +0000","name":"cn-service","caller":"incrservice/store_sql.go:153","msg":"BUG: read incr record invalid","service":"30663632-6334-6633-3961-313239613832","uuid":"30663632-6334-6633-3961-313239613832","fetch-sql":"select offset, step from mo_increment_columns where table_id = 272927 and col_name = '__mo_fake_pk_col' for update","account":0,"table":272927,"col":"__mo_fake_pk_col","rows":0,"cost":"63.832666ms","ctx-done":false,"stacktrace":"github.com/matrixorigin/matrixone/pkg/incrservice.(*sqlStore).Allocate.func2\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/store_sql.go:153\ngithub.com/matrixorigin/matrixone/pkg/sql/compile.(*sqlExecutor).ExecTxn\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/sql_executor.go:134\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*sqlStore).Allocate\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/store_sql.go:129\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*allocator).doAllocate\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/allocator.go:173\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*allocator).run\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/allocator.go:160\ngithub.com/matrixorigin/matrixone/pkg/common/stopper.(*Stopper).doRunCancelableTask.func1\n\t/go/src/github.com/matrixorigin/matrixone/pkg/common/stopper/stopper.go:277"}

sukki37 avatar Oct 15 '24 12:10 sukki37

The log indicates:

  • The prolonged restore transaction 481c93f16e7a211117fe1ab03e64c04e took over five hours before it failed, involving a large workspace.
  • Despite the insert SQL statement(insert into mo_increment_columns(table_id, col_name, col_index, offset, step) values(272927, '__mo_fake_pk_col', 25, 0, 1)) executing successfully less than 200ms earlier, the subsequent select SQL statement could not retrieve that record from the workspace.

This issue has not been reproduced locally. Let's see what the newly added log can tell upon reproduction in the big-data test

aptend avatar Oct 17 '24 10:10 aptend

After conducting some tests, it seems that the issue is not related to S3 flushing. Instead, the txnOffset seems to be worthy of further investigation

aptend avatar Oct 18 '24 10:10 aptend

No luck in reproducing the issue.

aptend avatar Oct 21 '24 10:10 aptend

Same restore operation to repro this issue,but mo oom unfortunately, details in issue #19511

Ariznawlll avatar Oct 22 '24 03:10 Ariznawlll

No repro yet, consider downgrading its severity

aptend avatar Oct 23 '24 10:10 aptend

一致复现不了,先降级处理

aressu1985 avatar Oct 24 '24 02:10 aressu1985

job url: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/11575242906/job/32222238080

restore执行的sql:RESTORE ACCOUNT sys FROM SNAPSHOT sys_sp01

数据量:大约1.2T

image image image

log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22tFc%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-snapshot-test-20241029%5C%22%7D%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221730218087000%22,%22to%22:%221730230669000%22%7D%7D%7D&schemaVersion=1&orgId=1

Ariznawlll avatar Oct 30 '24 02:10 Ariznawlll

fixed

XuPeng-SH avatar Nov 01 '24 12:11 XuPeng-SH

fixed

Ariznawlll avatar Nov 02 '24 15:11 Ariznawlll