[Bug]: restore from pitr report 'FATAL' error.
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Branch Name
main
Commit ID
b6cb831d345a76c55d851f613384b49e9509148b
Other Environment Information
- Hardware parameters:
- OS type:
- Others:
Actual Behavior
tke测试环境,里面有数据大约2T 集群配置: https://github.com/matrixorigin/mo-nightly-regression/blob/main/.github/workflows/big-data-test.yml
pitr的创建信息(时间是UTC时间,UTC+8时间为2024-10-13 20:23:28):
恢复执行的sql:(执行这条sql的时间大约在UTC+8时间2024-10-15 13:06:07)
restore from pitr p01 "2024-10-14 05:22:29";(UTC时间,对应utc+8时间为"2024-10-14 13:22:29")
大约执行了5h+后报错
Lost connection时间点对应的日志里面有FATAL错误:
log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22iwM%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-big-data-20241013%5C%22%7D%20%7C%3D%20%60FATAL%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221728988680000%22,%22to%22:%221728988800000%22%7D%7D%7D&schemaVersion=1&orgId=1
Expected Behavior
No response
Steps to Reproduce
create pitr p01 range 1 'd';
restore from pitr p01 "2024-10-14 05:22:29";
Additional information
No response
similar to #19288, the result of the query based on the primary key returns 0 rows.
{"level":"FATAL","time":"2024/10/15 10:39:28.348482 +0000","name":"cn-service","caller":"incrservice/store_sql.go:153","msg":"BUG: read incr record invalid","service":"30663632-6334-6633-3961-313239613832","uuid":"30663632-6334-6633-3961-313239613832","fetch-sql":"select offset, step from mo_increment_columns where table_id = 272927 and col_name = '__mo_fake_pk_col' for update","account":0,"table":272927,"col":"__mo_fake_pk_col","rows":0,"cost":"63.832666ms","ctx-done":false,"stacktrace":"github.com/matrixorigin/matrixone/pkg/incrservice.(*sqlStore).Allocate.func2\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/store_sql.go:153\ngithub.com/matrixorigin/matrixone/pkg/sql/compile.(*sqlExecutor).ExecTxn\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/sql_executor.go:134\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*sqlStore).Allocate\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/store_sql.go:129\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*allocator).doAllocate\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/allocator.go:173\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*allocator).run\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/allocator.go:160\ngithub.com/matrixorigin/matrixone/pkg/common/stopper.(*Stopper).doRunCancelableTask.func1\n\t/go/src/github.com/matrixorigin/matrixone/pkg/common/stopper/stopper.go:277"}
The log indicates:
- The prolonged restore transaction
481c93f16e7a211117fe1ab03e64c04etook over five hours before it failed, involving a large workspace. - Despite the insert SQL statement(
insert into mo_increment_columns(table_id, col_name, col_index, offset, step) values(272927, '__mo_fake_pk_col', 25, 0, 1)) executing successfully less than 200ms earlier, the subsequent select SQL statement could not retrieve that record from the workspace.
This issue has not been reproduced locally. Let's see what the newly added log can tell upon reproduction in the big-data test
After conducting some tests, it seems that the issue is not related to S3 flushing. Instead, the txnOffset seems to be worthy of further investigation
No luck in reproducing the issue.
Same restore operation to repro this issue,but mo oom unfortunately, details in issue #19511
No repro yet, consider downgrading its severity
一致复现不了,先降级处理
job url: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/11575242906/job/32222238080
restore执行的sql:RESTORE ACCOUNT sys FROM SNAPSHOT sys_sp01
数据量:大约1.2T
log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22tFc%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-snapshot-test-20241029%5C%22%7D%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221730218087000%22,%22to%22:%221730230669000%22%7D%7D%7D&schemaVersion=1&orgId=1
fixed
fixed