[Bug]: restore 2T data from snapshot report 'table does not exist'.
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Branch Name
main
Commit ID
cf5296b
Other Environment Information
- Hardware parameters:
- OS type:
- Others:
Actual Behavior
恢复了大约5h后报错 table does not exist
mysql> restore account sys from snapshot sp01;
ERROR 1064 (HY000): SQL parser error: table "table_with_pk_index_for_write_1b" does not exist
日志:https://grafana.ci.matrixorigin.cn/explore?panes=%7B%223wf%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-big-data-20241016%5C%22%7D%20%7C%3D%20%60sp01%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221729137600000%22,%22to%22:%221729155600000%22%7D%7D%7D&schemaVersion=1&orgId=1
快照读能读到数据:
Expected Behavior
No response
Steps to Reproduce
步骤:
create snapshot sp01 for account sys ;
drop database big_data_test;
restore account sys from snapshot sp01;
big_data_test中有28张表,数据量大约有2T,找不到的表table_with_pk_index_for_write_1b的schema:
create table if not exists big_data_test.table_with_pk_index_for_write_1B( id bigint primary key, col1 tinyint, col2 smallint, col3 int, col4 bigint, col5 tinyint unsigned, col6 smallint unsigned, col7 int unsigned, col8 bigint unsigned, col9 float, col10 double, col11 varchar(255), col12 Date, col13 DateTime, col14 timestamp, col15 bool, col16 decimal(16,6), col17 text, col18 json, col19 blob, col20 binary(255), col21 varbinary(255), col22 vecf32(3), col23 vecf32(3), col24 vecf64(3), col25 vecf64(3));
Additional information
No response
原因已大致定位到.
今天恢复也有这个问题:步骤与issue中提到的基本一样
mysql> select git_version();
+---------------+
| git_version() |
+---------------+
| 29a0c5d |
+---------------+
1 row in set (0.00 sec)
快照读能读到数据:
log:https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22-Ur%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-big-data-20241017%5C%22%7D%20%7C%3D%20%60txn%20is%20stale%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221729256400000%22,%22to%22:%221729260000000%22%7D%7D%7D&schemaVersion=1&orgId=1
PR is on the way!
PR 可能只解决了,导致这个问题的原因之一,但如果概率比较大,可能还有其他原因,需要加日志再复现下.
下午根据pitr恢复2T数据也报该错误
commit: 8d7e7b8
恢复执行的sql: restore from pitr p01 "2024-10-22 03:24:18.547701"
log:https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22MBn%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-big-data-20241021%5C%22%7D%20%7C%3D%20%60txn%20is%20stale%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221729587600000%22,%22to%22:%221729591200000%22%7D%7D%7D&schemaVersion=1&orgId=1
又完善了Log , 线上,线下同时在复现。 应该是其他原因,导致了这个问题.
复现步骤:
-
修改 以下配置: [tn.Ckp] flush-interval = "5s" min-count = 1 scan-interval = "5s" incremental-interval = "10s" global-min-count = 3
-
修改程序: gcPartitionStateTicker = 5 * time.Second gcPartitionStateTimer = 90 * time.Second
-
运行mo-service
-
运行 sql: 1>create table tpcc_1000.bmsql_order_line
2>load data url s3option {'endpoint'='http://cos.ap-guangzhou.myqcloud.com','access_key_id'='','secret_access_key'='','bucket'='mo-load-guangzhou-1308875761','filepath'='tpcc_1000/order-line.csv', 'compression'=''} into table tpcc_1000.bmsql_order_line fields terminated by ',' lines terminated by '\n' parallel 'true';
3> create snapshot 1; 4> drop database tpcc_1000; 5> restore account sys from snapshot sp01;
原因已定位,等待修复. 是 txn is stale 的错误,导致了报表找不到的错误. txn is stale 的原因是 partition state 的 minTs, start, end 的数据不一致导致.
又完善了Log , 线上,线下同时在复现。 应该是其他原因,导致了这个问题.
经过线下复现,原因就是第一个pr 所修复的,只是修复失败.
由Txn is stale 导致的 table not found 问题应该修复了,线下测试过好多次了. @Ariznawlll 请测试.
等待pr 合并
fixed
测试中
用飞哥的复现方式没有出现,测试步骤
复现步骤:
mysql> select git_version();
+---------------+
| git_version() |
+---------------+
| 700ee56cf |
+---------------+
1 row in set (0.00 sec)
修改 以下配置:
[tn.Ckp]
flush-interval = "5s"
min-count = 1
scan-interval = "5s"
incremental-interval = "10s"
global-min-count = 3
修改程序:
gcPartitionStateTicker = 5 * time.Second
gcPartitionStateTimer = 90 * time.Second
mysql> create table bmsql_order_line (
-> ol_w_id integer not null,
-> ol_d_id integer not null,
-> ol_o_id integer not null,
-> ol_number integer not null,
-> ol_i_id integer not null,
-> ol_delivery_d timestamp,
-> ol_amount decimal(6,2),
-> ol_supply_w_id integer,
-> ol_quantity integer,
-> ol_dist_info char(24),
-> primary key (ol_w_id, ol_d_id, ol_o_id, ol_number)
-> ) ;
Query OK, 0 rows affected (0.07 sec)
mysql> load data url s3option {'endpoint'='http://cos.ap-guangzhou.myqcloud.com','access_key_id'='***','secret_access_key'='***','bucket'='mo-load-guangzhou-1308875761','filepath'='tpcc_1000/order-line.csv', 'compression'=''} into table tpcc_1000.bmsql_order_line fields terminated by ',' lines terminated by '\n' parallel 'true';
Query OK, 300014864 rows affected (17 min 14.86 sec)
mysql> create snapshot sp01 for account sys;
Query OK, 0 rows affected (0.06 sec)
mysql> drop database tpcc_1000;
Query OK, 1 row affected (0.11 sec)
mysql> restore account sys from snapshot sp01;
Query OK, 0 rows affected (1 min 55.28 sec)
整个集群大约有2T数据,恢复集群测试:
main commit:967435e
恢复执行的sql:restore cluster from snapshot cluster_sp;
log:https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22tFc%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-snapshot-test-20241027%5C%22%7D%20%7C%3D%20%60table_with_pk_for_insert_100m%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221730183598993%22,%22to%22:%221730183688383%22%7D%7D%7D&schemaVersion=1&orgId=1
回归不通过
恢复account级别,account的数据量大约2T
mysql> select git_version(); +---------------+ | git_version() | +---------------+ | 5c6909f | +---------------+ 1 row in set (0.00 sec)
恢复成功
测试结论:
- 集群级别的恢复失败,整个集群的数据约2T;
- account级别的恢复成功,account的数据约2T.
@Ariznawlll Pls test!
测试中
1.66T集群恢复成功
2.37T数据量集群恢复成功