[Bug]: tpcc stability test report 'Duplicate entry '3a15013a15033a160d66' for key '__mo_cpkey_col'' on distributed mode

Open aressu1985 opened this issue 2 years ago • 13 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Branch Name

1.1-dev

Commit ID

9d120d4

Other Environment Information

- Hardware parameters:
3*CN: 16C 64G
1*DN: 16C 64G
3*LOG: 4C 16G
- OS type:
- Others:

Actual Behavior

During the stability test in distributed mode, the TPC-C test reported errors such as "Duplicate entry '3a15013a15033a160d66' for key '__mo_cpkey_col'".

tpcc-longrunning-test/mo-tpcc/benchmarksql-info.log:2023-12-20 14:03:33 FATAL jTPCCTerminal:325 - [UNEXPECTED][TT_NEW_ORDER][EXECUTION] ErrorCode : 1062, ErrorMessage : Duplicate entry '3a15013a15013a161336' for key '__mo_cpkey_col'
tpcc-longrunning-test/mo-tpcc/benchmarksql-info.log:2023-12-22 03:49:56 FATAL jTPCCTerminal:325 - [UNEXPECTED][TT_NEW_ORDER][EXECUTION] ErrorCode : 1062, ErrorMessage : Duplicate entry '(1,1,7575,1)' for key '__mo_cpkey_col'
tpcc-longrunning-test/mo-tpcc/benchmarksql-info.log:2023-12-22 03:49:56 FATAL jTPCCTerminal:325 - [UNEXPECTED][TT_NEW_ORDER][EXECUTION] ErrorCode : 1062, ErrorMessage : Duplicate entry '(1,1,7575)' for key '__mo_cpkey_col'
tpcc-longrunning-test/mo-tpcc/tpcc.log:2023-12-22 03:49:56 FATAL jTPCCTerminal:325 - [UNEXPECTED][TT_NEW_ORDER][EXECUTION] ErrorCode : 1062, ErrorMessage : Duplicate entry '(1,1,7575,1)' for key '__mo_cpkey_col'
tpcc-longrunning-test/mo-tpcc/tpcc.log:2023-12-22 03:49:56 FATAL jTPCCTerminal:325 - [UNEXPECTED][TT_NEW_ORDER][EXECUTION] ErrorCode : 1062, ErrorMessage : Duplicate entry '(1,1,7575)' for key '__mo_cpkey_col'
tpcc-longrunning-test/mo-tpcc/benchmarksql-error-1-10.log:2023-12-20 14:03:33 FATAL jTPCCTerminal:325 - [UNEXPECTED][TT_NEW_ORDER][EXECUTION] ErrorCode : 1062, ErrorMessage : Duplicate entry '3a15013a15013a161336' for key '__mo_cpkey_col'
tpcc-longrunning-test/mo-tpcc/benchmarksql-error-1-10.log:2023-12-22 03:49:56 FATAL jTPCCTerminal:325 - [UNEXPECTED][TT_NEW_ORDER][EXECUTION] ErrorCode : 1062, ErrorMessage : Duplicate entry '(1,1,7575,1)' for key '__mo_cpkey_col'
tpcc-longrunning-test/mo-tpcc/benchmarksql-error-1-10.log:2023-12-22 03:49:56 FATAL jTPCCTerminal:325 - [UNEXPECTED][TT_NEW_ORDER][EXECUTION] ErrorCode : 1062, ErrorMessage : Duplicate entry '(1,1,7575)' for key '__mo_cpkey_col'

mo-log:

Expected Behavior

No response

Steps to Reproduce

Run the stability test in distributed mode.

Additional information

No response

aressu1985 avatar Jan 12 '24 06:01 aressu1985

@nnsgmsone please take a look at this issue

XuPeng-SH avatar Jan 15 '24 02:01 XuPeng-SH

Please send me the logs, @aressu1985

nnsgmsone avatar Jan 17 '24 01:01 nnsgmsone

mo-log: http://10.222.6.1/explore?panes=%7B%22Aoi%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-stability-regression-20240112%5C%22%7D%20%7C%3D%20%603a15013a15033a160d66%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221704992471641%22,%22to%22:%221704998829377%22%7D%7D%7D&schemaVersion=1&orgId=1

aressu1985 avatar Jan 17 '24 03:01 aressu1985

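For the dup case, my next step is to add an instrumentation point that forces a core dump, so that we capture a snapshot of the whole MO process. How exactly to take the dump still needs a bit of design.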

nnsgmsone avatar Jan 22 '24 01:01 nnsgmsone

Still busy with production bugs; this has not been handled yet.

nnsgmsone avatar Jan 25 '24 10:01 nnsgmsone

Design and implementation in progress.

nnsgmsone avatar Jan 30 '24 10:01 nnsgmsone

Design and implementation in progress.

nnsgmsone avatar Feb 02 '24 10:02 nnsgmsone

No progress.

nnsgmsone avatar Feb 21 '24 13:02 nnsgmsone

No progress.

nnsgmsone avatar Feb 26 '24 10:02 nnsgmsone

Working on data correctness issues.

nnsgmsone avatar Feb 29 '24 10:02 nnsgmsone

No progress.

nnsgmsone avatar Mar 05 '24 10:03 nnsgmsone

No progress.

nnsgmsone avatar Mar 08 '24 10:03 nnsgmsone

No progress.

nnsgmsone avatar Mar 13 '24 10:03 nnsgmsone

No progress.

nnsgmsone avatar Mar 18 '24 11:03 nnsgmsone

No progress.

nnsgmsone avatar Mar 21 '24 10:03 nnsgmsone

Partially fixed.

triump2020 avatar Mar 26 '24 10:03 triump2020

dup/ww errors still occur in the distributed stability test; locating the cause.

triump2020 avatar Mar 31 '24 12:03 triump2020

Blocked by snapshot read.

triump2020 avatar Apr 09 '24 11:04 triump2020

Not working on this

triump2020 avatar Apr 13 '24 10:04 triump2020

https://github.com/matrixorigin/matrixone/pull/15545

triump2020 avatar Apr 16 '24 07:04 triump2020

Not working on this

triump2020 avatar Apr 22 '24 13:04 triump2020

https://github.com/matrixorigin/matrixone/pull/15731

triump2020 avatar Apr 25 '24 15:04 triump2020

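The cause is as follows:

  1. txn1 inserts a row with a given PK on CN1 and commits.
  2. txn2 on CN2 starts running delete pk before that PK has been synced into CN2's partition state. (The delete statement's snapshot ts must be smaller than txn1's commit ts; otherwise CN2 would wait for the water mark to pass txn1's commit ts.) At this point the rowid for the PK cannot be found, so the delete finishes with affected rows = 0, which means it had no effect. txn2 then runs insert pk. By the time the duplicate check runs, the PK committed by txn1 has been synced over, so the same PK is found in the partition state, which causes the dup error.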
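The interleaving described above can be replayed with a small, self-contained simulation. This is only a sketch under stated assumptions, not MatrixOne code: `PartitionState`, `delete_pk`, `insert_pk`, `replay`, and the timestamps 9 and 10 are hypothetical names and values chosen for illustration, and the duplicate check is modeled as looking at whatever the partition state currently holds, as the comment describes.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionState:
    """Hypothetical stand-in for a CN's local view of committed rows."""
    watermark: int = 0                        # highest commit ts applied locally
    rows: dict = field(default_factory=dict)  # pk -> commit ts

    def apply(self, pk, commit_ts):
        """Apply a committed insert that has been synced from another CN."""
        self.rows[pk] = commit_ts
        self.watermark = max(self.watermark, commit_ts)

    def visible(self, pk, snapshot_ts):
        """A row is visible only if it was committed at or before snapshot_ts."""
        ts = self.rows.get(pk)
        return ts is not None and ts <= snapshot_ts


def delete_pk(state, pk, snapshot_ts):
    """Return affected rows: 0 when no row id is found at snapshot_ts."""
    return 1 if state.visible(pk, snapshot_ts) else 0


def insert_pk(state, pk):
    """Duplicate check against the current contents of the partition state."""
    if pk in state.rows:
        raise ValueError(f"Duplicate entry '{pk}' for key '__mo_cpkey_col'")


def replay():
    """Replay the delete-then-insert sequence of txn2 on CN2."""
    cn2 = PartitionState()
    pk = (1, 1, 7575)
    txn1_commit_ts = 10    # assumed: txn1 committed on CN1 at ts 10
    txn2_snapshot_ts = 9   # assumed: txn2's snapshot is older than that commit

    # The delete runs before txn1's row has reached CN2, so nothing is found.
    affected = delete_pk(cn2, pk, txn2_snapshot_ts)

    # txn1's row is synced into CN2's partition state between the statements.
    cn2.apply(pk, txn1_commit_ts)

    # The insert's duplicate check now sees the row the delete could not see.
    try:
        insert_pk(cn2, pk)
        error = None
    except ValueError as exc:
        error = str(exc)
    return affected, error
```

Calling `replay()` in this model returns zero affected rows for the delete together with a duplicate-entry error from the insert, which matches the symptom reported by the TT_NEW_ORDER transactions. The model deliberately ignores everything else a real CN does (locking, workspace writes, retries).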

triump2020 avatar May 06 '24 15:05 triump2020

https://github.com/matrixorigin/matrixone/pull/15948

triump2020 avatar May 09 '24 09:05 triump2020

https://github.com/matrixorigin/matrixone/pull/15992 fixed dup/ww bug.

triump2020 avatar May 11 '24 04:05 triump2020

Waiting for @ouyuanning's PR.

triump2020 avatar May 16 '24 10:05 triump2020

Waiting for @ouyuanning's PR.

triump2020 avatar May 21 '24 15:05 triump2020

Waiting for @ouyuanning's PR.

triump2020 avatar May 24 '24 10:05 triump2020

In testing

triump2020 avatar May 30 '24 11:05 triump2020

Will continue investigating once the issue has been reproduced with @ouyuanning's PR.

triump2020 avatar Jun 04 '24 12:06 triump2020