matrixone icon indicating copy to clipboard operation
matrixone copied to clipboard

[Bug]: [date 12.29]standalone regression: update 60million rows resume cdc to mysql cost about 3hours

Open heni02 opened this issue 1 year ago • 3 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Branch Name

2.0-dev

Commit ID

8d91524f1fb38c544eaee7caf23ecc18b49d4073

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

整个同步阶段的cdc metrics: https://shanghai.idc.matrixorigin.cn:30001/d/de1wtrqxri6f4b/frontend-metrics?orgId=1&var-interval=1m&var-datasource=prometheus-standalone&var-instance=10.222.1.128:7001&from=1735542622343&to=1735555287091 企业微信截图_22dad7d4-a333-4753-b22e-cecfa3bfc294

15:21-15:45时间段:create cdc同步6千万的表(cdc性能比较预期) 15:49-18:28时间段:下游同步完6千万数据后,update6千万数据,resume cdc该表 (cdc性能不预期)

测试步骤见Steps to Reproduce

mo log: https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22GYP%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bhost%3D%5C%2210-222-1-128%5C%22,%20filename%3D%5C%22%2Fdata1%2Frunners%2Faction-runner%2F_work%2Fmo-nightly-regression%2Fmo-nightly-regression%2Fhead%2Fmo-service-8d91524f1-20241229-222811.log%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221735544745258%22,%22to%22:%221735554591076%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

mysql -h 127.0.0.1 -uac1:admin -P6001 -p123 -e "CREATE TABLE test_db.LINEITEM( L_ORDERKEY    BIGINT NOT NULL, L_PARTKEY     INTEGER NOT NULL, L_SUPPKEY     INTEGER NOT NULL, L_LINENUMBER  INTEGER NOT NULL, L_QUANTITY    DECIMAL(15,2) NOT NULL, L_EXTENDEDPRICE  DECIMAL(15,2) NOT NULL, L_DISCOUNT    DECIMAL(15,2) NOT NULL, L_TAX         DECIMAL(15,2) NOT NULL, L_RETURNFLAG  VARCHAR(1) NOT NULL, L_LINESTATUS  VARCHAR(1) NOT NULL, L_SHIPDATE    DATE NOT NULL, L_COMMITDATE  DATE NOT NULL, L_RECEIPTDATE DATE NOT NULL, L_SHIPINSTRUCT CHAR(25) NOT NULL, L_SHIPMODE     CHAR(10) NOT NULL, L_COMMENT      VARCHAR(44) NOT NULL, PRIMARY KEY (L_ORDERKEY, L_LINENUMBER) );"
  mysql -h 127.0.0.1 -uac1:admin -P6001 -p123 -e "load data infile '/data/mo-load-data/data/tpch_10/lineitem.tbl' into table test_db.lineitem  FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' parallel 'true';"
  cd $GITHUB_WORKSPACE/tool/mo-backup/
  echo "Create cdc_resume_6kw cdc test"
  ./mo_cdc task create --task-name "cdc_resume_6kw" --source-uri="***127.0.0.1:6001" --sink-type="mysql" --sink-uri="***10.222.1.129:3306" --tables='test_db.lineitem:cdc_regression_db.lineitem' --level="account" --account="ac1"
  sleep 1400s
  echo "Show cdc_resume_6kw cdc test status"
  ./mo_cdc task show --task-name "cdc_resume_6kw" --source-uri="***127.0.0.1:6001"
  echo "mysql data consistency verification,expect values=59986052"
  mysql -h 10.222.1.129 -u root -P3306 -pmo123456 -e "select count(*) from cdc_regression_db.lineitem;"
  echo "Pause cdc_resume_6kw cdc test"
  ./mo_cdc task pause --task-name "cdc_resume_6kw" --source-uri="***127.0.0.1:6001"
  # mo update data
  mysql -h 127.0.0.1 -uac1:admin -P6001 -p123 -e 'update test_db.lineitem set L_COMMENT="yyyyuuuu";'
  echo "Show cdc_resume_6kw cdc test status"
  ./mo_cdc task show --task-name "cdc_resume_6kw" --source-uri="***127.0.0.1:6001"
  echo "Resume cdc_resume_6kw cdc test"
  ./mo_cdc task resume --task-name "cdc_resume_6kw" --source-uri="***127.0.0.1:6001"
mysql -h 10.222.1.129 -u root -P3306 -pmo123456 -e 'select count(*) from cdc_regression_db.lineitem where L_COMMENT="yyyyuuuu";'

Additional information

No response

heni02 avatar Dec 30 '24 11:12 heni02

2.0.2在做mo_pubs/subs结构改造,没时间看,移到2.1.0

ck89119 avatar Jan 02 '25 02:01 ck89119

通过查看log得知,任务最开始是在执行存量同步,中间在做增量同步,增量同步没把数据同步完changes数据被gc了导致stale read,触发cdc任务重新启动又开始跑存量同步,所以上图中左右会有两个高峰,是在执行存量同步,中间低的是在跑增量同步。

任务期间下游mysql所在机器的磁盘监控,中间段磁盘的负载很低: image

重新跑了一次任务实时看了cpu负载,发现mysql进程的cpu是打满了: 企业微信截图_7b7b951d-6311-4d95-a175-f89ac92e3289

任务中间这段时间是在同步增量的数据变更,这里是update,cdc会把1条update语句变换为 delete + insert 2条sql, 生成的sql可能是这样:

delete from t1 where pk = 1;
insert into t1 values (1, whatever ...);
delete from t1 where pk = 2;
insert into t1 values (2, whatever ...);
...
delete from t1 where pk = n;
insert into t1 values (n, whatever ...);

对比同步存量数据生成的sql:

insert into t1 values (1, whatever ...), (2, whatever ...), ... (n, whatever ...);

sql语句大小相同的情况下,明显同步存量数据的sql中的有效数据会多很多,sql语句中insert into ...前缀只会出现一次,后面全是要插入的有效数据;而同步增量数据的sql,为保证数据的正确性,会出现deleteinsert交错的情况,无法把批量数据合并起来,sql语句中insertdelete前缀大量出现,真正有效的数据在sql语句中所占的比重非常低。

这大致能解释上面监控看到的现象:mysql把大量时间用于接收sql语句,parse sql语句,需要进行的IO操作反而不多,因此cpu负载很高,而磁盘反而没什么压力,这个场景的性能瓶颈在cpu。

在当前技术方案下,这属于下游cpu拉满的正常现象。

ck89119 avatar Jan 03 '25 09:01 ck89119

建议issue保留,之后版本或未来规划有方案提升性能

heni02 avatar Jan 16 '25 03:01 heni02

should be fixed

XuPeng-SH avatar Jul 02 '25 03:07 XuPeng-SH