[Bug]: [date 12.29] standalone regression: update 60 million rows then resume cdc to mysql costs about 3 hours
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Branch Name
2.0-dev
Commit ID
8d91524f1fb38c544eaee7caf23ecc18b49d4073
Other Environment Information
- Hardware parameters:
- OS type:
- Others:
Actual Behavior
CDC metrics for the whole sync period:
https://shanghai.idc.matrixorigin.cn:30001/d/de1wtrqxri6f4b/frontend-metrics?orgId=1&var-interval=1m&var-datasource=prometheus-standalone&var-instance=10.222.1.128:7001&from=1735542622343&to=1735555287091
15:21-15:45: create cdc and sync the 60-million-row table (CDC performance is roughly as expected). 15:49-18:28: after the downstream has finished syncing the 60 million rows, update all 60 million rows and resume cdc on the table (CDC performance is not as expected).
See Steps to Reproduce below for the test steps.
mo log: https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22GYP%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bhost%3D%5C%2210-222-1-128%5C%22,%20filename%3D%5C%22%2Fdata1%2Frunners%2Faction-runner%2F_work%2Fmo-nightly-regression%2Fmo-nightly-regression%2Fhead%2Fmo-service-8d91524f1-20241229-222811.log%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221735544745258%22,%22to%22:%221735554591076%22%7D%7D%7D&schemaVersion=1&orgId=1
Expected Behavior
No response
Steps to Reproduce
mysql -h 127.0.0.1 -uac1:admin -P6001 -p123 -e "CREATE TABLE test_db.LINEITEM( L_ORDERKEY BIGINT NOT NULL, L_PARTKEY INTEGER NOT NULL, L_SUPPKEY INTEGER NOT NULL, L_LINENUMBER INTEGER NOT NULL, L_QUANTITY DECIMAL(15,2) NOT NULL, L_EXTENDEDPRICE DECIMAL(15,2) NOT NULL, L_DISCOUNT DECIMAL(15,2) NOT NULL, L_TAX DECIMAL(15,2) NOT NULL, L_RETURNFLAG VARCHAR(1) NOT NULL, L_LINESTATUS VARCHAR(1) NOT NULL, L_SHIPDATE DATE NOT NULL, L_COMMITDATE DATE NOT NULL, L_RECEIPTDATE DATE NOT NULL, L_SHIPINSTRUCT CHAR(25) NOT NULL, L_SHIPMODE CHAR(10) NOT NULL, L_COMMENT VARCHAR(44) NOT NULL, PRIMARY KEY (L_ORDERKEY, L_LINENUMBER) );"
mysql -h 127.0.0.1 -uac1:admin -P6001 -p123 -e "load data infile '/data/mo-load-data/data/tpch_10/lineitem.tbl' into table test_db.lineitem FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' parallel 'true';"
cd $GITHUB_WORKSPACE/tool/mo-backup/
echo "Create cdc_resume_6kw cdc test"
./mo_cdc task create --task-name "cdc_resume_6kw" --source-uri="***127.0.0.1:6001" --sink-type="mysql" --sink-uri="***10.222.1.129:3306" --tables='test_db.lineitem:cdc_regression_db.lineitem' --level="account" --account="ac1"
sleep 1400s
echo "Show cdc_resume_6kw cdc test status"
./mo_cdc task show --task-name "cdc_resume_6kw" --source-uri="***127.0.0.1:6001"
echo "mysql data consistency verification,expect values=59986052"
mysql -h 10.222.1.129 -u root -P3306 -pmo123456 -e "select count(*) from cdc_regression_db.lineitem;"
echo "Pause cdc_resume_6kw cdc test"
./mo_cdc task pause --task-name "cdc_resume_6kw" --source-uri="***127.0.0.1:6001"
# mo update data
mysql -h 127.0.0.1 -uac1:admin -P6001 -p123 -e 'update test_db.lineitem set L_COMMENT="yyyyuuuu";'
echo "Show cdc_resume_6kw cdc test status"
./mo_cdc task show --task-name "cdc_resume_6kw" --source-uri="***127.0.0.1:6001"
echo "Resume cdc_resume_6kw cdc test"
./mo_cdc task resume --task-name "cdc_resume_6kw" --source-uri="***127.0.0.1:6001"
mysql -h 10.222.1.129 -u root -P3306 -pmo123456 -e 'select count(*) from cdc_regression_db.lineitem where L_COMMENT="yyyyuuuu";'
Additional information
No response
2.0.2 is occupied with the mo_pubs/subs structural rework, so there is no time to look at this; moving it to 2.1.0.
From the logs: the task starts with a full (snapshot) sync and then switches to incremental sync. The incremental sync had not finished when the changes data was GC'd, causing a stale read; this triggered the CDC task to restart and run the full sync again. That is why the chart above shows two peaks, on the left and on the right, where full sync is running, and a low section in the middle where incremental sync is running.
Disk monitoring of the machine hosting the downstream MySQL during the task; in the middle section the disk load is very low:
Re-ran the task and watched the CPU load in real time; the mysql process's CPU is fully saturated:
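For reference, a minimal way to watch this live on the downstream host (assuming pidof and the sysstat tools pidstat/iostat are available there; these commands are illustrative and were not part of the test script):
# sample CPU usage of the mysqld process every 5 seconds
pidstat -u -p "$(pidof mysqld)" 5
# sample extended disk statistics every 5 seconds for comparison
iostat -x 5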
The middle section of the task is syncing incremental data changes, which here are updates. CDC converts 1 UPDATE statement into 2 SQL statements, a DELETE plus an INSERT, so the generated SQL looks roughly like this:
delete from t1 where pk = 1;
insert into t1 values (1, whatever ...);
delete from t1 where pk = 2;
insert into t1 values (2, whatever ...);
...
delete from t1 where pk = n;
insert into t1 values (n, whatever ...);
Compare this with the SQL generated when syncing the snapshot (full) data:
insert into t1 values (1, whatever ...), (2, whatever ...), ... (n, whatever ...);
For SQL statements of the same size, the snapshot-sync SQL clearly carries far more useful data: the insert into ... prefix appears only once, and everything after it is rows to be inserted. The incremental-sync SQL, to guarantee correctness, has to interleave DELETE and INSERT and cannot merge rows into batches, so the insert and delete prefixes are repeated over and over and the actual useful data makes up only a very small share of each statement.
This largely explains what the monitoring shows: MySQL spends most of its time receiving and parsing SQL statements while the amount of I/O it needs to do is relatively small, so CPU load is very high while the disk is under little pressure. In this scenario the performance bottleneck is the CPU.
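To put rough numbers on this (only the 60-million-row count comes from the test itself; the byte figures below are illustrative assumptions, not measurements from this run):
-- incremental path: two statements and two keyword prefixes per changed row
delete from t1 where pk = 1;             -- a few dozen bytes of SQL text, no row data
insert into t1 values (1, whatever ...); -- another prefix, then a single row of data
-- with 60 million updated rows this is on the order of 120 million statements
-- for the downstream to receive and parse
-- snapshot path: one prefix per batch, many rows of data per statement
insert into t1 values (1, whatever ...), (2, whatever ...), ... (n, whatever ...);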
Under the current technical approach, this is expected behavior: the downstream CPU is simply saturated.
Suggest keeping this issue open; a later release or a future plan may bring a way to improve this performance.
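One possible direction, shown only as an illustrative sketch (not something the current CDC does, and not a committed plan): when every delete+insert pair targets the same primary key and the sink is MySQL, consecutive pairs could in principle be collapsed into multi-row statements that keep the per-row delete-then-insert semantics, for example MySQL's replace into, which restores the batched shape of the snapshot SQL:
-- hypothetical batched form of the same changes; t1 and the values follow the examples above
replace into t1 values (1, whatever ...), (2, whatever ...), ... (n, whatever ...);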
should be fixed