Fix(manual_compact):fix replica lose manual compact finished status after replica migrate bug
What problem does this PR solve?
#1665
What is changed and how does it work?
I add another string in meta_store (meta_store is a column families which persist pegasus_last_manual_compact_finish_time and so on).
So that we can read the value from meta_store when we recover a replica after replica migrate.
Checklist
Tests
- Manual test 1.Create table,and put some data in it. 2.Let table begin to do manual compaction 3.Finish manual compaction and view the replica relationship of this table. 4.Stop a node,and check the progress of compaction. 5.Wait a few minutes,and check if still have some replica can not show a Finish status.
Test Result
Pegasus-AdminCli-1.2.0 » manual-compaction query -a gns
Table gns manual compaction progress is 100
Pegasus-AdminCli-1.2.0 » manual-compaction query -a gns
Table gns manual compaction progress is 83
Pegasus-AdminCli-1.2.0 » manual-compaction query -a gns
Table gns manual compaction progress is 83
Pegasus-AdminCli-1.2.0 » manual-compaction query -a gns
Table gns manual compaction progress is 100
Code changes
- try to read
LAST_MANUAL_COMPACT_USED_TIMEfrom meta_store when DB exist. - Set
LAST_MANUAL_COMPACT_USED_TIMEto zero when DB not exist. - Set
LAST_MANUAL_COMPACT_USED_TIMEto the time replica last used. - Generate a new checkpoint after we change LAST_MANUAL_COMPACT_USED_TIME in func
do_manual_compact
I’m confused how can this patch fix this issue, I guess the reason the compaction progress not update is caused by the serving and the progress is not well updated after a node shutdown.
I’m confused how can this patch fix this issue, I guess the reason the compaction progress not update is caused by the serving and the progress is not well updated after a node shutdown.
Yes,the progress is not well updated after replica restart. And We can see the current logic how a replica judge itself finish compact this round or not:
if (start_time_ms > 0) {
return dsn::replication::manual_compaction_status::RUNNING;
} else if (enqueue_time_ms > 0) {
return dsn::replication::manual_compaction_status::QUEUING;
} else if (last_time_used_ms > 0 && last_finish_time_ms > 0) {
return dsn::replication::manual_compaction_status::FINISHED;
} else {
return dsn::replication::manual_compaction_status::IDLE;
}
When replica restart or migrate,last_time_used_ms will loss.
And this replica will try to enter task queue again,however compaction's time checker prevent doing it util next day.
So I add LAST_MANUAL_COMPACT_USED_TIME and persist this parameter to colum families. So that this replica can be judged status finish after replica migrate.
I’m confused how can this patch fix this issue, I guess the reason the compaction progress not update is caused by the serving and the progress is not well updated after a node shutdown.
Yes,the progress is not well updated after replica restart. And We can see the current logic how a replica judge itself finish compact this round or not:
if (start_time_ms > 0) { return dsn::replication::manual_compaction_status::RUNNING; } else if (enqueue_time_ms > 0) { return dsn::replication::manual_compaction_status::QUEUING; } else if (last_time_used_ms > 0 && last_finish_time_ms > 0) { return dsn::replication::manual_compaction_status::FINISHED; } else { return dsn::replication::manual_compaction_status::IDLE; }When replica restart or migrate,
last_time_used_mswill loss. And this replica will try to enter task queue again,however compaction's time checker prevent doing it util next day. So I addLAST_MANUAL_COMPACT_USED_TIMEand persist this parameter to colum families. So that this replica can be judged statusfinishafter replica migrate.
How to comprehend compaction's time checker prevent doing it util next day, do you mean the logic in bool pegasus_manual_compact_service::check_manual_compact_state(), in this case, transfer the status from manual_compaction_status::IDLE to manual_compaction_status::FINISHED may help?
pegasus_manual_compact_service::check_manual_compact_state
Yes, pegasus_manual_compact_service::check_manual_compact_state() will prevent compaction log doing with this replica util next day.
So, if we have the parameter last_finish_time_ms, pegasus_manual_compact_service::query_compact_status will response the caller that 'this replica is FINISH manual compaction'.
In other words, I do not think we should transfer the status,just need tell the caller when it send a query request.