incubator-pegasus icon indicating copy to clipboard operation
incubator-pegasus copied to clipboard

Fix(manual_compact):fix replica lose manual compact finished status after replica migrate bug

Open ninsmiracle opened this issue 2 years ago • 3 comments

What problem does this PR solve?

#1665

What is changed and how does it work?

I add another string in meta_store (meta_store is a column families which persist pegasus_last_manual_compact_finish_time and so on). So that we can read the value from meta_store when we recover a replica after replica migrate.

Checklist

Tests
  • Manual test 1.Create table,and put some data in it. 2.Let table begin to do manual compaction 3.Finish manual compaction and view the replica relationship of this table. 4.Stop a node,and check the progress of compaction. 5.Wait a few minutes,and check if still have some replica can not show a Finish status.

Test Result

Pegasus-AdminCli-1.2.0 » manual-compaction query -a gns
Table gns manual compaction progress is 100
Pegasus-AdminCli-1.2.0 » manual-compaction query -a gns
Table gns manual compaction progress is 83
Pegasus-AdminCli-1.2.0 » manual-compaction query -a gns
Table gns manual compaction progress is 83
Pegasus-AdminCli-1.2.0 » manual-compaction query -a gns
Table gns manual compaction progress is 100
Code changes
  • try to read LAST_MANUAL_COMPACT_USED_TIME from meta_store when DB exist.
  • Set LAST_MANUAL_COMPACT_USED_TIME to zero when DB not exist.
  • Set LAST_MANUAL_COMPACT_USED_TIME to the time replica last used.
  • Generate a new checkpoint after we change LAST_MANUAL_COMPACT_USED_TIME in func do_manual_compact

ninsmiracle avatar Oct 31 '23 07:10 ninsmiracle

I’m confused how can this patch fix this issue, I guess the reason the compaction progress not update is caused by the serving and the progress is not well updated after a node shutdown.

acelyc111 avatar Nov 17 '23 03:11 acelyc111

I’m confused how can this patch fix this issue, I guess the reason the compaction progress not update is caused by the serving and the progress is not well updated after a node shutdown.

Yes,the progress is not well updated after replica restart. And We can see the current logic how a replica judge itself finish compact this round or not:

if (start_time_ms > 0) {
    return dsn::replication::manual_compaction_status::RUNNING;
} else if (enqueue_time_ms > 0) {
    return dsn::replication::manual_compaction_status::QUEUING;
} else if (last_time_used_ms > 0 && last_finish_time_ms > 0) {
    return dsn::replication::manual_compaction_status::FINISHED;
} else {
    return dsn::replication::manual_compaction_status::IDLE;
}

When replica restart or migrate,last_time_used_ms will loss. And this replica will try to enter task queue again,however compaction's time checker prevent doing it util next day. So I add LAST_MANUAL_COMPACT_USED_TIME and persist this parameter to colum families. So that this replica can be judged status finish after replica migrate.

ninsmiracle avatar Nov 23 '23 06:11 ninsmiracle

I’m confused how can this patch fix this issue, I guess the reason the compaction progress not update is caused by the serving and the progress is not well updated after a node shutdown.

Yes,the progress is not well updated after replica restart. And We can see the current logic how a replica judge itself finish compact this round or not:

if (start_time_ms > 0) {
    return dsn::replication::manual_compaction_status::RUNNING;
} else if (enqueue_time_ms > 0) {
    return dsn::replication::manual_compaction_status::QUEUING;
} else if (last_time_used_ms > 0 && last_finish_time_ms > 0) {
    return dsn::replication::manual_compaction_status::FINISHED;
} else {
    return dsn::replication::manual_compaction_status::IDLE;
}

When replica restart or migrate,last_time_used_ms will loss. And this replica will try to enter task queue again,however compaction's time checker prevent doing it util next day. So I add LAST_MANUAL_COMPACT_USED_TIME and persist this parameter to colum families. So that this replica can be judged status finish after replica migrate.

How to comprehend compaction's time checker prevent doing it util next day, do you mean the logic in bool pegasus_manual_compact_service::check_manual_compact_state(), in this case, transfer the status from manual_compaction_status::IDLE to manual_compaction_status::FINISHED may help?

acelyc111 avatar Nov 23 '23 10:11 acelyc111

pegasus_manual_compact_service::check_manual_compact_state

Yes, pegasus_manual_compact_service::check_manual_compact_state() will prevent compaction log doing with this replica util next day. So, if we have the parameter last_finish_time_ms, pegasus_manual_compact_service::query_compact_status will response the caller that 'this replica is FINISH manual compaction'. In other words, I do not think we should transfer the status,just need tell the caller when it send a query request.

ninsmiracle avatar Mar 21 '24 08:03 ninsmiracle