feat(hummock/meta): Reliable task cancellation via heartbeats
I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.
What's changed and what's your intention?
Cancel compaction tasks and trigger new ones for their compaction group if no progress is observed in num_blocks_{sealed/uploaded} after a certain timeout.
To increase reliability, we retry task reporting (only for heartbeat timeouts) and ultimately fall back on directly triggering new tasks for the compaction group.
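As a loose illustration of the heartbeat progress check described above (the `TaskHeartbeat` struct and its fields are hypothetical names, not the exact types in this PR):

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-task progress record kept by the meta node.
struct TaskHeartbeat {
    num_blocks_sealed: u64,
    num_blocks_uploaded: u64,
    last_progress_at: Instant,
}

impl TaskHeartbeat {
    /// Record the latest counters; reset the timer only if they advanced.
    fn observe(&mut self, sealed: u64, uploaded: u64) {
        if sealed > self.num_blocks_sealed || uploaded > self.num_blocks_uploaded {
            self.num_blocks_sealed = sealed;
            self.num_blocks_uploaded = uploaded;
            self.last_progress_at = Instant::now();
        }
    }

    /// A task with no visible progress within `timeout` is considered
    /// expired; the meta node then cancels it and triggers a new task
    /// for its compaction group.
    fn is_expired(&self, timeout: Duration) -> bool {
        self.last_progress_at.elapsed() > timeout
    }
}
```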
Checklist
- [x] I have written necessary rustdoc comments
- [ ] I have added necessary unit tests and integration tests
- [ ] All checks passed in `./risedev check` (or alias, `./risedev c`)
Refer to a related PR or issue link (optional)
RFC: https://github.com/singularity-data/risingwave/issues/4457 Fixes: https://github.com/singularity-data/risingwave/issues/3677, https://github.com/singularity-data/risingwave/issues/4511
As pointed out in https://github.com/singularity-data/risingwave/issues/4379#issuecomment-1208772404, we actually need to retry `report_compact_task` in a loop until it succeeds, particularly in the compaction heartbeat loop, as it is the catch-all for reporting task status (covering the case where the ordinary routes for reporting success and failure both fail).
It remains to be decided whether we use an infinite loop or retry a fixed number of times. I would go with the latter and then, after the fixed retries, try to schedule the group manually as a robust, idempotent fail-safe, so that at least the group as a whole does not stall even if the SSTs that are inputs to the task whose report has failed remain permanently "trapped". If all of the above fails, we report an error and advise that the LSM tree may be permanently imbalanced.
As an even more robust solution, we could store tasks we have failed to report in a set and periodically try to report them until we finally succeed. This would allow us to eventually restore order to the system as long as reporting eventually succeeds (rather than only if reporting succeeds within 5 retries in a row). This may be a task for a future PR.
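A minimal sketch of the fixed-retry-then-fallback idea, assuming a tokio runtime; the closures stand in for `report_compact_task` and the manual group scheduling, and the retry count of 5 mirrors the example above:

```rust
use std::time::Duration;

/// Sketch only: retry reporting a fixed number of times, then fall back to
/// rescheduling the whole compaction group so it keeps making progress even
/// if this task's report is permanently lost.
async fn report_with_fallback<R, S>(mut report: R, mut reschedule_group: S)
where
    R: FnMut() -> bool,
    S: FnMut() -> bool,
{
    const MAX_REPORT_RETRIES: usize = 5;
    for _ in 0..MAX_REPORT_RETRIES {
        if report() {
            return;
        }
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
    // Idempotent fail-safe: schedule the group again so compaction as a
    // whole does not stall even if the task's input SSTs stay "trapped".
    if !reschedule_group() {
        eprintln!("report and group reschedule both failed; LSM tree may become imbalanced");
    }
}
```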
Codecov Report
Merging #4496 (ad67466) into main (a14d883) will decrease coverage by 0.04%. The diff coverage is 56.84%.
@@ Coverage Diff @@
## main #4496 +/- ##
==========================================
- Coverage 74.21% 74.16% -0.05%
==========================================
Files 882 882
Lines 134837 135120 +283
==========================================
+ Hits 100069 100214 +145
- Misses 34768 34906 +138
Flag | Coverage Δ
---|---
rust | 74.16% <56.84%> (-0.05%) :arrow_down:

Flags with carried forward coverage won't be shown.
Impacted Files | Coverage Δ
---|---
src/compute/src/server.rs | 0.00% <0.00%> (ø)
src/meta/src/hummock/compaction_scheduler.rs | 17.60% <0.00%> (-1.20%) :arrow_down:
src/meta/src/hummock/mock_hummock_meta_client.rs | 38.28% <0.00%> (-1.89%) :arrow_down:
src/meta/src/hummock/mod.rs | 61.83% <0.00%> (ø)
src/meta/src/rpc/service/hummock_service.rs | 5.92% <0.00%> (-0.27%) :arrow_down:
src/rpc_client/src/meta_client.rs | 0.00% <0.00%> (ø)
src/storage/compactor/src/server.rs | 0.00% <0.00%> (ø)
src/storage/src/hummock/hummock_meta_client.rs | 0.00% <0.00%> (ø)
src/storage/src/hummock/compactor/mod.rs | 73.78% <30.50%> (-3.04%) :arrow_down:
src/meta/src/hummock/compactor_manager.rs | 75.00% <63.57%> (-9.91%) :arrow_down:
... and 24 more
There is one more refactor we may consider in this PR, or later PR:
Currently `assign_compaction_task` executes both task assignment and task send. It was originally written this way to ease error handling, but it's problematic because:
1. It requires the compaction lock while task send is slow.
2. The task is sent before it is assigned (see `assign_compaction_task`).
3. It's essentially no safer than simply assigning and sending in two steps.
https://github.com/singularity-data/risingwave/blob/bb68c57903eb896c9d65e043f72d22923dbd9603/src/meta/src/hummock/compaction_scheduler.rs#L187
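A rough sketch of the proposed two-step shape (every name here is hypothetical; the real logic lives in `compaction_scheduler.rs` at the link above):

```rust
/// Hypothetical task handle used only for this sketch.
struct CompactTask {
    task_id: u64,
}

// Stubbed steps: in the real code these would touch the hummock manager
// and the compactor RPC channel respectively.
async fn assign(task: &CompactTask) -> bool {
    // Takes the compaction lock briefly; no network I/O here.
    task.task_id > 0
}

async fn send(task: &CompactTask) -> bool {
    // Slow RPC to the compactor, performed without holding the lock.
    task.task_id > 0
}

async fn cancel_assignment(task: &CompactTask) {
    // Undo the assignment so the task can be re-picked later.
    let _ = task.task_id;
}

/// Assign first, send second; roll back the assignment if the send fails.
async fn assign_then_send(task: CompactTask) -> bool {
    if !assign(&task).await {
        return false;
    }
    if !send(&task).await {
        cancel_assignment(&task).await;
        return false;
    }
    true
}
```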
That sounds quite orthogonal, so perhaps I'll look into it in a follow-up PR. I'll create an issue for it.
Any updates?
Sorry. I've merged main today. I'll try to address all the review comments by tomorrow morning.
I've made the suggested changes. In addition, the unit test runs and passes with `RUSTFLAGS="--cfg madsim"`.
My apologies, I fell sick in the latter half of last week. I've cleaned up some remaining issues.
Hi @Li0k would you please take a look? :)
Hi @Li0k would you please take a look? :)
Sorry, I will finish the review within the next two days.
Note that this PR doesn't handle cancellation of picked but unassigned task.
As a result, we must ensure an unassigned task is eventually either assigned or cancelled during runtime. (All unassigned tasks are also cancelled in one go during meta bootstrap, but that's only the last guard.) So here is one mistake I need to fix: this method tries to cancel a (possibly unassigned) task only once, rather than ensuring its success: https://github.com/singularity-data/risingwave/blob/9cd73e12ba1aa39992c5a619ad2f72eca258e8e4/src/meta/src/hummock/manager/mod.rs#L1388
As a comparison, this code does ensure an unassigned task is eventually either assigned or cancelled: https://github.com/singularity-data/risingwave/blob/9cd73e12ba1aa39992c5a619ad2f72eca258e8e4/src/meta/src/hummock/compaction_scheduler.rs#L153
@zwang28 @Li0k @jon-chuang Shall we prioritize merging this PR?
Shall we prioritize merging this PR
We need to resolve these for correctness @jon-chuang: https://github.com/risingwavelabs/risingwave/pull/4496#discussion_r956209913 https://github.com/risingwavelabs/risingwave/pull/4496#discussion_r956837883
Seems there are a lot of conflicts to be resolved.
Seems there are a lot of conflicts to be resolved.
It's ok to prioritize this PR first.
Compaction is triggered by LSM write, so currently it's possible no compaction is triggered when a meta/compactor is (re)started and no new write is committed. Let's fix this in another PR.
Note that this PR doesn't handle cancellation of picked but unassigned task.
Is this fixed? Yes, right, by `cancel_compaction_task`.
Compaction is triggered by LSM write, so currently it's possible no compaction is triggered when a cluster is (re)started and no write is committed. Let's fix this in another PR.
Indeed, we need to trigger compaction for all groups whenever we restart the cluster. Since this is a noop if a particular compaction group will not produce a task, there is no cost to doing this.
(In fact, there is no cost to periodically checking for new compaction tasks for all compaction groups in a tokio task, rather than having a reactive approach as we do now... Perhaps this is a better solution, as it can help us make progress on compaction even if we've made a mistake or there is a fault somewhere in our triggering of new tasks?)
Note that this PR doesn't handle cancellation of picked but unassigned task.
This is not fixed yet, but only manual compaction is affected by this issue and I will fix it soon. Auto scheduled compaction is not affected by it and behaves correctly. https://github.com/risingwavelabs/risingwave/pull/4496#issuecomment-1229435794
Indeed, we need to trigger compaction for all groups whenever we restart the cluster.
And there are more cases that require an explicit trigger, e.g. when a compactor is recovered.
(In fact, there is no cost to periodically checking for new compaction tasks for all compaction groups in a tokio task, rather than having a reactive approach as we do now... Perhaps this is a better solution, as it can help us make progress on compaction even if we've made a mistake or there is a fault somewhere in our triggering of new tasks?)
The cost of periodic checking is this RW lock. That is, checking tasks may block reporting tasks, as checking does take time.
Yes, periodic checking is needed. I think the benefit of the current approach is that a new task can be scheduled right after an LSM write, rather than waiting for a certain time interval. I think we can periodically send a signal to the compaction scheduler for all compaction groups.
The cost of periodic checking is this RW lock. I think we can periodically send a signal to the compaction scheduler for all compaction groups.
I see. I think the cost is low if we do a periodic unconditional schedule relatively infrequently. Say, once every 30 seconds.
Anyway, to be clear, I meant a periodic schedule in addition to the reactive schedule.
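A hedged sketch of what such a periodic trigger could look like on top of the existing reactive one (the channel, the group id type, and the function name are illustrative; the 30-second interval comes from the comment above):

```rust
use std::time::Duration;
use tokio::sync::mpsc::UnboundedSender;

type CompactionGroupId = u64;

/// Sketch only: every 30 seconds, nudge the compaction scheduler for every
/// known group. Scheduling is a no-op for groups with no eligible task, so
/// this is a cheap safety net rather than a replacement for the reactive,
/// LSM-write-driven trigger.
fn spawn_periodic_trigger(
    groups: Vec<CompactionGroupId>,
    tx: UnboundedSender<CompactionGroupId>,
) {
    tokio::spawn(async move {
        let mut interval = tokio::time::interval(Duration::from_secs(30));
        loop {
            interval.tick().await;
            for group_id in &groups {
                // Ignore send errors: the scheduler may have shut down.
                let _ = tx.send(*group_id);
            }
        }
    });
}
```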
I think this PR is ready for merge now? @jon-chuang
Fixed above mentioned issue in https://github.com/risingwavelabs/risingwave/pull/5102 https://github.com/risingwavelabs/risingwave/pull/5104
I've tested this PR in an e2e setting by commenting out `compact_done`.
With it commented out, we get the desired tracing messages, e.g.:
2022-09-05T07:41:25.189742Z INFO risingwave_meta::hummock::manager: Task with task_id 42 with context_id 1 has expired due to lack of visible progress
2022-09-05T07:41:25.189935Z INFO risingwave_meta::hummock::manager: CancelTask operation for task_id 42 has been sent to node with context_id 1
After merging, I'll test the PR again to ensure that long-running tasks will not trigger task expiry as long as they make progress (my tpch-bench is not working properly at the moment).