
feat(hummock/meta): Reliable task cancellation via heartbeats

Open jon-chuang opened this issue 2 years ago • 4 comments

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

Cancel compaction tasks and trigger new ones for their compaction group if no progress is made in num_blocks_{sealed/uploaded} after a certain timeout.

To increase reliability, we retry task reporting (only on heartbeat timeout) and ultimately fall back to directly triggering new tasks for the compaction group.
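Roughly, the heartbeat-driven progress check could look like the minimal sketch below. The `TaskProgress` struct and its methods are hypothetical illustrations; only the `num_blocks_sealed` / `num_blocks_uploaded` counters and the timeout come from the description above.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-task progress record; the field names mirror the
/// counters mentioned in the PR description but are not the actual code.
struct TaskProgress {
    num_blocks_sealed: u64,
    num_blocks_uploaded: u64,
    last_progress_at: Instant,
}

impl TaskProgress {
    fn new() -> Self {
        Self {
            num_blocks_sealed: 0,
            num_blocks_uploaded: 0,
            last_progress_at: Instant::now(),
        }
    }

    /// Fold a heartbeat into the record; any increase in either counter
    /// counts as progress and resets the timer.
    fn update(&mut self, sealed: u64, uploaded: u64) {
        if sealed > self.num_blocks_sealed || uploaded > self.num_blocks_uploaded {
            self.num_blocks_sealed = sealed;
            self.num_blocks_uploaded = uploaded;
            self.last_progress_at = Instant::now();
        }
    }

    /// A task "expires" when neither counter has advanced within the timeout,
    /// at which point the meta node would cancel it and trigger a new task
    /// for the compaction group.
    fn is_expired(&self, timeout: Duration) -> bool {
        self.last_progress_at.elapsed() >= timeout
    }
}

fn main() {
    let mut progress = TaskProgress::new();
    progress.update(4, 2); // a heartbeat reporting some sealed/uploaded blocks
    assert!(!progress.is_expired(Duration::from_secs(30)));
}
```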

Checklist

  • [x] I have written necessary rustdoc comments
  • [ ] I have added necessary unit tests and integration tests
  • [ ] All checks passed in ./risedev check (or alias, ./risedev c)

Refer to a related PR or issue link (optional)

RFC: https://github.com/singularity-data/risingwave/issues/4457 Fixes: https://github.com/singularity-data/risingwave/issues/3677, https://github.com/singularity-data/risingwave/issues/4511

jon-chuang avatar Aug 08 '22 08:08 jon-chuang

As pointed out in https://github.com/singularity-data/risingwave/issues/4379#issuecomment-1208772404, we actually need to retry report_compact_task in a loop until it succeeds, particularly in the compaction heartbeat loop, as it is the catch-all for reporting task status (covering the cases where the ordinary routes for reporting success and failure both fail).

It remains to be decided whether we use an infinite loop or retry a fixed number of times. I would go with the latter and then, again with a fixed number of retries, manually schedule the compaction group as a robust, idempotent fallback (so at least the group as a whole does not stall, even if the SSTs that are inputs to the task whose report failed remain permanently "trapped"). If all of the above fails, we report an error and advise that the LSM tree may be permanently imbalanced.


As an even more robust solution, we may store tasks we have failed to report in a set, and periodically try to report them until we finally succeed. This would allow us to eventually restore order to the system as long as reporting eventually succeeds (rather than only if it succeeds within 5 consecutive retries). This may be a task for a future PR.
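For illustration, here is a minimal sketch of this retry-then-fallback shape, assuming a bounded retry count plus a set of unreported tasks for a later periodic sweep. All function names are hypothetical stand-ins, not the actual RPC client API.

```rust
use std::collections::HashSet;
use std::time::Duration;

const MAX_REPORT_RETRIES: usize = 5;

/// Hypothetical stand-in for the real `report_compact_task` RPC to meta.
async fn report_compact_task(task_id: u64) -> Result<(), ()> {
    let _ = task_id;
    Ok(())
}

/// Hypothetical stand-in for manually re-scheduling the compaction group.
async fn trigger_compaction_group(group_id: u64) {
    let _ = group_id;
}

/// Retry the report a fixed number of times; on exhaustion, fall back to
/// re-triggering the group (idempotent, keeps the group making progress)
/// and park the task id so a periodic sweep can keep retrying the report.
async fn report_with_retry(task_id: u64, group_id: u64, unreported: &mut HashSet<u64>) {
    for _ in 0..MAX_REPORT_RETRIES {
        if report_compact_task(task_id).await.is_ok() {
            return;
        }
        tokio::time::sleep(Duration::from_millis(500)).await;
    }
    trigger_compaction_group(group_id).await;
    unreported.insert(task_id);
}

#[tokio::main]
async fn main() {
    let mut unreported = HashSet::new();
    report_with_retry(42, 1, &mut unreported).await;
}
```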

jon-chuang avatar Aug 09 '22 01:08 jon-chuang

Codecov Report

Merging #4496 (ad67466) into main (a14d883) will decrease coverage by 0.04%. The diff coverage is 56.84%.

@@            Coverage Diff             @@
##             main    #4496      +/-   ##
==========================================
- Coverage   74.21%   74.16%   -0.05%     
==========================================
  Files         882      882              
  Lines      134837   135120     +283     
==========================================
+ Hits       100069   100214     +145     
- Misses      34768    34906     +138     
| Flag | Coverage Δ |
|------|------------|
| rust | 74.16% <56.84%> (-0.05%) :arrow_down: |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|----------------|------------|
| src/compute/src/server.rs | 0.00% <0.00%> (ø) |
| src/meta/src/hummock/compaction_scheduler.rs | 17.60% <0.00%> (-1.20%) :arrow_down: |
| src/meta/src/hummock/mock_hummock_meta_client.rs | 38.28% <0.00%> (-1.89%) :arrow_down: |
| src/meta/src/hummock/mod.rs | 61.83% <0.00%> (ø) |
| src/meta/src/rpc/service/hummock_service.rs | 5.92% <0.00%> (-0.27%) :arrow_down: |
| src/rpc_client/src/meta_client.rs | 0.00% <0.00%> (ø) |
| src/storage/compactor/src/server.rs | 0.00% <0.00%> (ø) |
| src/storage/src/hummock/hummock_meta_client.rs | 0.00% <0.00%> (ø) |
| src/storage/src/hummock/compactor/mod.rs | 73.78% <30.50%> (-3.04%) :arrow_down: |
| src/meta/src/hummock/compactor_manager.rs | 75.00% <63.57%> (-9.91%) :arrow_down: |
| ... and 24 more | |


codecov[bot] avatar Aug 09 '22 05:08 codecov[bot]

There is one more refactor we may consider in this PR or a later one: currently assign_compaction_task performs both the task assignment and the task send. It was originally written this way to ease error handling, but it is problematic because:

1. it holds the compaction lock while the slow task send is in flight;
2. the task is sent before it is assigned (see assign_compaction_task);
3. it is essentially no safer than simply assigning and sending in two separate steps.

A sketch of the two-step shape follows the link below.

https://github.com/singularity-data/risingwave/blob/bb68c57903eb896c9d65e043f72d22923dbd9603/src/meta/src/hummock/compaction_scheduler.rs#L187
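To make the suggestion concrete, here is a minimal sketch of the two-step shape. The `Scheduler` type and its `assign_task`/`send_task`/`cancel_assignment` methods are hypothetical stand-ins for the real manager APIs, not the code being refactored; only the control flow matters.

```rust
/// Hypothetical error types and scheduler handle used purely for illustration.
struct AssignError;
struct SendError;
struct Scheduler;

impl Scheduler {
    /// Step 1: record the assignment. Only this step would need the
    /// compaction lock, and it involves no slow RPC.
    fn assign_task(&self, task_id: u64, context_id: u32) -> Result<(), AssignError> {
        let _ = (task_id, context_id);
        Ok(())
    }

    /// Step 2: send the task to the compactor after the lock is released.
    fn send_task(&self, task_id: u64, context_id: u32) -> Result<(), SendError> {
        let _ = (task_id, context_id);
        Ok(())
    }

    /// Compensating action if the send fails; the heartbeat-based
    /// cancellation added in this PR would eventually reclaim the task
    /// anyway, but an eager cancel keeps the assignment state clean.
    fn cancel_assignment(&self, task_id: u64) {
        let _ = task_id;
    }

    fn assign_then_send(&self, task_id: u64, context_id: u32) -> bool {
        if self.assign_task(task_id, context_id).is_err() {
            return false;
        }
        if self.send_task(task_id, context_id).is_err() {
            self.cancel_assignment(task_id);
            return false;
        }
        true
    }
}

fn main() {
    assert!(Scheduler.assign_then_send(42, 1));
}
```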

zwang28 avatar Aug 09 '22 12:08 zwang28

That sounds quite orthogonal, so perhaps I'll look into it in a follow-up PR. I'll create an issue for it.

jon-chuang avatar Aug 09 '22 23:08 jon-chuang

Any updates?

hzxa21 avatar Aug 16 '22 04:08 hzxa21

Sorry. I've merged main today. I'll try to address all the review comments by tomorrow morning.

jon-chuang avatar Aug 16 '22 09:08 jon-chuang

I've made the suggested changes. In addition, the unit test runs and passes with RUSTFLAGS="--cfg madsim".

jon-chuang avatar Aug 17 '22 08:08 jon-chuang

My apologies, I fell sick in the latter half of last week. I've cleaned up some remaining issues.

jon-chuang avatar Aug 23 '22 02:08 jon-chuang

Hi @Li0k would you please take a look? :)

jon-chuang avatar Aug 25 '22 09:08 jon-chuang

Hi @Li0k would you please take a look? :)

Sorry, I will finish the review in the next two days.

Li0k avatar Aug 25 '22 11:08 Li0k

Note that this PR doesn't handle cancellation of a picked but unassigned task.

As a result, we must ensure an unassigned task is eventually either assigned or cancelled during runtime. (All unassigned tasks are also cancelled in one go during meta bootstrap, but that is only the last line of defense.) So here is one mistake I need to fix: this method tries to cancel a (possibly unassigned) task only once, rather than ensuring it succeeds: https://github.com/singularity-data/risingwave/blob/9cd73e12ba1aa39992c5a619ad2f72eca258e8e4/src/meta/src/hummock/manager/mod.rs#L1388

For comparison, this code does ensure an unassigned task is eventually either assigned or cancelled: https://github.com/singularity-data/risingwave/blob/9cd73e12ba1aa39992c5a619ad2f72eca258e8e4/src/meta/src/hummock/compaction_scheduler.rs#L153
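As a rough illustration of the "ensure it eventually succeeds" pattern, here is a minimal sketch; `cancel_compaction_task` below is a hypothetical stand-in for the manager method linked above, not the actual signature.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the manager call that cancels a picked but
/// unassigned task (the real method is linked above).
async fn cancel_compaction_task(task_id: u64) -> Result<(), ()> {
    let _ = task_id;
    Ok(())
}

/// Keep retrying until the cancellation is durably recorded, so an
/// unassigned task is eventually either assigned or cancelled.
async fn cancel_until_success(task_id: u64) {
    loop {
        match cancel_compaction_task(task_id).await {
            Ok(()) => return,
            Err(()) => {
                tracing::warn!("failed to cancel task {}, retrying", task_id);
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
        }
    }
}

#[tokio::main]
async fn main() {
    cancel_until_success(42).await;
}
```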

zwang28 avatar Aug 28 '22 11:08 zwang28

@zwang28 @Li0k @jon-chuang Shall we prioritize merging this PR?

hzxa21 avatar Aug 31 '22 08:08 hzxa21

Shall we prioritize merging this PR

We need to resolve these for correctness, @jon-chuang: https://github.com/risingwavelabs/risingwave/pull/4496#discussion_r956209913 https://github.com/risingwavelabs/risingwave/pull/4496#discussion_r956837883

zwang28 avatar Aug 31 '22 09:08 zwang28

Seems there are a lot of conflicts to be resolved.

Little-Wallace avatar Sep 01 '22 07:09 Little-Wallace

Seems there are a lot of conflicts to be resolved.

It's ok to prioritize this PR first.

Gun9niR avatar Sep 01 '22 16:09 Gun9niR

Compaction is triggered by LSM writes, so currently it's possible that no compaction is triggered when a meta node/compactor is (re)started and no new write is committed. Let's fix this in another PR.

zwang28 avatar Sep 05 '22 06:09 zwang28

Note that this PR doesn't handle cancellation of a picked but unassigned task.

Is this fixed? Yes, it is handled by cancel_compaction_task.

Compaction is triggered by LSM write, so currently it's possible no compaction is triggered when a cluster is (re)started and no write is committed. Let's fix this in another PR.

Indeed, we need to trigger compaction for all groups whenever we restart the cluster. Since this is a no-op for any compaction group that will not produce a task, there is no cost to doing this.

(In fact, there is little cost to periodically checking for new compaction tasks for all compaction groups in a tokio task, rather than only using the reactive approach we have now... Perhaps this is a better solution, as it can help us make progress on compaction even if we've made a mistake or there is a fault somewhere in our triggering of new tasks?)

jon-chuang avatar Sep 05 '22 07:09 jon-chuang

Note that this PR doesn't handle cancellation of a picked but unassigned task.

This is not fixed yet, but only manual compaction is affected by this issue, and I will fix it soon. Auto-scheduled compaction is not affected and behaves correctly. https://github.com/risingwavelabs/risingwave/pull/4496#issuecomment-1229435794

zwang28 avatar Sep 05 '22 07:09 zwang28

Indeed, we need to trigger compaction for all groups whenever we restart the cluster.

And there are more cases that require an explicit trigger, e.g. when a compactor is recovered.

(in fact, there is no cost to periodically checking in a tokio task for new compaction tasks for all compaction groups rather than to have a reactive approach as we do now... Perhaps this is a better solution as it can help us make progress on compaction even if we've made a mistake or there is a fault somewhere in our triggering of new tasks?)

The cost of periodic checking is this RW lock. That is, checking for tasks may block reporting tasks, as the check does take time.

Yes, periodic checking is needed. I think the benefit of the current approach is that a new task can be scheduled right after an LSM write, rather than waiting for a time interval to elapse. I think we can periodically send a signal to the compaction scheduler for all compaction groups.

zwang28 avatar Sep 05 '22 07:09 zwang28

The cost of periodic checking is this RW lock. I think we can periodically send a signal to the compaction scheduler for all compaction groups.

I see. I think the cost is low if we do a periodic unconditional schedule relatively infrequently. Say, once every 30 seconds.

Anyway, to be clear, I meant a periodic schedule in addition to the reactive schedule. A sketch of such a periodic sweep is below.
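This is a minimal sketch only, assuming the scheduler already listens on an mpsc channel of compaction group ids; `all_compaction_group_ids` and the channel wiring are hypothetical placeholders, not the actual scheduler API.

```rust
use std::time::Duration;
use tokio::sync::mpsc::UnboundedSender;

/// Hypothetical placeholder; the real scheduler would ask the compaction
/// group manager for the current group ids.
fn all_compaction_group_ids() -> Vec<u64> {
    vec![2, 3]
}

/// Periodically nudge the compaction scheduler for every group, in addition
/// to the reactive trigger that fires on LSM writes. Scheduling is a no-op
/// for a group with nothing to compact, so an infrequent sweep stays cheap.
async fn periodic_trigger(tx: UnboundedSender<u64>) {
    let mut interval = tokio::time::interval(Duration::from_secs(30));
    loop {
        interval.tick().await;
        for group_id in all_compaction_group_ids() {
            // If the scheduler has shut down, just stop the sweep.
            if tx.send(group_id).is_err() {
                return;
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel();
    tokio::spawn(periodic_trigger(tx));
    // The scheduler side would consume group ids here:
    if let Some(group_id) = rx.recv().await {
        println!("schedule compaction for group {group_id}");
    }
}
```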

jon-chuang avatar Sep 05 '22 07:09 jon-chuang

I think this PR is ready for merge now? @jon-chuang

Fixed the above-mentioned issues in https://github.com/risingwavelabs/risingwave/pull/5102 https://github.com/risingwavelabs/risingwave/pull/5104

zwang28 avatar Sep 05 '22 08:09 zwang28

I've tested this PR in an e2e setting by commenting out compact_done.

With it commented out, we get the desired tracing messages,

e.g.

2022-09-05T07:41:25.189742Z  INFO risingwave_meta::hummock::manager: Task with task_id 42 with context_id 1 has expired due to lack of visible progress
2022-09-05T07:41:25.189935Z  INFO risingwave_meta::hummock::manager: CancelTask operation for task_id 42 has been sent to node with context_id 1

After merging, I'll test the PR again to ensure that long-running tasks do not trigger task expiry as long as they make progress (my tpch-bench is not working properly at the moment).

jon-chuang avatar Sep 05 '22 08:09 jon-chuang