elasticsearch icon indicating copy to clipboard operation
elasticsearch copied to clipboard

Introduce max headroom for disk watermark stages

Open kingherc opened this issue 3 years ago • 26 comments

Introduce max headroom settings for the low, high, and flood disk watermark stages, similar to the existing max headroom setting for the flood stage of the frozen tier. Also, convert the disk watermarks to RelativeByteSizeValue, similar to the existing setting for the flood stage of the frozen tier.

Introduce new max headrooms in HealthMetadata and in ReactiveStorageDeciderService.

Add multiple tests in DiskThresholdDeciderUnitTests, DiskThresholdDeciderTests and DiskThresholdMonitorTests.

Fixes #81406

kingherc avatar Jul 20 '22 10:07 kingherc

Hi @kingherc, I've created a changelog YAML for you.

elasticsearchmachine avatar Jul 20 '22 12:07 elasticsearchmachine

Hi!

@DaveCTurner , because @henningandersen may see this later , I added you as reviewer because I noticed your name in several tests I modified; please tell me if you'd prefer to add someone else or wait for Henning. @gmarouli , I added you as secondary reviewer because I touched some HealthMetadata parts you had recently written -- with any luck if this PR makes the 8.4.0 version, there will not be any compatibility issues to handle. But feel free to tell me any tips for handling backwards compatibility in case this PR slips the deadline for the 8.4.0 version.

Since this is my first large PR, please tell me tips if you think I should do something differently.

Some notes and questions for you:

  • Feel free to follow this reviewing order for your ease:
  1. disk_allocator.asciidoc
  2. DiskThresholdSettings.java
  3. DiskThresholdMonitor.java
  4. DiskThresholdDecider.java
  5. HealthMetadata.java
  6. HealthMetadataService.java
  7. Then the rest, including the tests.
  • I introduced a boolean switch in several tests to make sure we test the new max headroom settings. On purpose, I split tests to test both percentage watermarks and the new max headroom settings. I could, alternatively, not split the tests and rather introduce the switch with a testMaxHeadroom=randomBoolean() inside the test, so as to randomly test either the percentage settings or the max headrooms. Please tell me if you think this would be better.
  • In modifying disk_allocator.asciidoc to add the new max headroom settings, I saw a {ess-icon} next to the watermark settings. I added the same icon. But, should I do anything related to ESS or notify the ESS team?
  • There was a rounding error happening calculating disk usage in some cases. Corrected it in a commit.

kingherc avatar Jul 20 '22 14:07 kingherc

I wonder if it would be possible for you to split it up into steps for ease of reviewing.

By "split it up into steps" I probably mean doing each step as a separate PR. Sometimes it works to have a sequence of commits in a single PR but it's hard to iterate on that kind of change.

DaveCTurner avatar Jul 20 '22 16:07 DaveCTurner

Hi @DaveCTurner , I get you. I also thought of this problem in the last couple of days. I was mostly working on the tests so that they are comprehensive, and the test-related changes of the new max headroom settings are the bulk of the PR. The good thing is that the test changes can be reviewed without going back-and-forward. I believe following the order of reading I recommended will get you through the main code changes, while the rest of the test-related changes can be read in a forward-reading manner at your own leisure/pace.

Some of the changes that you mentioned could be done in separate commits, to ease reviewing. I can do that tomorrow. I am a bit hesitant to split into separate PRs, because:

  • I would ideally see this through sooner than later to make it in the 8.4.0 version FF, if there are no major concerns of course, else I would need to figure out the compatibility issues with the HealthMetadataService (which may not be major, but I did not need to care for them now), and
  • of the additional effort to extricate seemingly orthogonal changes (which I am not yet 100% sure they will be totally orthogonal) that can be self-standing PRs passing the merge gates separately. While with separate commits, this will not be an issue for the merge gate.

Taking the above in consideration, if it is OK with you to review separate commits, it would be great. However, if you believe that I should split in separate PRs, that is also perfectly fine, I can go ahead and do that!

kingherc avatar Jul 20 '22 18:07 kingherc

As things stand I don't think I'll be able to review this satisfactorily by 8.4.0 FF - there's just too much going on in this one PR and this stuff is important to get right. Let's not fixate on that date too much, we will be able to make this compatible in a later version if needed.

DaveCTurner avatar Jul 21 '22 06:07 DaveCTurner

Hi @DaveCTurner , that makes sense, it is a very tight window until FF. No problem, please review when you can, and indeed I will make it compatible as/if needed after the FF. I will go ahead and split commits for easier reviewing, but please tell me if I should split PRs. BTW, if you do not review it this week, then @henningandersen can review instead of you next week, since I believe either one of your reviews would be sufficient and since he seems more relevant to how he introduced the max headroom setting for the frozen tier. Of course if you still want to review, feel free, your comments are definitely appreciated :)

@gmarouli , when you also review, please comment on compatibility implications, assuming this PR makes it after the 8.4.0 FF. E.g., if there would be any issues introducing the new max headroom in the HealthMetadata after 8.4.0 with some version if switches in the serialization and/or XContent (is HealthMetadata exposed to REST?).

I am thinking I should not merge this with only one approval. So I will wait for at least two approvals (@gmarouli and at least one of @DaveCTurner or @henningandersen ). Please tell me if I should do differently. Thanks so much for taking the time to review this!

kingherc avatar Jul 21 '22 08:07 kingherc

:+1: I think it might be tight to have Henning review all of this before FF too but let's see. I'm very comfortable reviewing the bits that I highlighted as possible preliminaries, especially since they are refactorings which shouldn't change behaviour, and I expect that will make the behaviour-changing side of this PR much more manageable. That's why I'd rather approve it in steps so I'd still like to see separate PRs please.

One LGTM is usually acceptable unless the reviewer indicates otherwise.

DaveCTurner avatar Jul 21 '22 08:07 DaveCTurner

Thanks for the tips @DaveCTurner ! If one LGTM is enough, I believe I could disengage you & Henning, and ask @gmarouli if you would be comfortable to review this, maybe in time for the FF? Feel free to tell me if you'd like separate commits/PRs and I can do them immediately.

kingherc avatar Jul 21 '22 10:07 kingherc

@kingherc thank you for picking this up, I am very excited to see this coming through!

gmarouli avatar Jul 21 '22 11:07 gmarouli

Thanks for the tips @DaveCTurner ! If one LGTM is enough, I believe I could disengage you & Henning, and ask @gmarouli if you would be comfortable to review this, maybe in time for the FF? Feel free to tell me if you'd like separate commits/PRs and I can do them immediately.

@kingherc , honestly I am not comfortable enough being the primary reviewer for this change. I have gained some knowledge lately on the watermarks but it is still pretty fresh and I do not think it's complete :). I would prefer that @DaveCTurner or @henningandersen take a look too.

gmarouli avatar Jul 21 '22 11:07 gmarouli

@kingherc , honestly I am not comfortable enough being the primary reviewer for this change. I have gained some knowledge lately on the watermarks but it is still pretty fresh and I do not think it's complete :). I would prefer that @DaveCTurner or @henningandersen take a look too.

Great, thanks @gmarouli for starting the review ! I will try to address your issues as soon as possible. I also saw your bwc comments, and I see that there are no major concerns with handling bwc in the next version, so I feel more comfortable with this slipping the FF. Since there is no rush, I will let @henningandersen review this as well once he's back next week, since he's most relevant to the code changes.

@DaveCTurner , I think I can disengage you from reviewers. In the meantime, I may try to carve out some PRs from this big PR, for the preliminary things we mentioned, and in that case I may add you there as reviewer. If that happens, and those smaller PRs get merged, I will keep in mind to merge master again in this PR, which will lighten the reviewing load of this big PR.

kingherc avatar Jul 21 '22 11:07 kingherc

Pinging @elastic/es-core-infra (Team:Core/Infra)

elasticsearchmachine avatar Jul 21 '22 12:07 elasticsearchmachine

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine avatar Jul 21 '22 12:07 elasticsearchmachine

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine avatar Jul 21 '22 12:07 elasticsearchmachine

Deployed on ESS and turned healthy without any log mentions of headroom.

kingherc avatar Jul 21 '22 12:07 kingherc

Carved out the disk usage double rounding issue into separate PR: https://github.com/elastic/elasticsearch/pull/88683 . Merged and this PR now is lighter.

kingherc avatar Jul 21 '22 14:07 kingherc

Thanks @gmarouli for the review so far! Please note that as @DaveCTurner requested, I extricated the part related to converting the disk watermarks to RelativeByteSizeValues to another PR at https://github.com/elastic/elasticsearch/pull/88719 . When that PR gets reviewed and merged, it will lighten the review load of this PR.

I am thinking of whether I should do anything with this PR. Maybe rebasing it on top of the other PR.

kingherc avatar Jul 22 '22 11:07 kingherc

I am thinking of whether I should do anything with this PR. Maybe rebasing it on top of the other PR.

Once #88719 is merged to master you can merge master into this branch and that will simplify things. Rebasing isn't as good at preserving review comments and context, or at least it wasn't in the past (maybe the bugs are fixed now).

DaveCTurner avatar Jul 22 '22 12:07 DaveCTurner

I am thinking of whether I should do anything with this PR. Maybe rebasing it on top of the other PR.

Once #88719 is merged to master you can merge master into this branch and that will simplify things. Rebasing isn't as good at preserving review comments and context, or at least it wasn't in the past (maybe the bugs are fixed now).

Yes makes sense. I will also cherry-pick all the commits I did in the other PR before merging, to hopefully avoid resolving conflicts. :crossed_fingers:

kingherc avatar Jul 22 '22 13:07 kingherc

Hi! As @DaveCTurner suggested, we first merged in master some sub-PRs which greatly reduced this PR and the reviewing load. Now I merged master and this is ready to be reviewed. Since @henningandersen should be back, I suggest he does the reviewing. I will handle any feedback and any conflicts that may occur (due to FF and master changes) from Tuesday. Of course, @DaveCTurner , feel free to review as well if you'd like, no problem! @gmarouli feel free to review again the HealthMetada-related parts. Thanks everybody for the help and feedback so far, really appreciated!

kingherc avatar Jul 23 '22 20:07 kingherc

@kingherc HealthMetadata have changed from Metadata.Custom to ClusterState.Custom effectively removing the need for parsing XContent (https://github.com/elastic/elasticsearch/pull/88736). If you merge this change you will be able to clean that part of the code.

gmarouli avatar Jul 25 '22 10:07 gmarouli

Hi @DaveCTurner and @henningandersen , I incorporated all feedback so far. Please feel free to review the PR once more.

@henningandersen , there are two unresolved conversations above, please take a look at them.

Thanks!

kingherc avatar Aug 01 '22 17:08 kingherc

@elasticmachine run elasticsearch-ci/part-1

kingherc avatar Aug 01 '22 21:08 kingherc

re-running part-1 since I found the mentioned error reproduces on master and opened https://github.com/elastic/elasticsearch/issues/89015

kingherc avatar Aug 01 '22 21:08 kingherc

@elasticmachine run elasticsearch-ci/part-1

kingherc avatar Aug 03 '22 09:08 kingherc

Hi @henningandersen , @DaveCTurner , gentle reminder for reviewing.

kingherc avatar Aug 04 '22 10:08 kingherc

@henningandersen feel free to review once more. I resolved the previous comments/conversations.

kingherc avatar Aug 12 '22 09:08 kingherc

Hi @henningandersen , thanks for the feedback! I handled it. Feel free to give it another review. (something is currently weird with the test infrastructure -- do not mind these, I will get them to green next week).

kingherc avatar Aug 12 '22 16:08 kingherc

@elasticmachine test this please

kingherc avatar Aug 13 '22 18:08 kingherc

Hi @henningandersen ! gentle reminder for giving another review

kingherc avatar Aug 18 '22 14:08 kingherc