ingress-gce icon indicating copy to clipboard operation
ingress-gce copied to clipboard

Fix instance group sizes

Open cezarygerard opened this issue 1 year ago • 7 comments

the is an error in truncating to 1000 nodes before iterating over zones when adding/removing instances from instance group

plus unit tests for multizonal case

cezarygerard avatar Jun 05 '24 16:06 cezarygerard

Please fix your commit message (typos). Would be good to give a bit more context on the change:

"the is an error in truncating to 1000 nodes before iterating over zones when adding/removing instances from instance group"

  • What is the error in the original code path? What this bug being hit frequently in production but not noticed or an edge condition that we finally saw?
  • Under what conditions will we encounter this error?
  • How will the result change between the old behavior and the new behavior?
  • What kind of testing did we add to make sure the behavior did not change other than what we want to change?

Do we want to add some more unit tests? It looks like a medium sized change with potentially large impact.

bowei avatar Jun 24 '24 23:06 bowei

@bowei Adding some context on the bug

This is the code prior to this change:

		// Individual InstanceGroup has a limit for 1000 instances in it.
		// As a result, it's not possible to add more to it.
		if len(kubeNodes) > m.maxIGSize {

The intent seemed to have limit of 1k per zone, not in total. Looking at the comment and variable name that someone meant to have a limit per IG, not per zone. Also, to add to that, there is code in the legacy ILB controller that also syncs the same IG (it's shared). If there is a cluster that for some reason uses both, both controllers will resize the instance groups, legacy making them 1k per zone while this one will make it 1k total across zones. Taking into account that the ingress-gce was initially based on the legacy one this was a bug.

The issue is not hit frequently in prod since you need a large cluster with ILB subsetting to see it. Conditions:

  • cluster with above 1k nodes
  • ILB subsetting service with etp: Local
  • be unlucky so that all pods are on nodes NOT added to the IG

Result: no pods get traffic

This could still be a problem for with more than 1k nodes in one zone.

But even excluding the customer issue, the difference between legacy and ingress-gce + the intent from the comments warrant this fix

mmamczur avatar Jun 25 '24 07:06 mmamczur

What is the error in the original code path? What this bug being hit frequently in production but not noticed or an edge condition that we finally saw?

Original code path. Occurs only at certain scale, only noticeable with services with ExternalTrafficPolicy: Local

Under what conditions will we encounter this error?

1K+ nodes, ExternalTrafficPolicy: Local, pods added only on trimmed nodes.

How will the result change between the old behavior and the new behavior?

we will limit to 1k nodes on per zone basis, exactly as we do in cloud-controller-gcp code managing instance groups

What kind of testing did we add to make sure the behavior did not change other than what we want to change?

  • unit tests for IG management that only passes with the new code
  • all the remaining unit tests are passing unchanged
  • we have robous E2E tests that mixes all possible service types, employs all our controllers

cezarygerard avatar Jun 25 '24 08:06 cezarygerard

@bowei

Please fix your commit message (typos). Would be good to give a bit more context on the change:

added more context

but where is a typo in the original message: "the is an error in truncating to 1000 nodes before iterating over zones when adding/removing instances from instance group"

?

cezarygerard avatar Jun 25 '24 09:06 cezarygerard

  1. I think my "typo" comment threw off the main point of my comment which is to take the above detailed explanations for the fix and put them in the commit message to describe what the change was. I'm not sure it's obvious what the context of the commit is apart from reading the discussion on Github.

  2. If the fix references an already existing implementation, please include a link or reference to that implementation in the commit message.

  3. How was this tested? Did we do an integration test up to the IG size limit where the controller will start behaving differently?

bowei avatar Jun 26 '24 05:06 bowei

Can the metrics change be done as a follow on? Or do we need to have this go in in one commit?

bowei avatar Jul 16 '24 20:07 bowei

Can the metrics change be done as a follow on? Or do we need to have this go in in one commit?

yes, please.

Added comment with link to cloud-provider-gcp's Legacy L4 ILB Controller code

Internally we have run manual tests to check consistency between legacy cloud-provider-gcp's Legacy L4 ILB Controller and this new logic and they look good

cezarygerard avatar Jul 24 '24 14:07 cezarygerard

I have also added internal ticket to follow up on metrics, linking to this PR and Swetha's comments

cezarygerard avatar Jul 24 '24 14:07 cezarygerard

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cezarygerard, swetharepakula

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • ~~OWNERS~~ [cezarygerard,swetharepakula]

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Jul 24 '24 19:07 k8s-ci-robot