gke-autoneg-controller icon indicating copy to clipboard operation
gke-autoneg-controller copied to clipboard

#109 overrides external managed capacity scaling

Open xinau opened this issue 1 year ago • 7 comments

Hey :wave:, the change introduced in #109 results in the capacity scaler setting being overridden when being set by an external tool. This is a breakage to the previous behavior and can be potentially be dangerous for production environments.

Before, when a service had been created with an initial_capacity of 50 the NEGs would have been added with a capacity of 50%. When the capacity got changed on the NEGs via an external tool, the controller during reconciliation would leave it as it is.

With #109, when the capacity is changed and the controller reconciles, the capacity is set back to the initial_capacity.

This can be easily compared with the latest version and v1.0.0.

  1. create a service with a neg config for "initial_capacity": 50
  2. wait for the controller to add the NEGs with 50% to the backend
  3. update the capacity in the console to 75%
  4. wait for the console to update the capacity to 75%
  5. wait for or force a reconciliation by the controller
  6. check the capacity of the NEGs

I do see merit in having an "overriding" annotation as well as externally manageable way. This can probably be best achived by

  • having a new capacity setting along the initial_capacity
  • create a flag for allowing overrides

Thanks for creating this tool.

cc @rosmo and @darkstarmv

xinau avatar Jan 09 '24 14:01 xinau

I can add a flag with default value of false to prevent sync, so original behavior is not changed

darkstarmv avatar Jan 09 '24 17:01 darkstarmv

My 5 cent. I think restoring the original behavior would be the sane thing to do regardless, or properly announcing the breakage in the next release.

I think it's important to name either the flag properly or use another field as initial_capacity clearly indicates that this is the initial value and thus it can be assumed that it won't be causing any effect afterwards.

xinau avatar Jan 10 '24 11:01 xinau

@xinau Yes, any changes will be announced properly. I wouldn't recommend using master branch built versions in production.

@darkstarmv I'd be fine with both implementation (initial_capacity or flag update_capacity (or similar)).

rosmo avatar Jan 29 '24 09:01 rosmo

@rosmo, Thanks for taking care of this.

xinau avatar Feb 07 '24 06:02 xinau

@darkstarmv Do you have some time to add flag or should I look into it?

rosmo avatar Feb 28 '24 08:02 rosmo

@rosmo I can work on it this week. My plan is:

  1. add controller.autoneg.dev/sync annotation, so users can set synchronization of K8s settings to Google per K8s service.
  2. If controller.autoneg.dev/sync: true - update capacityScaler with K8s service value
  3. If controller.autoneg.dev/sync: false or not present - do not update capacityScaler from k8s service value

darkstarmv avatar Mar 04 '24 19:03 darkstarmv

@darkstarmv Thanks, let me know if you need help.

rosmo avatar Mar 26 '24 10:03 rosmo

I went ahead and implemented the above, but with a slight twist, making the sync annotation more flexible. Please see #125.

The annotation would be controller.autoneg.dev/sync: '{"capacity_scaler":true}'

rosmo avatar Apr 17 '24 13:04 rosmo

@xinau Would you be able to test the current master? It should not override any existing settings.

rosmo avatar Apr 19 '24 09:04 rosmo