Better output from kops rolling-update cluster command

Open UncleEricB opened this issue 3 years ago • 8 comments

/kind feature

1. Describe IN DETAIL the feature/behavior/change you would like to see.

There are multiple reasons a k8s node can be in NeedsUpdate state. I want a more focused explanation of the trigger for nodes in an InstanceGroup being in NeedsUpdate state when kops rolling-update cluster is run, possibly at a verbosity around 4.

The reason for this request is that the documentation lists four triggers for a node being in a NeedsUpdate state, but it doesn't clearly state how to check each of those possible causes. I guess "The instance was created with a specification that is older" refers to Launch Template versions? Maybe "The instance was detached" refers to a cordon Taint?

This will speed up debugging and improve uptime. It will also expand the pool of SREs capable of debugging as not everyone has the same level of kOps/k8s expertise.

2. Feel free to provide a design supporting your feature request.

Preferred Output

$ kops rolling-update cluster cactus-1-23.k8s.sproutsocial.com --state s3://infra-kops-state -v4 ~/sandbox/sprout_development_env/NeedsUpdateChecker
I0812 11:52:07.404391 4005 factory.go:68] state store s3://infra-kops-state
...snip...
I0812 11:52:10.825012 4005 aws_cloud.go:1551] Querying EC2 for all valid zones in region "us-east-1"
I0812 11:52:10.826233 4005 request_logger.go:45] AWS request: ec2/DescribeAvailabilityZones
I0812 11:52:11.322863 4005 aws_cloud.go:629] Listing all Autoscaling groups matching cluster tags
I0812 11:52:11.324043 4005 request_logger.go:45] AWS request: autoscaling/DescribeTags
I0812 11:52:11.841028 4005 request_logger.go:45] AWS request: autoscaling/DescribeAutoScalingGroups
I0812 11:52:12.022521 4005 aws_cloud.go:743] Launch Template Version Specified By ASG: $Latest
I0812 11:52:12.023747 4005 request_logger.go:45] AWS request: ec2/DescribeLaunchTemplates
I0812 11:52:12.141730 4005 aws_cloud.go:762] Launch Template Version used for compare: "3"
I0812 11:52:12.141732 4005 aws_cloud.go:764] InstanceGroup nodes-us-east-1a nodes Launch Template are behind!
I0812 11:52:14.051511 4005 aws_cloud.go:743] Launch Template Version Specified By ASG: $Latest
I0812 11:52:14.051654 4005 request_logger.go:45] AWS request: ec2/DescribeLaunchTemplates
I0812 11:52:14.178106 4005 aws_cloud.go:762] Launch Template Version used for compare: "4"
I0812 11:52:14.178108 4005 aws_cloud.go:765] InstanceGroup nodes-us-east-1b nodes have a Cordon Taint!
I0812 11:52:14.532158 4005 aws_cloud.go:743] Launch Template Version Specified By ASG: $Latest
I0812 11:52:14.532365 4005 request_logger.go:45] AWS request: ec2/DescribeLaunchTemplates
I0812 11:52:14.647179 4005 aws_cloud.go:762] Launch Template Version used for compare: "4"
I0812 11:52:14.647181 4005 aws_cloud.go:766] InstanceGroup nodes-us-east-1d nodes have needs-update annotation
...snip...

--or even--

NAME               STATUS       NEEDUPDATE  READY  MIN  TARGET  MAX  NODES  REASON
master-us-east-1a  Ready        0           1      1    1       1    1
master-us-east-1b  Ready        0           1      1    1       1    1
master-us-east-1d  Ready        0           1      1    1       1    1
nodes-us-east-1a   NeedsUpdate  2           0      2    2       2    2      Launch Template version
nodes-us-east-1b   NeedsUpdate  2           0      2    2       2    2      Cordon Taint
nodes-us-east-1d   NeedsUpdate  2           0      2    2       2    2      kops.k8s.io/needs-update
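For illustration, the table variant with a trailing REASON column could be rendered roughly as in the sketch below. This is not kOps code: `groupRow`, `renderTable`, and `renderString` are hypothetical names, and it only shows how Go's standard `text/tabwriter` can align an extra column that is empty for groups that are Ready.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
	"text/tabwriter"
)

// groupRow is a hypothetical, simplified row of the rolling-update summary.
// It is not a kOps type; the field names are assumptions for illustration.
type groupRow struct {
	Name       string
	Status     string
	NeedUpdate int
	Ready      int
	Min        int
	Target     int
	Max        int
	Nodes      int
	Reason     string // empty when the group does not need an update
}

// renderTable writes the rows as an aligned table with a trailing REASON column.
func renderTable(w io.Writer, rows []groupRow) error {
	tw := tabwriter.NewWriter(w, 0, 8, 2, ' ', 0)
	fmt.Fprintln(tw, "NAME\tSTATUS\tNEEDUPDATE\tREADY\tMIN\tTARGET\tMAX\tNODES\tREASON")
	for _, r := range rows {
		fmt.Fprintf(tw, "%s\t%s\t%d\t%d\t%d\t%d\t%d\t%d\t%s\n",
			r.Name, r.Status, r.NeedUpdate, r.Ready, r.Min, r.Target, r.Max, r.Nodes, r.Reason)
	}
	return tw.Flush()
}

// renderString is a convenience wrapper that returns the table as a string.
func renderString(rows []groupRow) string {
	var sb strings.Builder
	if err := renderTable(&sb, rows); err != nil {
		return ""
	}
	return sb.String()
}

func main() {
	rows := []groupRow{
		{Name: "master-us-east-1a", Status: "Ready", Ready: 1, Min: 1, Target: 1, Max: 1, Nodes: 1},
		{Name: "nodes-us-east-1a", Status: "NeedsUpdate", NeedUpdate: 2, Min: 2, Target: 2, Max: 2, Nodes: 2,
			Reason: "Launch Template version"},
	}
	if err := renderTable(os.Stdout, rows); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Putting REASON last keeps the existing columns where scripts may already expect them, and an empty value for Ready groups leaves no trailing padding.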

UncleEricB, Aug 12 '22 19:08

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot, Nov 10 '22 19:11

I think these are good suggestions, but probably hard to prioritise for most of the maintainers. It should however be low-hanging fruit for new contributors.

olemarkus, Nov 20 '22 14:11

The places that need this logging:

func (group *CloudInstanceGroup) AdjustNeedUpdate() {

func getCloudGroups(c GCECloud, cluster *kops.Cluster, instancegroups []*kops.InstanceGroup, warnUnmatched bool, nodes []v1.Node) (map[string]*cloudinstances.CloudInstanceGroup, error) {

func awsBuildCloudInstanceGroup(c AWSCloud, cluster *kops.Cluster, ig *kops.InstanceGroup, g *autoscaling.Group, nodeMap map[string]*v1.Node) (*cloudinstances.CloudInstanceGroup, error) {

and any place that assigns the value CloudInstanceStatusNeedsUpdate
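One way to make sure none of those places forgets the reason is to route every NeedsUpdate assignment through a single helper. The sketch below is a self-contained illustration, not the kOps implementation: `instance`, `instanceGroup`, `markNeedsUpdate`, `reasons`, the reason constants, and the instance IDs are all hypothetical, and `statusNeedsUpdate` merely stands in for CloudInstanceStatusNeedsUpdate.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Hypothetical reason labels, taken from the wording of the feature request.
const (
	reasonLaunchTemplate = "Launch Template version"
	reasonCordonTaint    = "Cordon Taint"
	reasonAnnotation     = "kops.k8s.io/needs-update"
)

// statusNeedsUpdate stands in for CloudInstanceStatusNeedsUpdate.
// The string value is an assumption made for this sketch.
const statusNeedsUpdate = "NeedsUpdate"

// instance and instanceGroup are simplified stand-ins, not kOps types.
type instance struct {
	ID     string
	Status string
	Reason string
}

type instanceGroup struct {
	Name       string
	NeedUpdate []*instance
}

// markNeedsUpdate is the only place that sets the NeedsUpdate status,
// so a status can never be recorded without its reason.
func (g *instanceGroup) markNeedsUpdate(id, reason string) {
	g.NeedUpdate = append(g.NeedUpdate, &instance{ID: id, Status: statusNeedsUpdate, Reason: reason})
}

// reasons returns the distinct reasons in the group, sorted and comma-separated.
func (g *instanceGroup) reasons() string {
	seen := map[string]bool{}
	var out []string
	for _, i := range g.NeedUpdate {
		if !seen[i.Reason] {
			seen[i.Reason] = true
			out = append(out, i.Reason)
		}
	}
	sort.Strings(out)
	return strings.Join(out, ", ")
}

func main() {
	g := &instanceGroup{Name: "nodes-us-east-1a"}
	g.markNeedsUpdate("i-0001", reasonLaunchTemplate) // placeholder instance IDs
	g.markNeedsUpdate("i-0002", reasonLaunchTemplate)
	fmt.Printf("%s: %d instance(s) need update (%s)\n", g.Name, len(g.NeedUpdate), g.reasons())
}
```

Collecting distinct reasons per group also covers the case where instances in the same InstanceGroup need an update for different reasons.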

johngmyers, Nov 20 '22 19:11

/assign

I'll take this issue, @olemarkus.

indevi, Dec 03 '22 05:12

Thanks for that.

I suggest writing user-facing text directly to stdout rather than going through klog. The remaining klog lines could be logged at -v2.

olemarkus, Dec 04 '22 11:12

@olemarkus I'm having difficulty understanding what needs to be done to complete this task. Can you please break it down into steps?

indevi, Dec 11 '22 04:12

The information that users should read should just be printed with fmt.Printf(). Less useful details should use e.g. klog.V(2).Infof().

olemarkus, Dec 11 '22 08:12
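A minimal sketch of that split, using only the standard library so it runs on its own: the `output` type, its `Printf` and `Debugf` methods, `report`, and `capture` are hypothetical stand-ins. In kOps itself the user-facing call would be fmt.Printf() and the gated call klog.V(2).Infof(); the message texts here are made up for illustration.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// output is a hypothetical stand-in that separates user-facing text from
// verbosity-gated diagnostics. It only imitates the fmt/klog split.
type output struct {
	user      io.Writer // what every user should see (stdout in main)
	debug     io.Writer // diagnostics (stderr in main)
	verbosity int       // plays the role of the -v flag
}

// Printf always writes: this is the user-facing channel.
func (o *output) Printf(format string, args ...interface{}) {
	fmt.Fprintf(o.user, format, args...)
}

// Debugf writes only when the configured verbosity is at least level.
func (o *output) Debugf(level int, format string, args ...interface{}) {
	if o.verbosity >= level {
		fmt.Fprintf(o.debug, format, args...)
	}
}

// report emits one diagnostic line and one user-facing line for a group.
func report(o *output, group, reason string) {
	o.Debugf(2, "evaluating update status for %s\n", group)
	o.Printf("InstanceGroup %s needs update: %s\n", group, reason)
}

// capture runs report against in-memory writers and returns both streams.
func capture(verbosity int, group, reason string) (user, debug string) {
	var u, d strings.Builder
	report(&output{user: &u, debug: &d, verbosity: verbosity}, group, reason)
	return u.String(), d.String()
}

func main() {
	o := &output{user: os.Stdout, debug: os.Stderr, verbosity: 0}
	report(o, "nodes-us-east-1a", "Launch Template version")
}
```

With verbosity 0 only the user-facing line appears; raising it to 2 or higher adds the diagnostic line on the debug stream.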

/remove-lifecycle stale

vaibhav2107, Sep 27 '23 10:09