
feat: idea to provide robot nodes without robot credentials

Open pmdroid opened this issue 1 year ago • 4 comments

This is not a complete implementation, just an MVP to illustrate the idea.

With the mock server, a customer who does not want the cluster to know their Robot credentials can still use the cloud controller. All the information the cloud controller needs is already assigned to the node when it starts.

If you would consider this or a similar idea, please let me know. I can easily adjust the implementation and add docs for this use case.

pmdroid, Jan 10 '24 16:01

Hey @pmdroid, thank you for your proposal!

You are right that with the labels you provided, the Node can be successfully initialized without talking to the Robot API. There are three things that currently require ongoing access to the Robot API:

  • Node Shutdown status (your PR returns false -> Running): this can be used to automatically reschedule Pods to other Nodes.
  • Node Exists (your PR returns true -> Exists): this is used to automatically delete Nodes when they are deleted in the Cloud Provider API. For Robot servers this probably does not happen automatically anyway.
  • Load Balancer targets: to reconcile them, we need to know which Robot servers exist and what IPs they have.

The info that is provided through labels in your proposal would be used to initialize the Node and set some fields on it. The cloud-provider initialization process looks like this:

  1. Operator sets --cloud-provider=external on the Kubelet
  2. When the kubelet registers the Node object with the control plane, it includes the taint node.cloudprovider.kubernetes.io/uninitialized with value "true" and effect "NoSchedule"
  3. k/cloud-provider (used in HCCM) sees the "new" (tainted) Node
  4. It asks our code for the metadata. We try to match the Node against the Cloud/Robot APIs and, if we find a match, return the metadata
  5. k/cloud-provider verifies that the response is plausible (i.e. does not conflict with the existing status.addresses)
  6. k/cloud-provider patches the Node object as follows:
 metadata:
   labels:
     # Stable Labels
+    node.kubernetes.io/instance-type: {{ .Metadata.InstanceType }}
+    topology.kubernetes.io/region: {{ .Metadata.Region }}
+    topology.kubernetes.io/zone: {{ .Metadata.Zone }}
     # Beta Labels
+    beta.kubernetes.io/instance-type: {{ .Metadata.InstanceType }}
+    failure-domain.beta.kubernetes.io/region: {{ .Metadata.Region }}
+    failure-domain.beta.kubernetes.io/zone: {{ .Metadata.Zone }}
 spec:
+  providerID: {{ .Metadata.ProviderID }}
   taints:
-    - key: "node.cloudprovider.kubernetes.io/uninitialized"
-      effect: "NoSchedule"
-      value: "true"
 status:
+  addresses: {{ .Metadata.Addresses }}
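To make the effect of step 6 concrete, here is a minimal Go sketch of the patch applied to an in-memory Node. The `Node`, `Taint`, and `InstanceMetadata` types and the `initializeNode` function are hypothetical stand-ins written for this illustration, not the real Kubernetes or k/cloud-provider types, and the plausibility check from step 5 is left out. The well-known keys are stored as node labels, which is how Kubernetes defines them.

```go
package main

import "fmt"

// InstanceMetadata is a hypothetical stand-in for the metadata the
// provider code returns in step 4.
type InstanceMetadata struct {
	ProviderID   string
	InstanceType string
	Region       string
	Zone         string
	Addresses    []string
}

// Taint and Node are simplified stand-ins for the Kubernetes objects.
type Taint struct{ Key, Value, Effect string }

type Node struct {
	Labels     map[string]string
	ProviderID string
	Taints     []Taint
	Addresses  []string
}

const uninitializedTaint = "node.cloudprovider.kubernetes.io/uninitialized"

// initializeNode applies the patch from step 6: it sets the stable and
// beta labels, the provider ID and the addresses, then drops the
// "uninitialized" taint while keeping every other taint.
func initializeNode(n *Node, md InstanceMetadata) {
	if n.Labels == nil {
		n.Labels = map[string]string{}
	}
	n.Labels["node.kubernetes.io/instance-type"] = md.InstanceType
	n.Labels["topology.kubernetes.io/region"] = md.Region
	n.Labels["topology.kubernetes.io/zone"] = md.Zone
	n.Labels["beta.kubernetes.io/instance-type"] = md.InstanceType
	n.Labels["failure-domain.beta.kubernetes.io/region"] = md.Region
	n.Labels["failure-domain.beta.kubernetes.io/zone"] = md.Zone

	n.ProviderID = md.ProviderID
	n.Addresses = append([]string(nil), md.Addresses...)

	var kept []Taint
	for _, t := range n.Taints {
		if t.Key != uninitializedTaint {
			kept = append(kept, t)
		}
	}
	n.Taints = kept
}

func main() {
	node := &Node{Taints: []Taint{{Key: uninitializedTaint, Value: "true", Effect: "NoSchedule"}}}
	// All values below are made-up examples.
	initializeNode(node, InstanceMetadata{
		ProviderID:   "hrobot://123",
		InstanceType: "example-type",
		Region:       "example-region",
		Zone:         "example-zone",
		Addresses:    []string{"203.0.113.10"},
	})
	fmt.Printf("providerID=%s taints=%d labels=%d\n", node.ProviderID, len(node.Taints), len(node.Labels))
}
```

An operator who sets these fields directly, without HCCM in the loop, would have to produce the same end state.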

Instance (cloud/instances.go)

Instead of moving this data through HCCM, what do you think about making the changes to the Node object directly? The taint could be removed by just not setting --cloud-provider=external in Kubelet.

If no Robot Credentials are supplied, InstanceExists and InstanceShutdown both return errors for the node, which would be logged but no other action would be taken.
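Why an error is harmless here can be shown with a simplified Go sketch of a lifecycle loop: a Node is only treated as deleted when the existence check succeeds and reports false, while an error is logged and the Node is skipped. The `existsFunc` type, the `reconcile` function, and the `errNoRobotCredentials` value are hypothetical names for this illustration; the real controller logic in k/cloud-provider is more involved.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoRobotCredentials is a hypothetical error for a provider that
// was started without Robot credentials.
var errNoRobotCredentials = errors.New("robot credentials not configured")

// existsFunc stands in for the provider's InstanceExists check.
type existsFunc func(nodeName string) (bool, error)

// reconcile returns the Nodes that would be deleted and the Nodes that
// were skipped because the existence check returned an error.
func reconcile(nodes []string, exists existsFunc) (deleted, skipped []string) {
	for _, name := range nodes {
		ok, err := exists(name)
		if err != nil {
			// Log and move on: an error is not proof that the server is gone.
			fmt.Printf("skipping %s: %v\n", name, err)
			skipped = append(skipped, name)
			continue
		}
		if !ok {
			deleted = append(deleted, name)
		}
	}
	return deleted, skipped
}

func main() {
	noCredentials := func(string) (bool, error) { return false, errNoRobotCredentials }
	deleted, skipped := reconcile([]string{"robot-1", "robot-2"}, noCredentials)
	fmt.Printf("deleted=%v skipped=%v\n", deleted, skipped)
}
```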

Load Balancer targets (internal/hcops.ReconcileHCLBTargets)

This leaves the issue of this API call:

	if l.Cfg.Robot.Enabled {
		dedicatedServers, err := l.RobotClient.ServerGetList()
		if err != nil {
			return changed, fmt.Errorf("%s: failed to get list of dedicated servers: %w", op, err)
		}

		for _, s := range dedicatedServers {
			robotIPsToIDs[s.ServerIP] = s.ServerNumber
			robotIDToIPv4[s.ServerNumber] = s.ServerIP
		}
	}

At a quick glance, I believe we can replace this by parsing the required information from the Kubernetes Nodes (.status.addresses, .spec.providerID).
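A rough Go sketch of that replacement could look like the following. It fills the same two maps as the snippet above, but from Node data instead of the Robot API. The `Node` and `NodeAddress` types are simplified stand-ins, `robotServersFromNodes` is a hypothetical name, and two points are assumptions: that Robot servers carry a provider ID of the form `hrobot://<server number>`, and that their public IPv4 shows up as an `ExternalIP` address on the Node.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// NodeAddress and Node are simplified stand-ins for the fields
// .status.addresses and .spec.providerID of a Kubernetes Node.
type NodeAddress struct{ Type, Address string }

type Node struct {
	Name       string
	ProviderID string
	Addresses  []NodeAddress
}

// robotPrefix is an assumption about the provider ID format.
const robotPrefix = "hrobot://"

// robotServersFromNodes builds the IP -> ID and ID -> IPv4 maps from
// Node objects. Nodes without a Robot provider ID are ignored.
func robotServersFromNodes(nodes []Node) (map[string]int, map[int]string) {
	ipToID := map[string]int{}
	idToIPv4 := map[int]string{}
	for _, n := range nodes {
		if !strings.HasPrefix(n.ProviderID, robotPrefix) {
			continue
		}
		id, err := strconv.Atoi(strings.TrimPrefix(n.ProviderID, robotPrefix))
		if err != nil {
			continue // malformed provider ID
		}
		for _, a := range n.Addresses {
			ip := net.ParseIP(a.Address)
			if a.Type != "ExternalIP" || ip == nil || ip.To4() == nil {
				continue // keep only external IPv4 addresses
			}
			ipToID[a.Address] = id
			idToIPv4[id] = a.Address
			break
		}
	}
	return ipToID, idToIPv4
}

func main() {
	nodes := []Node{
		{Name: "robot-1", ProviderID: "hrobot://321", Addresses: []NodeAddress{{Type: "ExternalIP", Address: "203.0.113.7"}}},
		{Name: "cloud-1", ProviderID: "hcloud://42", Addresses: []NodeAddress{{Type: "ExternalIP", Address: "198.51.100.4"}}},
	}
	ipToID, idToIPv4 := robotServersFromNodes(nodes)
	fmt.Println(ipToID, idToIPv4)
}
```

One trade-off of this approach: the Load Balancer would only ever see Robot servers that have already joined the cluster as Nodes, which is likely what is wanted for target reconciliation.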

apricote, Jan 15 '24 08:01

While talking to someone from the Robot Team I learned that you can also configure a special "Webservice User" which only has access to the Robot Webservice, not your full Hetzner Account. Perhaps this is also a good alternative for you. (See #608)

apricote, Jan 15 '24 11:01

Hey @apricote, thanks for the answer!

I implemented this to prevent HCCM from removing a node from the cluster that is ready but simply not managed by the cloud provider. This happens because HCCM cannot find any server that matches the Node's name.

It might be nice to have something similar that makes it possible to use Cloud Nodes and other Nodes together while keeping the other Nodes away from the Cloud Provider, maybe with a flag such as "instance.hetzner.cloud/self-managed"?
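As a minimal sketch of how such a flag could work, the Go snippet below filters out Nodes that opt out of cloud provider management. It assumes the flag is a Node label with the value "true"; whether it should be a label or an annotation, and what values it accepts, is not decided in this thread. The `Node` type and both function names are hypothetical.

```go
package main

import "fmt"

// selfManagedLabel is the flag name suggested above. Treating it as a
// label with the value "true" is an assumption of this sketch.
const selfManagedLabel = "instance.hetzner.cloud/self-managed"

// Node is a simplified stand-in for a Kubernetes Node.
type Node struct {
	Name   string
	Labels map[string]string
}

// isSelfManaged reports whether the Node opted out of management by
// the cloud provider.
func isSelfManaged(n Node) bool {
	return n.Labels[selfManagedLabel] == "true"
}

// providerManaged returns only the Nodes the cloud provider should
// keep reconciling.
func providerManaged(nodes []Node) []Node {
	var out []Node
	for _, n := range nodes {
		if !isSelfManaged(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []Node{
		{Name: "cloud-1"},
		{Name: "robot-1", Labels: map[string]string{selfManagedLabel: "true"}},
	}
	for _, n := range providerManaged(nodes) {
		fmt.Println("managed by cloud provider:", n.Name)
	}
}
```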

Thanks for the information about the "Webservice User", but even this is still too broad when it comes to permissions. The same also applies to the Cloud token. It would be nice if it were possible to create tokens that are limited to specific functions, for example to read metadata and to create and delete nodes (autoscaling).

pmdroid, Jan 16 '24 15:01

This PR has been marked as stale because it has not had recent activity. The bot will close the PR if no further action occurs.

github-actions[bot], Apr 16 '24 12:04