cluster-api-provider-packet: Cluster Management fails when run from OSX

/kind feature

Jan 30 '23 17:01 displague

More specifically, we believe the issue is with Docker Desktop on macOS.

Mar 09 '23 20:03 cprivitere

On macOS, using Docker Desktop, the machine controller fails to bring up devices in Equinix Metal. The logs are filled with messages like this:

E0309 17:22:22.216338       1 controller.go:317] controller/packetmachine "msg"="Reconciler error" "error"="failed to create scope: failed to get workload cluster client: failed to create client for Cluster default/my-cluster: Get \"https://139.178.81.91:6443/api?timeout=10s\": dial tcp 139.178.81.91:6443: connect: connection refused" "name"="my-cluster-control-plane-m29n2" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="PacketMachine"

The error message above comes from here: https://github.com/kubernetes-sigs/cluster-api-provider-packet/blob/main/controllers/packetmachine_controller.go#L130

Further investigation indicates that, for some reason, connections from a container running in Docker Desktop on macOS to any closed port fail immediately with Connection refused instead of timing out. Docker Desktop on macOS uses a VM to host the containers, and something about the configuration of that VM appears to cause containers to see ports that silently drop traffic on external services as actively refusing connections:

# From a terminal on the macOS host
$ curl google.com:43421
curl: (28) Failed to connect to google.com port 43421 after 75006 ms: Couldn't connect to server
# From an instance of mikefarah/yq:4.31.1 running in docker on the same macOS host
$ wget google.com:43421
Connecting to google.com:43421 (172.217.1.110:43421)
wget: can't connect to remote host (172.217.1.110): Connection refused

This behavior does not happen with Colima on macOS (which also runs containers in a host VM), so this issue appears to be specific to Docker Desktop on macOS.

We could potentially fix this by treating a Connection refused error the same way we treat a timeout in this code: https://github.com/kubernetes-sigs/cluster-api-provider-packet/blob/365fddba549cb5fa5b32ec469e5cfbb4d3481114/pkg/cloud/packet/scope/machine.go#L342-L355

However, that assumes that Connection refused is always a startup problem, and maybe we can't rely on that to be the case?

Mar 13 '23 18:03 ctreatma

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Jul 16 '23 19:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Jan 19 '24 21:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Feb 18 '24 21:02 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Feb 18 '24 21:02 k8s-ci-robot

/reopen

Mar 19 '24 13:03 cprivitere

/remove-lifecycle rotten

Mar 19 '24 13:03 cprivitere

@cprivitere: Reopened this issue.

Mar 19 '24 13:03 k8s-ci-robot

/lifecycle stale

Jun 17 '24 13:06 k8s-triage-robot

/lifecycle rotten

Jul 17 '24 13:07 k8s-triage-robot

/close not-planned

Aug 16 '24 14:08 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Aug 16 '24 14:08 k8s-ci-robot
