
clusterctl tries to download latest tag when no release exists.

killianmuldoon opened this issue 2 years ago • 31 comments

What steps did you take and what happened: Ran clusterctl init after a new tag (v1.3.2) for the repo was created, but before a new release corresponding to the tag was created. The operation failed with an error:

Fetching providers
Error: failed to get provider components for the "cluster-api" provider: failed to read "core-components.yaml" from provider's repository "cluster-api": failed to get GitHub release v1.3.2: failed to read release "v1.3.2": GET https://api.github.com/repos/kubernetes-sigs/cluster-api/releases/tags/v1.3.2: 404 Not Found []

What did you expect to happen: Expected clusterctl to only attempt to install versions that have associated releases.

Anything else you would like to add: This error comes from the go proxy implementation in clusterctl. The go proxy doesn't have a concept of 'releases' - it only cares about git tags. This behaviour is on by default in clusterctl. Once a tag is pushed it becomes available to clusterctl even if there is no corresponding release, resulting in a failure.

There are two workarounds.

  1. Pin the cluster-api version to an existing version:
clusterctl init --core cluster-api:v1.3
  2. Turn off GOPROXY for clusterctl when this error occurs:
GOPROXY=off clusterctl init --core cluster-api:v1.3

/kind bug /area clusterctl

I don't think there's any reasonable action to take here - other than trying to reduce the window of time between tag creation and release - but the release team should be aware of this in case it pops up in future. @kubernetes-sigs/cluster-api-release-team

killianmuldoon avatar Jan 10 '23 15:01 killianmuldoon

Probably a bad idea, but would it be an option to ignore versions if we get a 404 like in the error message shown above?

Fetching providers Error: failed to get provider components for the "cluster-api" provider: failed to read "core-components.yaml" from provider's repository "cluster-api": failed to get GitHub release v1.3.2: failed to read release "v1.3.2": GET https://api.github.com/repos/kubernetes-sigs/cluster-api/releases/tags/v1.3.2: 404 Not Found []

(no idea how that could fit into our current code)

sbueringer avatar Jan 10 '23 16:01 sbueringer

I don't think there's any reasonable action to take here - other than trying to reduce the window of time between tag creation and release - but the release team should be aware of this in case it pops up in future.

I think this will be reduced once we automate the release process further (currently someone with write access has to push the tag manually).

But with image promotion etc. I think there will always be a delay of ~ 1 hour or so.

sbueringer avatar Jan 10 '23 16:01 sbueringer

The release automation should help with this but yes the delay between the tag getting pushed and the release being available will depend on the image building and promotion mechanism we have now.

In the future we might want to explore a release flow that tags and creates the GitHub release at the same time. (Have all the assets ready when we need to tag - this might come down to how we use cloudbuild to build the container images without pushing a tag.) We can explore these options in a future improvement.

For now the release automation that the release team is working on should help with reducing the delay between the tag and the release becoming available.

ykakarap avatar Jan 11 '23 02:01 ykakarap

Probably a bad idea, but would it be an option to ignore versions if we get a 404 like in the error message shown above

not a terrible idea, especially if we're able to narrow it down to "tag exists but no release exists"

Relying on a shorter delay isn't reliable enough IMO. Even if we get it fully automated in one pipeline (which would be tricky given image promotion is still lengthy and requires a PR), the pipeline could still fail or get interrupted which leaves us with a broken clusterctl init and breaks any automation people might have built around it...

CecileRobertMichon avatar Jan 11 '23 02:01 CecileRobertMichon

/triage accepted

fabriziopandini avatar Jan 11 '23 14:01 fabriziopandini

Relying on a shorter delay isn't reliable enough IMO. Even if we get it fully automated in one pipeline (which would be tricky given image promotion is still lengthy and requires a PR), the pipeline could still fail or get interrupted which leaves us with a broken clusterctl init and breaks any automation people might have built around it...

The releases today - v1.3.4 & v1.2.11 - were impacted in exactly this way, and it was hours between the publishing of the tag and the image. The delay could have been much longer depending on circumstances, but in this case it was a build error, so a real blocker.

I'm +1 for using 404 as a signal that the latest release isn't available yet.

killianmuldoon avatar Feb 28 '23 22:02 killianmuldoon

I tried to dig a bit into the code, my TLDR is:

  • It should be relatively easy to identify the 404 at this part of our code
  • we could define a specific error and bubble that one up e.g. via:
    var ErrNotFound = errors.New("404 Not Found")
    
    and bubble that up at the above linked code via
      if ghErr, ok := err.(*github.ErrorResponse); ok {
      	if ghErr.Response.StatusCode == http.StatusNotFound {
      		return ErrNotFound
      	}
      }
    
  • As a next step we could exit early when we detect the 404 instead of retrying inside this if (cmd/clusterctl/client/repository/repository_github.go#L333); this way the long wait for the timeout would be gone (rough sketch below)

Afterwards we have options to check for ErrNotFound and:

  • At least return a human-readable error and outline the workaround (setting GOPROXY=off)
  • Or adjust our code to use the latest - 1 tag instead, or fall back to detection without the go proxy immediately (requires some refactoring)
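
Putting the above together, a minimal sketch of what the detection could look like - handleGithubErr and the go-github module version are assumptions for illustration, not actual clusterctl code:

    package repository

    import (
    	"errors"
    	"net/http"

    	"github.com/google/go-github/v53/github" // module version is an assumption
    )

    // errNotFound is a sentinel error callers can test for with errors.Is, to
    // distinguish "tag exists but the release is not published yet" from other
    // GitHub API failures.
    var errNotFound = errors.New("404 Not Found")

    // handleGithubErr maps a go-github 404 response onto errNotFound so the
    // caller can exit early instead of retrying until the timeout expires.
    func handleGithubErr(err error) error {
    	var ghErr *github.ErrorResponse
    	if errors.As(err, &ghErr) && ghErr.Response != nil && ghErr.Response.StatusCode == http.StatusNotFound {
    		return errNotFound
    	}
    	return err
    }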

chrischdi avatar Mar 03 '23 16:03 chrischdi

/close

I think the fix in #8253 is a reasonable solution to this.

killianmuldoon avatar May 23 '23 13:05 killianmuldoon

@killianmuldoon: Closing this issue.

In response to this:

/close

I think the fix in #8253 is a reasonable solution to this.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar May 23 '23 13:05 k8s-ci-robot

/reopen

This has cropped up a few times since merging #8253. There's probably room for a more holistic fix, rather than just improving the error message.

killianmuldoon avatar Jul 07 '23 08:07 killianmuldoon

@killianmuldoon: Reopened this issue.

In response to this:

/reopen

This has cropped up a few times since merging #8253. There's probably room for a more holistic fix, rather than just improving the error message.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jul 07 '23 08:07 k8s-ci-robot

/help wanted

killianmuldoon avatar Jul 07 '23 08:07 killianmuldoon

I will take care of this one /assign

cpanato avatar Jul 07 '23 12:07 cpanato

@killianmuldoon @sbueringer what is the idea here? if a release is not published we get the previous available? asking to see what y'all want

cpanato avatar Jul 12 '23 14:07 cpanato

Technically we can get the latest release available by looking at the output of https://api.github.com/repos/kubernetes-sigs/cluster-api/releases/latest. That said, I think the main goal was to avoid repeated calls to GitHub due to rate limits; if we start looking at all the tags and getting the corresponding release anyway, it seems we might run into rate limits again?

vincepri avatar Jul 12 '23 15:07 vincepri

@killianmuldoon @sbueringer what is the idea here? if a release is not published we get the previous available? asking to see what y'all want

That sounds like it would be an improvement that would help in most cases.

Something like:

  1. Get the list of tags from the go module proxy
  2. If the latest tag is a published release use it and continue successfully
  3. If the latest tag is not a published release (i.e. 404), try the next most recent tag. We should also log what's happening so the user can see clearly.
  4. We should try this a limited number of times - e.g. 5 - and then terminate with an error. (Rough sketch below.)
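
A rough sketch of that flow, purely for illustration - pickLatestPublishedVersion, releaseExists and the errNotFound sentinel are hypothetical names, not existing clusterctl code:

    package repository

    import (
    	"errors"
    	"log"
    )

    // errNotFound would be the sentinel error returned when a tag exists but
    // the corresponding GitHub release is not published yet.
    var errNotFound = errors.New("404 Not Found")

    // maxFallbackAttempts bounds how many tags we try before giving up.
    const maxFallbackAttempts = 5

    // pickLatestPublishedVersion walks versions (sorted newest first, as listed
    // by the go module proxy) and returns the first one that has a published
    // release. releaseExists is a hypothetical helper that returns errNotFound
    // on a 404 from the GitHub API.
    func pickLatestPublishedVersion(versions []string, releaseExists func(string) error) (string, error) {
    	for i, v := range versions {
    		if i >= maxFallbackAttempts {
    			break
    		}
    		err := releaseExists(v)
    		if err == nil {
    			return v, nil
    		}
    		if errors.Is(err, errNotFound) {
    			// Tag exists but the release is not published yet: log it and
    			// fall back to the next most recent tag.
    			log.Printf("release %s is not published yet, trying the next most recent tag", v)
    			continue
    		}
    		return "", err
    	}
    	return "", errors.New("no published release found within the most recent tags")
    }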

killianmuldoon avatar Jul 12 '23 16:07 killianmuldoon

@killianmuldoon @sbueringer what is the idea here? if a release is not published we get the previous available? asking to see what y'all want

That sounds like it would be an improvement that would help in most cases.

Something like:

  1. Get the list of tags from the go module proxy
  2. If the latest tag is a published release use it and continue successfully
  3. If the latest tag is not a published release (i.e. 404), try the next most recent tag. We should also log what's happening so the user can see clearly.
  4. We should try this a limited number of times - e.g. 5 - and then terminate with an error.

What's the sort order of tags, btw? This feels ripe for edge cases.

mdbooth avatar Jul 18 '23 09:07 mdbooth

In go modules versions are listed according to semantic versioning - it's the equivalent of:

go list -m -versions sigs.k8s.io/cluster-api

killianmuldoon avatar Jul 18 '23 09:07 killianmuldoon

Maybe this isn't ideal, but it's very practical: let's improve the error message to users, informing them that this problem might happen if a release is being cut.

This should be enough for users doing the quick start, but for other use cases IMO we should just stick to what we are recommending in the book for clusterctl:

Pinning the version provides better control over what clusterctl chooses to install (usually required in an enterprise environment). Version pinning should always be used when using image overrides, or when relying on internal repositories with a separated software supply chain, or a custom versioning schema.

fabriziopandini avatar Jul 19 '23 10:07 fabriziopandini

let's improve the error message to the users, informing them that this problem might happen if a release is being cut.

I think one of the problems with this is that some providers might have long windows between cutting a tag and publishing a release. Some might even want to cut tags without having an associated release. The current clusterctl behaviour imposes a release model or process on all providers.

We improved the error message advising users to set GOPROXY=off in https://github.com/kubernetes-sigs/cluster-api/pull/8253, but we've had multiple reports since of this being disruptive and not well communicated.

IMO a solution like the one in https://github.com/kubernetes-sigs/cluster-api/issues/7889#issuecomment-1632823726 would be helpful in getting rid of 90 percent of these issues

killianmuldoon avatar Jul 19 '23 10:07 killianmuldoon

ACK, I'm ok with it, let's hope someone picks up this work soon

fabriziopandini avatar Jul 19 '23 10:07 fabriziopandini

i will remove my assignment a bit busy for now :(

/unassign /help

cpanato avatar Sep 27 '23 10:09 cpanato

@cpanato: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

i will remove my assignment a bit busy for now :(

/unassign /help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 27 '23 10:09 k8s-ci-robot

@killianmuldoon @fabriziopandini So we want to change the defaultVersion here https://github.com/kubernetes-sigs/cluster-api/blob/c7b145049a4a3649e547a18b61c40f0bce5ddde0/cmd/clusterctl/client/repository/repository_github.go#L257 - and with it the latestContractRelease implementation - by changing the latestPatchRelease implementation here https://github.com/kubernetes-sigs/cluster-api/blob/c7b145049a4a3649e547a18b61c40f0bce5ddde0/cmd/clusterctl/client/repository/repository_versions.go#L86 to return a list of, say, the 5 latest releases, so that we try to get the file https://github.com/kubernetes-sigs/cluster-api/blob/c7b145049a4a3649e547a18b61c40f0bce5ddde0/cmd/clusterctl/client/repository/repository_versions.go#L44 for each of them until we no longer receive errNotFound or the list of versions is exhausted. Is the approach correct?

Dhairya-Arora01 avatar Oct 06 '23 15:10 Dhairya-Arora01

👋 @Dhairya-Arora01 that approach sounds like you're on the right track. are you still interested in working on this? would be a great improvement to have (in fact just came up in this latest release)

cahillsf avatar Jan 18 '24 23:01 cahillsf

@cahillsf , sure i will work on this.. :smile:

Dhairya-Arora01 avatar Jan 19 '24 03:01 Dhairya-Arora01

/assign

Dhairya-Arora01 avatar Jan 19 '24 03:01 Dhairya-Arora01

This is still hurting CAPO users every time we do a release.

mdbooth avatar Jan 30 '24 15:01 mdbooth

This looks like a rather wider issue affecting other projects, like cluster-api-operator (xref to slack discussion).

TL;DR: the CAPI operator uses the clusterctl lib under the hood when trying to download manifests to deploy a provider, and creates a repo using the clusterctl lib, which in turn hits this issue directly.

Meaning, any InfraProvider that the CAPI operator supports with a pushed tag but an unpublished release will fail.

/cc @Fedosin @alexander-demicev

furkatgofurov7 avatar Feb 27 '24 10:02 furkatgofurov7

/assign

Fedosin avatar Feb 27 '24 10:02 Fedosin