clusterctl tries to download the latest tag when no release exists.
What steps did you take and what happened:
Ran clusterctl init after a new tag (v1.3.2) for the repo was created, but before a new release corresponding to the tag was created. The operation failed with an error:
Fetching providers
Error: failed to get provider components for the "cluster-api" provider: failed to read "core-components.yaml" from provider's repository "cluster-api": failed to get GitHub release v1.3.2: failed to read release "v1.3.2": GET https://api.github.com/repos/kubernetes-sigs/cluster-api/releases/tags/v1.3.2: 404 Not Found []
What did you expect to happen:
Expected clusterctl to only attempt to install versions that have associated releases.
Anything else you would like to add:
This error comes from the go proxy implementation in clusterctl. GoProxy doesn't have a concept of 'releases' - it only cares about Git tags. This behaviour is on by default in clusterctl. Once a tag is pushed it becomes available to clusterctl even if there is no release, resulting in a failure.
There are two workarounds.
- Pin the cluster-api version to an existing version:
clusterctl init --core cluster-api:v1.3
- Turn off GOPROXY for clusterctl when this error occurs
GOPROXY=off clusterctl init --core cluster-api:v1.3
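To make the mismatch described above concrete, here is a minimal standalone sketch (not clusterctl code) that compares the two sources: the Go module proxy, which lists every pushed tag, and the GitHub releases API, which only knows about published releases. The endpoints are the public ones; everything else is illustrative.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// The Go module proxy lists every tagged version of the module, released or not.
	resp, err := http.Get("https://proxy.golang.org/sigs.k8s.io/cluster-api/@v/list")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	versions, _ := io.ReadAll(resp.Body)
	fmt.Println("tags known to the module proxy:")
	fmt.Println(string(versions))

	// The GitHub releases API only knows about published releases, so a freshly pushed
	// tag (e.g. v1.3.2 before its release was cut) returns 404 Not Found here.
	rel, err := http.Get("https://api.github.com/repos/kubernetes-sigs/cluster-api/releases/tags/v1.3.2")
	if err != nil {
		panic(err)
	}
	defer rel.Body.Close()
	fmt.Println("GitHub release lookup status:", rel.Status)
}
```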
/kind bug /area clusterctl
I don't think there's any reasonable action to take here - other than trying to reduce the window of time between tag creation and release - but the release team should be aware of this in case it pops up in future. @kubernetes-sigs/cluster-api-release-team
Probably a bad idea, but would it be an option to ignore versions if we get a 404 like in the error message shown above?
(no idea how that could fit into our current code)
I don't think there's any reasonable action to take here - other than trying to reduce the window of time between tag creation and release - but the release team should be aware of this in case it pops up in future.
I think this will be reduced once we automate the release process further (currently someone with write access has to push the tag manually).
But with image promotion etc. I think there will always be a delay of ~ 1 hour or so.
The release automation should help with this, but yes, the delay between the tag getting pushed and the release becoming available will depend on the image building and promotion mechanism we have now.
In the future we might want to explore a release flow that will tag and create the GitHub release at the same time (have all the assets ready when we need to tag; this might come down to how we use cloudbuild to build the container images without pushing a tag). We can explore these options in a future improvement.
For now the release automation that the release team is working on should help with reducing the delay between the tag and the release becoming available.
Probably a bad idea, but would it be an option to ignore versions if we get a 404 like in the error message shown above
Not a terrible idea, especially if we're able to narrow it down to "tag exists but no release exists".
Relying on a shorter delay isn't reliable enough IMO. Even if we get it fully automated in one pipeline (which would be tricky given image promotion is still lengthy and requires a PR), the pipeline could still fail or get interrupted which leaves us with a broken clusterctl init and breaks any automation people might have built around it...
/triage accepted
Relying on a shorter delay isn't reliable enough IMO. Even if we get it fully automated in one pipeline (which would be tricky given image promotion is still lengthy and requires a PR), the pipeline could still fail or get interrupted which leaves us with a broken clusterctl init and breaks any automation people might have built around it...
The releases today - v1.3.4 & v1.2.11 - were impacted in exactly this way, and there were hours between the publishing of the tag and the image. The delay could have been much longer depending on circumstances, but it was a build error, so a real blocker.
I'm +1 for using 404 as a signal that the latest release isn't available yet.
I tried to dig a bit into the code, my TL;DR is:
- It should be relatively easy to identify the 404 at this part of our code - we could define a specific error, e.g. var ErrNotFound = errors.New("404 Not Found"), and bubble that one up at the linked code via if ghErr, ok := err.(*github.ErrorResponse); ok { if ghErr.Response.StatusCode == http.StatusNotFound { return ErrNotFound } }
- As a next step we could exit early when we detect the 404 instead of retrying inside this if (cmd/clusterctl/client/repository/repository_github.go#L333); this way the long wait for the timeout would be gone.
Afterwards we have options to check for ErrNotFound and:
- At least return a human-readable error and outline the workaround (setting GOPROXY=off)
- Or try to adjust our code to use the latest - 1 tag instead, or try to fall back to detection without goproxy immediately (requires some refactoring)
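A minimal sketch of that detection, assuming the go-github client (the import version, the getReleaseByTag wrapper name, and the owner/repo arguments are illustrative, not the actual clusterctl code):

```go
package repository

import (
	"context"
	"errors"
	"fmt"
	"net/http"

	"github.com/google/go-github/v53/github"
)

// errNotFound signals "the tag exists, but no GitHub release has been published for it yet".
var errNotFound = errors.New("404 Not Found")

// getReleaseByTag is a hypothetical wrapper around the GitHub release lookup. It converts
// a 404 into the sentinel error so callers can stop retrying instead of waiting for a timeout.
func getReleaseByTag(ctx context.Context, client *github.Client, owner, repo, tag string) (*github.RepositoryRelease, error) {
	release, _, err := client.Repositories.GetReleaseByTag(ctx, owner, repo, tag)
	if err != nil {
		var ghErr *github.ErrorResponse
		if errors.As(err, &ghErr) && ghErr.Response != nil && ghErr.Response.StatusCode == http.StatusNotFound {
			return nil, errNotFound
		}
		return nil, fmt.Errorf("failed to read release %q: %w", tag, err)
	}
	return release, nil
}
```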
/close
I think the fix in #8253 is a reasonable solution to this.
@killianmuldoon: Closing this issue.
/reopen
This has cropped up a few times since merging #8253. There's probably room for a more holistic fix, rather than just improving the error message.
@killianmuldoon: Reopened this issue.
/help wanted
I will take care of this one. /assign
@killianmuldoon @sbueringer what is the idea here? If a release is not published, do we get the previous available one? Asking to see what y'all want.
Technically we can get the latest available release by looking at the output of https://api.github.com/repos/kubernetes-sigs/cluster-api/releases/latest. That said, I think the main goal was to avoid repeated calls to GitHub due to rate limits; if we start looking at all the tags and fetching the corresponding release anyway, it seems we might run into the limits again?
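For reference, the "latest published release" lookup mentioned above is a single API call. A hedged sketch using go-github (the import version is an assumption; drafts and pre-releases are excluded by this endpoint):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/google/go-github/v53/github"
)

func main() {
	client := github.NewClient(nil) // unauthenticated: simple, but subject to lower rate limits
	release, _, err := client.Repositories.GetLatestRelease(context.Background(), "kubernetes-sigs", "cluster-api")
	if err != nil {
		log.Fatal(err)
	}
	// Every extra call like this counts against the GitHub API rate limit,
	// which is the concern raised above.
	fmt.Println("latest published release:", release.GetTagName())
}
```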
@killianmuldoon @sbueringer what is the idea here? If a release is not published, do we get the previous available one? Asking to see what y'all want.
That sounds like it would be an improvement that would help in most cases.
Something like:
- Get the list of tags from the go module proxy
- If the latest tag is a published release, use it and continue successfully.
- If the latest tag is not a published release (i.e. 404), try the next most recent tag. We should also log what's happening so the user can see it clearly.
- We should only try this a limited number of times - e.g. 5 - then terminate with an error. (A rough sketch of this flow follows below.)
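A rough sketch of that flow, reusing the errNotFound sentinel idea from earlier in the thread; pickLatestPublishedRelease and the lookupRelease parameter are hypothetical names, not the actual clusterctl functions:

```go
package repository

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// errNotFound means "tag exists, but no GitHub release is published for it yet".
var errNotFound = errors.New("404 Not Found")

// maxFallbackAttempts bounds how many tags we try before giving up with an error.
const maxFallbackAttempts = 5

// pickLatestPublishedRelease walks versions (assumed sorted newest first, i.e. the go
// module proxy list after semver sorting) and returns the first tag whose GitHub release
// is published. lookupRelease stands in for the real "fetch release by tag" call.
func pickLatestPublishedRelease(ctx context.Context, versions []string, lookupRelease func(context.Context, string) error) (string, error) {
	for i, v := range versions {
		if i >= maxFallbackAttempts {
			break
		}
		err := lookupRelease(ctx, v)
		if err == nil {
			return v, nil
		}
		if errors.Is(err, errNotFound) {
			// Tag exists but its release isn't published yet: log it and fall back.
			log.Printf("release %s is not published yet, trying the previous version", v)
			continue
		}
		return "", err
	}
	return "", fmt.Errorf("none of the %d most recent tags has a published release", maxFallbackAttempts)
}
```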
What's the sort order of tags, btw? This feels ripe for edge cases.
In Go modules, versions are listed according to semantic versioning - it's the equivalent of:
go list -m -versions sigs.k8s.io/cluster-api
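As a small illustration of that ordering (not clusterctl code), golang.org/x/mod/semver sorts version strings the same way, with pre-releases before the corresponding release and numeric comparison of each part:

```go
package main

import (
	"fmt"

	"golang.org/x/mod/semver"
)

func main() {
	// Unordered tags; semver.Sort puts them in ascending semantic-version order,
	// so the last element is the newest tag.
	versions := []string{"v1.2.11", "v1.3.4", "v1.3.2", "v1.3.10", "v1.3.4-rc.0"}
	semver.Sort(versions)
	fmt.Println(versions)
	// Output: [v1.2.11 v1.3.2 v1.3.4-rc.0 v1.3.4 v1.3.10]
}
```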
Maybe this isn't ideal, but it is very practical. Let's improve the error message to users, informing them that this problem might happen if a release is being cut.
This should be enough for users doing the quick start, but for other use cases IMO we should just stick to what we are recommending in the book for clusterctl:
Pinning the version provides better control over what clusterctl chooses to install (usually required in an enterprise environment). Version pinning should always be used when using image overrides, or when relying on internal repositories with a separated software supply chain, or a custom versioning schema.
Let's improve the error message to users, informing them that this problem might happen if a release is being cut.
I think one of the problems with this is that some providers might have long windows between cutting a tag and publishing a release. Some might even want to cut tags without having an associated release. The current clusterctl behaviour imposes a release model or process on all providers.
We improved the error message advising users to set GOPROXY=off in https://github.com/kubernetes-sigs/cluster-api/pull/8253, but we've had multiple reports since of this being disruptive and not well communicated.
IMO a solution like the one in https://github.com/kubernetes-sigs/cluster-api/issues/7889#issuecomment-1632823726 would be helpful in getting rid of 90 percent of these issues
ACK, I'm ok with it, let's hope someone picks up this work soon
I will remove my assignment, a bit busy for now :(
/unassign /help
@cpanato: This request has been marked as needing help from a contributor.
@killianmuldoon @fabriziopandini So we want to change the defaultVersion here https://github.com/kubernetes-sigs/cluster-api/blob/c7b145049a4a3649e547a18b61c40f0bce5ddde0/cmd/clusterctl/client/repository/repository_github.go#L257, thus changing the latestContractRelease implementation by changing the latestPatchRelease implementation here https://github.com/kubernetes-sigs/cluster-api/blob/c7b145049a4a3649e547a18b61c40f0bce5ddde0/cmd/clusterctl/client/repository/repository_versions.go#L86 to return a list of, let's say, the 5 latest releases, so that we try to get the file https://github.com/kubernetes-sigs/cluster-api/blob/c7b145049a4a3649e547a18b61c40f0bce5ddde0/cmd/clusterctl/client/repository/repository_versions.go#L44 for each of them until we don't receive errNotFound or the list of versions is exhausted. Is the approach correct?
👋 @Dhairya-Arora01 that approach sounds like you're on the right track. Are you still interested in working on this? It would be a great improvement to have (in fact it just came up in this latest release).
@cahillsf, sure, I will work on this. :smile:
/assign
This is still hurting CAPO users every time we do a release.
This looks like a rather wider issue affecting other projects, like cluster-api-operator (xref to Slack discussion).
TL;DR: the CAPI operator uses the clusterctl library under the hood when trying to download manifests to deploy a provider, and creates a repository client using the clusterctl library, which in turn hits this issue directly.
Meaning any InfraProvider that the CAPI operator supports will fail if its tag has been pushed but the release is unpublished.
/cc @Fedosin @alexander-demicev
/assign