
e2e: Run an extra upgrade after the first one

Itxaka opened this issue on Nov 18, 2022 · 2 comments

On my local tests I have noticed that after running one upgrade, if another one is added with the same osImage, the upgrade doesn't run.

I think this is due to the hash of the data that the Plan runs on, which results in the secret or plan having the same hash as one that already ran, so no jobs are triggered for the new plan.

This should not be an issue for a normally versioned osImage, but if you ever do an upgrade to a `latest` version and then run it again after some time, you might find out that it won't run.
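To make the `latest` case concrete, here is a toy Go sketch of the mechanism I suspect (made-up names and registry, not the actual SUC code): if the hash only covers the plan inputs, such as the image reference string, an unchanged `latest` tag hashes identically even when the image behind it has changed.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// hashPlanInputs stands in for whatever SUC actually hashes (plan
// spec, secret data, ...). Identical inputs always yield an
// identical digest, no matter what the tag points to by now.
func hashPlanInputs(inputs string) string {
	return fmt.Sprintf("%x", sha256.Sum224([]byte(inputs)))
}

func main() {
	first := hashPlanInputs("registry.example.com/os:latest")
	// The tag now points to a newer image, but the reference string
	// inside the plan is byte-for-byte the same...
	second := hashPlanInputs("registry.example.com/os:latest")
	fmt.Println(first == second) // true -> no new jobs are triggered
}
```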

I created a PR to improve the plan names and make them more specific than os-upgrader, but if we had a test, even a failing one, it could be corroborated that this indeed works. A hypothetical sketch of the idea follows.
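Roughly what I mean by more specific names (the real naming scheme in the PR may differ; this helper is made up for illustration): derive the plan name from the image reference, so two different upgrades can never collide on a single fixed os-upgrader name.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// planName is a hypothetical helper: suffix the plan name with a
// short digest of the image reference instead of reusing the fixed
// "os-upgrader" name for every upgrade.
func planName(image string) string {
	sum := sha256.Sum224([]byte(image))
	return fmt.Sprintf("os-upgrader-%x", sum[:4])
}

func main() {
	fmt.Println(planName("registry.example.com/os:v1.1"))
	fmt.Println(planName("registry.example.com/os:v1.2")) // different plan name
}
```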

— Itxaka, Nov 18 '22 14:11

> I think this is due to the hash of the data that the Plan runs on, which results in the secret or plan having the same hash as one that already ran, so no jobs are triggered for the new plan.

Mmmm, this is interesting, so you mean that if the plan contents do not change (aka the image reference remains unchanged, even though the image contents did) the upgrade can't be executed again 🤔 This is probably not a bad thing actually; in fact, it is one of the reasons not to use tags such as `latest`. Anyway, this is a relevant point to clarify.

I guess this is not what you are seeing, but if I am not mistaken the upgrade only happens when /etc/os-release differs from the previous one. In that case the job would still get triggered, though, which is why I think this is not what you are seeing.

— davidcassany, Nov 18 '22 14:11

> I think this is due to the hash of the data that the Plan runs on, which results in the secret or plan having the same hash as one that already ran, so no jobs are triggered for the new plan.

> Mmmm, this is interesting, so you mean that if the plan contents do not change (aka the image reference remains unchanged, even though the image contents did) the upgrade can't be executed again 🤔 This is probably not a bad thing actually; in fact, it is one of the reasons not to use tags such as `latest`. Anyway, this is a relevant point to clarify.

I think that the secret data (always called os-upgrade-data) does not change, and the Plans rely very heavily on secrets plus checksums of those. So it's checking the secret data, but the Status checksum comes back the same.

This was an upgrade that finished:

time="2022-11-17T13:13:40Z" level=debug msg="PLAN STATUS HANDLER: plan=cattle-system/os-upgrader@21342, status={Conditions:[{Type:LatestResolved Status:True LastUpdateTime:2022-11-17T13:13:40Z LastTransitionTime: Reason:Version Message:}] LatestVersion:latest LatestHash:2fe7f23505a825baaa4b0afcc7388985291b958a0813524fbdf64151 Applying:[]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"

Then I deleted the upgrade and added a new one, result:

time="2022-11-17T13:18:24Z" level=debug msg="PLAN GENERATING HANDLER: plan=cattle-system/os-upgrader@22689, status={Conditions:[{Type:LatestResolved Status:True LastUpdateTime:2022-11-17T13:18:24Z LastTransitionTime: Reason:Version Message:}] LatestVersion:latest LatestHash:2fe7f23505a825baaa4b0afcc7388985291b958a0813524fbdf64151 Applying:[m-79cfdb39-bc36-41d6-a2b2-5726b6bca234]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"

See that the times are different (5 minutes between them), BUT the hash of the status is the same.

So what I'm thinking is that it's either doing a checksum of the Version, or of the secret data name, or a mix/match of both of those, and storing it somewhere?
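Something like this toy sketch is what I have in mind (pure speculation about the inputs, not the actual code). The LatestHash in the logs above is 56 hex characters, which looks like a SHA-224 digest, so the sketch uses New224, but that is a guess too:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	// Speculative mix of inputs: both were identical across my two
	// upgrade attempts, so the resulting digest has to match too.
	h := sha256.New224()
	h.Write([]byte("latest"))          // LatestVersion, unchanged
	h.Write([]byte("os-upgrade-data")) // secret name, always the same
	fmt.Printf("%x\n", h.Sum(nil))     // same 56-char digest every run
}
```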

I don't know, it's not clear to me, and when I applied a patch that changed the names of the upgrade resources, this issue did not happen to me again.

In any case, I have tried so many things with the upgrade in the last couple of days that this could be a non-issue, but I think it's a good thing to test AND to confirm whether this works the way I suspect, or it works as intended and it was a problem with my setup (I can still reproduce it with a fresh cluster, so...)

And then I guess we can make a decision on whether this is supported or not, but IMO this could lead to all sorts of reports of upgrades "not working". I can easily see an image that has an issue getting fixed and pushed again under the same tag, triggering this. Or those meta images that keep changing under a MAJOR.MINOR version (ahem, sle/opensuse, ahem) where you never get to trigger an update, etc...

> I guess this is not what you are seeing, but if I am not mistaken the upgrade only happens when /etc/os-release differs from the previous one. In that case the job would still get triggered, though, which is why I think this is not what you are seeing.

Yes, this is a different issue. For that to happen, the bundle needs to be deployed, jobs created, pods deployed, and SUC called; THEN it checks if the os-release is the same. Which is pretty nice, as that supposedly gives you a way of exercising the upgrade process BUT without actually upgrading, because they are the same.
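As a toy illustration of that late check (hypothetical names, not the real pod code): everything up to the pod runs, and only then does the os-release comparison short-circuit the actual upgrade.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// shouldUpgrade is a made-up stand-in for the guard that runs only
// after the bundle, job, and pod have already been created.
func shouldUpgrade(hostOSRelease, imageOSRelease []byte) bool {
	return !bytes.Equal(hostOSRelease, imageOSRelease)
}

func main() {
	host, err := os.ReadFile("/etc/os-release")
	if err != nil {
		fmt.Println("cannot read os-release:", err)
		return
	}
	image := host // pretend the upgrade image ships the same os-release
	if !shouldUpgrade(host, image) {
		fmt.Println("same os-release: whole pipeline ran, nothing upgraded")
	}
}
```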

— Itxaka, Nov 18 '22 15:11