Stage's `.status.phase` is forever in `Promoting` with no running Promotions
Description
I noticed that in one instance, a bunch of Stages were stuck in the `Promoting` phase.
$ k get stages
NAME          SHARD  CURRENT FREIGHT                           HEALTH     PHASE          AGE
prod                                                                      NotApplicable  34h
prod-central         a1e76d1acaff48174da7b3abb938d57c7f07af85  Unhealthy  NotApplicable  34h
prod-west            a1e76d1acaff48174da7b3abb938d57c7f07af85  Unhealthy  NotApplicable  34h
prod-east            a1e76d1acaff48174da7b3abb938d57c7f07af85  Unhealthy  Steady         34h
ab-test-a            f40255e4e3959d5c713d0454f8df22b6aa072008  Healthy    Promoting      34h
ab-test-b            ade77c672f509413e774de167f2caf5319e427c3  Healthy    Promoting      34h
staging              4e0c9f8d4c0d7f8cbed96b17dbb4bee01aa60511  Healthy    Promoting      34h
dev                  778a4b2cd6bcbde5da6d9eb8cb242ce6941c2cb4  Healthy    Promoting      34h
This is despite not having any running promotions.
$ k get promotions | grep Running
$
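For anyone trying to confirm the same symptom, here is a minimal sketch of a filter that lists the Stages reporting `Promoting`. The helper name `promoting_stages` is hypothetical, and it assumes the table layout shown above: PHASE is read as the second-to-last field, because the SHARD column can be blank and shift the other columns.

```shell
# Hypothetical helper: print the names of Stages whose PHASE is "Promoting".
# Reads the table printed by `kubectl get stages` on stdin.
# PHASE is taken as the second-to-last whitespace-separated field, since
# blank columns (e.g. SHARD) make counting from the left unreliable.
promoting_stages() {
  awk 'NR > 1 && NF >= 3 && $(NF-1) == "Promoting" { print $1 }'
}

# Usage against a live cluster (not run here):
#   kubectl get stages | promoting_stages
```

If this prints Stage names while `kubectl get promotions | grep Running` prints nothing, the cluster is in the state described in this issue.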
Steps to Reproduce
Version
v0.5.0
Logs
Given the lack of logs, do you have any gut feeling about how this could be reproduced, or whether the last Promotion for each stuck Stage ended in an error? It almost looks as if the Stage reconciler never kicks off again.
@jessesuen did you expect the shard column to be blank for all of those?
The v0.4.0 --> v0.5.0 upgrade logic should have accounted for copying the value of the shard label to the new shard field.
If you're in a sharded topology and something has gone wrong with that process, it is possible that all those stages are no longer being reconciled, which would explain why they all appear to be stuck.
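One way to check whether the migration missed any Stages is to compare the shard label with the shard field side by side. This is a rough sketch only: the label key `kargo.akuity.io/shard` and the field path `.spec.shard` are assumptions about how the two are named, and `unmigrated_shards` is a hypothetical helper. It relies on `kubectl` printing `<none>` in `custom-columns` output when a value is missing.

```shell
# Hypothetical helper: print Stages that still carry a shard label but have
# no value in the shard field. Reads three columns on stdin: NAME LABEL FIELD.
unmigrated_shards() {
  awk 'NR > 1 && $2 != "<none>" && $3 == "<none>" { print $1 }'
}

# Usage against a live cluster (not run here); the label key and the field
# path below are assumptions and may need adjusting:
#   kubectl get stages \
#     -o 'custom-columns=NAME:.metadata.name,LABEL:.metadata.labels.kargo\.akuity\.io/shard,FIELD:.spec.shard' \
#     | unmigrated_shards
```

Any Stage this prints would be one where the label was not copied to the field, which would be consistent with the sharded controller no longer picking it up.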
This issue is quite old and things have changed a lot since it was opened. I can only assume this behavior is no longer being observed. But @jessesuen, please re-open if you know this is still an issue.