Stage shows Running verifications even when all are Failed
Description
All my AnalysisRuns are Failed:
$ k get ar
NAME STATUS AGE
dev.01j02dz4tbp75sf7zmvgkw3d3j.9e5f6ab Failed 44h
dev.01j02hycta7xsjz7xczamptk6c.9d167fb Failed 43h
dev.01j047c0f8qcgq8fyj3mn6sa0r.ab22279 Failed 28h
dev.01j03w1h1njwvj432nbh8m7xvk.a72c52c Failed 31h
dev.01j03wxxntp3a3xm5pbpdzzzkd.bd13218 Failed 31h
dev.01j047080w47q6bvgdaqc11tws.d8df953 Failed 28h
dev.01j046aze40vzfvbwfc3mpvpc4.1bda081 Failed 28h
dev.01j03z1dzstgnx5tw85ty3apxr.97465d1 Failed 30h
dev.01j05679g7kccfehb2spejqcx7.151cea4 Failed 19h
dev.01j059yc9xmrej9hna0yspwxsz.3e53240 Failed 18h
dev.01j05me1tzty93z5mgvgpykvp3.1128f45 Failed 15h
Yet the UI still shows some as Running:
Steps to Reproduce
I have a feeling https://github.com/akuity/kargo/issues/2142 may have something to do with this (e.g. we are only processing the last one)
Version
v0.7.0
@jessesuen and I spoke about this offline and it looks like this may be a backend bug. The UI uses the raw Status of the AnalysisRun to generate the icon, so there shouldn't be a way for the mismatch between CLI/UI above to occur. It's possible this was already fixed inadvertently
Correct @jessesuen, I'm not sure, but this could be related to or fixed by https://github.com/akuity/kargo/issues/2128
The UI relies on the Stage's freight history, which contains the verification history. When a user runs multiple promotions for the same Stage, the verifications for those run in parallel without waiting on one another. The statuses of some of the earlier verifications then freeze in the Stage's status, but if you open the details of the verification from the UI, the status is correct and exactly what you see when you run k get ar for that particular AnalysisRun.
There are multiple solutions to this that I can think of.
The first is to fix (sorry for this word, this might be expected behaviour, please enlighten me if I am wrong) the verification history so that the current statuses of all previous analysis runs are available in it.
The second is for the UI to stop relying on freight history. It could query the AnalysisRun resources directly with a label filter, but I am not sure whether we can rely on the labels being unique.
In any case, I would like to hear your thoughts - @jessesuen @krancour @hiddeco
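To make the second option concrete, here is a minimal sketch of selecting AnalysisRuns with a label filter. It is plain Go over in-memory data, not the real Kubernetes client, and the AnalysisRun struct, the selectByLabels helper, and the label key example.com/stage are all hypothetical names used only for illustration. Because labels may not be unique, the sketch returns every match and leaves it to the caller to key results by name.

```go
package main

import "fmt"

// AnalysisRun is a simplified, hypothetical stand-in for the real resource.
type AnalysisRun struct {
	Name   string
	Labels map[string]string
	Phase  string
}

// selectByLabels returns every run whose labels contain all key/value
// pairs in selector. More than one run can match, so callers should not
// assume the result has a single element.
func selectByLabels(runs []AnalysisRun, selector map[string]string) []AnalysisRun {
	var out []AnalysisRun
	for _, r := range runs {
		matched := true
		for k, want := range selector {
			if got, ok := r.Labels[k]; !ok || got != want {
				matched = false
				break
			}
		}
		if matched {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	// "example.com/stage" is an assumed label key, not one taken from Kargo.
	runs := []AnalysisRun{
		{Name: "run-a", Labels: map[string]string{"example.com/stage": "dev"}, Phase: "Failed"},
		{Name: "run-b", Labels: map[string]string{"example.com/stage": "dev"}, Phase: "Failed"},
		{Name: "run-c", Labels: map[string]string{"example.com/stage": "prod"}, Phase: "Running"},
	}
	for _, r := range selectByLabels(runs, map[string]string{"example.com/stage": "dev"}) {
		fmt.Println(r.Name, r.Phase)
	}
}
```

Returning a slice rather than a single run keeps the uniqueness question open: if two runs share the same labels, both are reported and can be told apart by name.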
Controller bug. It freezes the AR status in the Stage's status -> freightHistory -> verificationHistory. I have reproduction steps:
- Create a dummy warehouse and stage.
- Create 2 freights. Create a dummy AnalysisTemplate that takes around 10-20 seconds (the exact duration doesn't matter, as long as you can start the next promotion before the previous verification finishes).
- Now promote the first freight to the stage, immediately promote the second freight to the stage, and then immediately promote the first to the stage once again. Check the verification history, or read it from the YAML:
kubectl describe stage and then look inside the status field
I looked into this, and the problem is that the reconciler only looks at the AnalysisRun for the current Freight and not any previous ones that may have been replaced due to #2128.
While solving #2128 would cure the symptoms, we need to change the logic in the reconciler to ensure it's more resilient to these types of logical issues. For example, by listing all AnalysisRuns for a Stage and then processing their results.
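As a rough illustration of the "list all AnalysisRuns and process their results" idea, the sketch below refreshes every entry in a verification history from a full list of runs, instead of only the entry for the current Freight. It is not the actual reconciler code: the AnalysisRun and VerificationInfo structs and the syncVerificationHistory function are hypothetical, simplified stand-ins, and the list of runs is passed in rather than fetched from the cluster.

```go
package main

import "fmt"

// AnalysisRun is a simplified, hypothetical stand-in for the real resource.
type AnalysisRun struct {
	Name  string
	Phase string
}

// VerificationInfo is a simplified, hypothetical stand-in for one entry
// in a Stage's verification history.
type VerificationInfo struct {
	AnalysisRunName string
	Phase           string
}

// syncVerificationHistory returns a copy of history in which every entry
// that references a known AnalysisRun carries that run's current phase.
// Entries whose run is not in runs are left unchanged.
func syncVerificationHistory(history []VerificationInfo, runs []AnalysisRun) []VerificationInfo {
	phases := make(map[string]string, len(runs))
	for _, r := range runs {
		phases[r.Name] = r.Phase
	}
	out := make([]VerificationInfo, len(history))
	copy(out, history)
	for i := range out {
		if p, ok := phases[out[i].AnalysisRunName]; ok {
			out[i].Phase = p
		}
	}
	return out
}

func main() {
	// Two older entries are frozen as Running although their runs have failed.
	history := []VerificationInfo{
		{AnalysisRunName: "run-a", Phase: "Running"},
		{AnalysisRunName: "run-b", Phase: "Running"},
		{AnalysisRunName: "run-c", Phase: "Failed"},
	}
	runs := []AnalysisRun{
		{Name: "run-a", Phase: "Failed"},
		{Name: "run-b", Phase: "Failed"},
		{Name: "run-c", Phase: "Failed"},
	}
	for _, v := range syncVerificationHistory(history, runs) {
		fmt.Println(v.AnalysisRunName, v.Phase)
	}
}
```

Keying by AnalysisRun name means a replaced run still gets its final phase recorded, regardless of which Freight is current.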
Not sure if this is related, but having any pending verifications in the history also disables "Reverify", making it impossible to reverify the stage via the UI. It still works if you add the annotation manually.