Stage shows Running verifications even when all are Failed

Open jessesuen opened this issue 1 year ago • 3 comments

Description

All my AnalysisRuns are Failed:

$ k get ar
NAME                                     STATUS   AGE
dev.01j02dz4tbp75sf7zmvgkw3d3j.9e5f6ab   Failed   44h
dev.01j02hycta7xsjz7xczamptk6c.9d167fb   Failed   43h
dev.01j047c0f8qcgq8fyj3mn6sa0r.ab22279   Failed   28h
dev.01j03w1h1njwvj432nbh8m7xvk.a72c52c   Failed   31h
dev.01j03wxxntp3a3xm5pbpdzzzkd.bd13218   Failed   31h
dev.01j047080w47q6bvgdaqc11tws.d8df953   Failed   28h
dev.01j046aze40vzfvbwfc3mpvpc4.1bda081   Failed   28h
dev.01j03z1dzstgnx5tw85ty3apxr.97465d1   Failed   30h
dev.01j05679g7kccfehb2spejqcx7.151cea4   Failed   19h
dev.01j059yc9xmrej9hna0yspwxsz.3e53240   Failed   18h
dev.01j05me1tzty93z5mgvgpykvp3.1128f45   Failed   15h

Yet the UI still shows some as Running:

image

Steps to Reproduce

I have a feeling https://github.com/akuity/kargo/issues/2142 may have something to do with this (e.g. we are only processing the last one)

Version

v0.7.0

jessesuen avatar Jun 12 '24 22:06 jessesuen

@jessesuen and I spoke about this offline, and it looks like this may be a backend bug. The UI uses the raw Status of the AnalysisRun to generate the icon, so there shouldn't be a way for the mismatch between CLI and UI shown above to occur. It's possible this was already fixed inadvertently.

rbreeze avatar Sep 19 '24 17:09 rbreeze

Correct, @jessesuen. I'm not sure, but this could be related to, or fixed by, https://github.com/akuity/kargo/issues/2128

The UI relies on the Stage's freight history, which contains the verification history. When a user runs multiple promotions for the same Stage, the verifications for those promotions run in parallel without waiting on one another. The status of some of the earlier verifications then freezes in the Stage's status. If you open the details of such a verification from the UI, however, the status is correct and matches exactly what you see when you run k get ar for that particular AnalysisRun.

There are multiple solutions to this that I can think of.

First, we could fix the verification history so that the statuses of any previous AnalysisRuns are available in the history. (Sorry for the word "fix"; this might be expected behaviour, so please enlighten me if I am wrong.)

Second, the UI could stop relying on freight history and instead query the AnalysisRun resources directly with a label filter. However, I am not sure we can rely on the labels being unique.
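
The label-based lookup in the second option could look roughly like the sketch below. It filters an in-memory list instead of calling the Kubernetes API, and the type, the function, and the label keys example.io/stage and example.io/freight are hypothetical stand-ins, not keys Kargo is confirmed to set. It also illustrates the uniqueness concern: a selector can match more than one AnalysisRun.

```go
package main

import "fmt"

// AnalysisRun is a simplified, hypothetical stand-in for the real resource.
type AnalysisRun struct {
	Name   string
	Labels map[string]string
	Phase  string
}

// selectByLabels returns every run whose labels contain all selector pairs.
// More than one run may match, so callers must not assume uniqueness.
func selectByLabels(runs []AnalysisRun, selector map[string]string) []AnalysisRun {
	var matched []AnalysisRun
	for _, run := range runs {
		matches := true
		for key, want := range selector {
			if got, found := run.Labels[key]; !found || got != want {
				matches = false
				break
			}
		}
		if matches {
			matched = append(matched, run)
		}
	}
	return matched
}

func main() {
	// Hypothetical label keys, used only for illustration.
	runs := []AnalysisRun{
		{Name: "dev.run-1", Labels: map[string]string{"example.io/stage": "dev", "example.io/freight": "f1"}, Phase: "Failed"},
		{Name: "dev.run-2", Labels: map[string]string{"example.io/stage": "dev", "example.io/freight": "f2"}, Phase: "Failed"},
		{Name: "dev.run-3", Labels: map[string]string{"example.io/stage": "dev", "example.io/freight": "f1"}, Phase: "Running"},
	}
	selector := map[string]string{"example.io/stage": "dev", "example.io/freight": "f1"}
	for _, run := range selectByLabels(runs, selector) {
		fmt.Println(run.Name, run.Phase)
	}
}
```

Because the first Freight is promoted twice in this example, the same selector matches two runs, so the UI would still need a tie-breaker such as the run name.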

In any case, I would like to hear your thoughts - @jessesuen @krancour @hiddeco

Marvin9 avatar Sep 24 '24 16:09 Marvin9

This is a controller bug. It freezes the AR status in the Stage's status -> freightHistory -> verificationHistory. I have a reproduction:

  • Create a dummy Warehouse and Stage.
  • Create 2 Freight. Create a dummy AnalysisTemplate that takes around 10-20 seconds (the exact duration doesn't matter, as long as you can start the next promotion before the running verification finishes).
  • Promote the first Freight to the Stage, immediately promote the second Freight to the Stage, and then immediately promote the first one again. Check the verification history in the UI, or read it from the YAML: run kubectl describe stage and look inside the status field.

Marvin9 avatar Sep 26 '24 20:09 Marvin9

I looked into this, and the problem is that the reconciler only looks at the AnalysisRun for the current Freight and not any previous ones that may have been replaced due to #2128.

While solving #2128 would cure the symptoms, we need to change the logic in the reconciler to ensure it's more resilient to these types of logical issues. For example, by listing all AnalysisRuns for a Stage and then processing their results.
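
A minimal sketch of that idea, assuming simplified stand-in types rather than Kargo's actual API: every non-terminal entry in the verification history is refreshed from the full list of AnalysisRuns, not only the entry for the current Freight. The names AnalysisRun, VerificationInfo, isTerminal, and syncHistory are hypothetical; the phase strings follow the Argo Rollouts AnalysisRun phases.

```go
package main

import "fmt"

// AnalysisRun and VerificationInfo are simplified, hypothetical stand-ins
// for the real resources; they are not Kargo's actual types.
type AnalysisRun struct {
	Name  string
	Phase string
}

type VerificationInfo struct {
	AnalysisRunName string
	Phase           string
}

// isTerminal reports whether a phase can no longer change.
func isTerminal(phase string) bool {
	switch phase {
	case "Successful", "Failed", "Error", "Inconclusive":
		return true
	}
	return false
}

// syncHistory returns a copy of history in which every non-terminal entry
// takes the phase of its matching AnalysisRun, if that run is in the list.
func syncHistory(history []VerificationInfo, runs []AnalysisRun) []VerificationInfo {
	phaseByName := make(map[string]string, len(runs))
	for _, run := range runs {
		phaseByName[run.Name] = run.Phase
	}
	updated := make([]VerificationInfo, len(history))
	copy(updated, history)
	for i, entry := range updated {
		if isTerminal(entry.Phase) {
			continue // already final, nothing to refresh
		}
		if phase, ok := phaseByName[entry.AnalysisRunName]; ok {
			updated[i].Phase = phase
		}
	}
	return updated
}

func main() {
	// The first entry is "frozen" as Running although its run has failed.
	history := []VerificationInfo{
		{AnalysisRunName: "dev.run-1", Phase: "Running"},
		{AnalysisRunName: "dev.run-2", Phase: "Failed"},
	}
	runs := []AnalysisRun{
		{Name: "dev.run-1", Phase: "Failed"},
		{Name: "dev.run-2", Phase: "Failed"},
	}
	for _, entry := range syncHistory(history, runs) {
		fmt.Println(entry.AnalysisRunName, entry.Phase)
	}
}
```

In a real reconciler the list of runs would come from the API server, and entries whose AnalysisRun no longer exists would need their own handling; this sketch leaves such entries unchanged.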

hiddeco avatar Oct 07 '24 13:10 hiddeco

Not sure if this is related, but having any pending verifications in the history also disables "Reverify", making it impossible to reverify the Stage via the UI. It still works if you add the annotation manually.

Screenshot 2024-10-16 at 09 16 42
Screenshot 2024-10-16 at 09 16 29

eplightning avatar Oct 16 '24 07:10 eplightning