
[SURE-9488] Fleet UI errors will simplify troubleshooting and reduce escalations

Open kkaempf opened this issue 11 months ago • 2 comments

SURE-9488

Request Description: Descriptive Error Messages in the Fleet UI

Expected behaviour: The UI should provide more descriptive error messages that let users pinpoint where the issue is, since Hosted Rancher logs are not accessible. Alternatively, allow viewing the Rancher logs in the UI.

Actual behaviour: The message looks like 'failed: 3/1time="2024-11-28T09:04:55Z" level=fatal msg="context canceled"', which gives no useful context about where to look for the issue.

Actual issue: fleet.yaml contains a duplicate key entry and validation fails, or sometimes the Helm chart doesn't exist.

TODO

Let's check all the error conditions and add a "context", like "error in gitjob: ".

  • [ ] Maybe strip or format the timestamps, too, so they look better.
  • [ ] Replace all "context canceled" error messages, because users don't understand "context canceled". This probably translates into "timeout waiting for gitjob to complete".
  • [ ] Revisit Failure and Readiness Conditions (see SURE-9488)
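
The first two TODO items could be sketched roughly as follows. This is a minimal illustration in Go, not Fleet's actual implementation: the helper name `describeJobMessage` and the `component` parameter are hypothetical, the log line is assumed to use the `time="…" level=… msg="…"` layout seen in the message above, and the mapping of "context canceled" to a timeout is only the guess made in this issue.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// logLine matches the log fragment embedded in the job message, e.g.
//   time="2024-11-28T09:04:55Z" level=fatal msg="context canceled"
// and captures the quoted msg value (including its quotes).
var logLine = regexp.MustCompile(`time="[^"]*"\s+level=\w+\s+msg=("(?:[^"\\]|\\.)*")`)

// describeJobMessage is a hypothetical helper: it drops the timestamp and
// log level, rewrites "context canceled" into the wording proposed in this
// issue, and prefixes the result with the component that failed.
func describeJobMessage(component, raw string) string {
	msg := raw
	if m := logLine.FindStringSubmatch(raw); m != nil {
		// The msg value is a quoted string; unquote it to resolve escapes.
		if unquoted, err := strconv.Unquote(m[1]); err == nil {
			msg = unquoted
		} else {
			msg = strings.Trim(m[1], `"`)
		}
	}
	// Assumption from the issue: "context canceled" most likely means the
	// job timed out, which is easier for users to act on.
	msg = strings.ReplaceAll(msg, "context canceled", "timeout waiting for gitjob to complete")
	return fmt.Sprintf("error in %s: %s", component, msg)
}

func main() {
	raw := `failed: 3/1time="2024-11-28T09:04:55Z" level=fatal msg="context canceled"`
	fmt.Println(describeJobMessage("gitjob", raw))
}
```

For the message quoted under "Actual behaviour", this prints `error in gitjob: timeout waiting for gitjob to complete`. Messages that do not contain a recognizable log line are passed through with only the prefix added.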

kkaempf avatar Jan 22 '25 08:01 kkaempf

Let's check all the error conditions and add a "context", like "error in gitjob: ". Maybe strip or format the timestamp, too. Replace context canceled with timeout?

This probably translates into "timeout waiting for gitjob to complete"

  • [ ] Failure and Readiness Conditions (see SURE-9488)

manno avatar Jan 22 '25 14:01 manno

I closed the Jira issue today since we've made good progress in 0.12.0.

kkaempf avatar Mar 19 '25 13:03 kkaempf

/backport v2.11.2

manno avatar Apr 30 '25 08:04 manno

/backport v2.10.6

manno avatar Apr 30 '25 09:04 manno

System Information

Before Upgrade:

Rancher Version | Fleet Version
v2.11.0 | 106.0.0+up0.12.0

Steps followed

  • Created a GitRepo using this fleet.yaml
  • Waited for the Job to be created.
  • The GitRepo is in Error state (see the screenshot below for more details):
      Job Failed. failed: 1/1time="2025-05-19T18:27:31Z" level=fatal msg="failed to process bundle: context canceled"
    
[Image: screenshot showing the error message]


After Upgrade

Rancher Version | Fleet Version
v2.12-49289cc9c6590b361d64950977dd20b1214908d7-head | 107.0.0+up0.13.0-alpha.3

  • After the upgrade, the error message clearly states the exact cause of the failure.
  • See the screenshot for the exact error message:
      	Failed to process bundle: failed reading resources for "rke-monitoring/app": loading directory .chart/2ace3fcaa23682ab77cf7bdcd5a6df94dbc1d2e2a3a25bd63a1d4b82a0fde0d1, rke-monitoring/app: helm chart download: failed to do request: Head "https://registry01.suse/v2/helm/rancher-monitoring/manifests/106.0.1_up66.7.1-rancher.10": dial tcp: lookup registry01.suse on 10.43.0.10:53: no such host
    
[Image: screenshot showing the clear error message]


sbulage avatar May 19 '25 18:05 sbulage