fleet icon indicating copy to clipboard operation
fleet copied to clipboard

CLI Troubleshooting and Diagnostics

Open manno opened this issue 4 months ago • 0 comments

As an On-call Engineer or Developer, I want to run a comprehensive diagnostic CLI command that helps troubleshoot common issues and generate information for support tickets.

First and foremost, the CLI should gather relevant diagnostic information, like resource statuses, Fleet's k8s events and logs.

Additional, detailed diagnostics could check agent registration, resource dependencies, and communication problems, so that root causes can be identified.

Acceptance Criteria:

  • Ability to generate a diagnostic bundle (logs, resource YAMLs) for offline analysis and sharing with support.

    • https://fleet.rancher.io/troubleshooting#where-to-look-for-root-causes-of-issues
    • the bundle should snapshot fleet resources over a period of 30s (?)
    • k8s events
    • also dump the metrics
  • A fleet troubleshoot or fleet doctor command acts as the main entry point to validate the current cluster.

    • Fetch metrics and analyze queue values (are queues healthy)?
    • provide a summary of checks performed (pass/fail), highlights potential issues, and suggests corrective actions or further commands for deeper investigation (e.g., "try kubectl logs -n cattle-fleet-system fleet-agent-<pod-id> on cluster X").

manno avatar Aug 20 '25 13:08 manno