[Feature] Documentation on troubleshooting Fleet Manager

Open tillig opened this issue 2 years ago • 2 comments

Is your feature request related to a problem? Please describe.

If you deploy something to Fleet Manager and it fails to replicate (for example, it's valid by the CRD schema but an admission controller in a member cluster rejects it), it's hard to see when things were replicated, what the status is, and so on. There's no documentation explaining how to troubleshoot issues like this, or how to view Fleet Manager logs to determine whether it tried to replicate something and failed.

Describe the solution you'd like

Documentation that explains how to do things like:

  • Get the logs for Fleet Manager replication (ClusterResourcePlacement executing updates in members, for example).
  • Troubleshoot potentially common replication issues (for example, you deploy to the hub cluster and a member cluster rejects the resource because of an admission controller).
  • Somehow check replication status across member clusters to see if everything is in sync.

Describe alternatives you've considered

There really aren't alternatives. The docs on internals are pretty thin and the logs that get sent to Log Analytics don't seem to have this information, at least not that I can find. It's pretty hard to reverse engineer what's going on here.

tillig avatar Jun 08 '23 14:06 tillig

  • (In progress) Get the logs for Fleet Manager replication (ClusterResourcePlacement executing updates in members, for example).

  • (DONE) Troubleshoot potentially common replication issues (for example, you deploy to the hub cluster and a member cluster rejects the resource because of an admission controller).

    • Troubleshooting guide (Azure): https://learn.microsoft.com/en-us/troubleshoot/azure/kubernetes-fleet/troubleshoot-clusterresourceplacement-api-issues
    • Troubleshooting guide (Upstream): https://github.com/Azure/fleet/tree/main/docs/troubleshooting
  • (DONE) Somehow check replication status across member clusters to see if everything is in sync.

    • You can see the replication status of each target member cluster in the status of ClusterResourcePlacement.
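
For anyone who wants to script this check, here is a minimal sketch in Python. It assumes the object has already been exported as JSON (for example with `kubectl get clusterresourceplacement <name> -o json` against the hub cluster) and that the per-cluster status lives under `status.placementStatuses[]` with `clusterName` and `conditions` fields, as in the upstream v1beta1 API. Those field names, the `Applied` condition type, and the sample data are assumptions for illustration; check them against the CRD version installed on your hub. The helper names `summarize_placement` and `clusters_not_ready` are hypothetical, not part of any Fleet tooling.

```python
from typing import Any

# Hypothetical, trimmed-down ClusterResourcePlacement object for illustration.
# Real objects carry more fields; names below are assumed from the v1beta1 API.
SAMPLE_CRP: dict[str, Any] = {
    "status": {
        "placementStatuses": [
            {
                "clusterName": "member-1",
                "conditions": [
                    {"type": "Scheduled", "status": "True"},
                    {"type": "Applied", "status": "True"},
                ],
            },
            {
                "clusterName": "member-2",
                "conditions": [
                    {"type": "Scheduled", "status": "True"},
                    {
                        "type": "Applied",
                        "status": "False",
                        "message": "admission webhook denied the request",
                    },
                ],
            },
        ]
    }
}


def summarize_placement(crp: dict[str, Any]) -> dict[str, dict[str, dict[str, Any]]]:
    """Map each member cluster name to its conditions, keyed by condition type."""
    summary: dict[str, dict[str, dict[str, Any]]] = {}
    for placement in crp.get("status", {}).get("placementStatuses", []):
        # Entries without a cluster name are grouped under a placeholder key.
        cluster = placement.get("clusterName", "<no cluster>")
        summary[cluster] = {c["type"]: c for c in placement.get("conditions", [])}
    return summary


def clusters_not_ready(crp: dict[str, Any], condition_type: str = "Applied") -> list[str]:
    """Return clusters whose given condition is missing or not 'True'."""
    summary = summarize_placement(crp)
    return sorted(
        cluster
        for cluster, conditions in summary.items()
        if conditions.get(condition_type, {}).get("status") != "True"
    )


if __name__ == "__main__":
    for cluster in clusters_not_ready(SAMPLE_CRP):
        condition = summarize_placement(SAMPLE_CRP)[cluster].get("Applied", {})
        print(cluster, condition.get("message", "no message reported"))
```

To run it against a real placement, replace `SAMPLE_CRP` with the result of `json.load()` on the exported file. Treating a missing condition as "not ready" is a deliberate choice here, so that a cluster that has not reported yet is surfaced rather than silently skipped.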

circy9 avatar Aug 27 '24 18:08 circy9

The "Troubleshooting" link at the bottom of the Fleet Manager table of contents currently points to the GitHub repo rather than the rendered docs. It might be good to get that updated.

tillig avatar Aug 27 '24 19:08 tillig

Thanks @tillig for the final pointer. This will be resolved in an upcoming Fleet documentation update. I'm going to shift this item to done.

sjwaight avatar Sep 25 '24 04:09 sjwaight

Hi @tillig - the documentation update has been published, and the troubleshooting link now points to the Learn docs instead of the GitHub repo. I'm going to go ahead and close this. We will continue to focus on improving our documentation, so please don't hesitate to flag any issues or provide feedback on it!

sjwaight avatar Oct 17 '24 23:10 sjwaight