Grafana dashboard: add an uptime panel to overview

Open: guoard opened this pull request 1 year ago • 10 comments

Proposed Changes

This pull request adds an uptime panel to the RabbitMQ Overview Grafana dashboard, so that users can easily track the uptime of each RabbitMQ instance.

Types of Changes

  • [ ] Bug fix (non-breaking change which fixes issue #NNNN)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • [ ] Documentation improvements (corrections, new content, etc)
  • [ ] Cosmetic change (whitespace, formatting, etc)
  • [ ] Build system and/or CI

Checklist

  • [x] I have read the CONTRIBUTING.md document
  • [x] I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [ ] All tests pass locally with my changes
  • [ ] If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • [ ] If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

guoard avatar Mar 18 '24 07:03 guoard

Thanks a lot for contributing. Unfortunately it doesn't work well as currently implemented. If you restart the pods, they get a new identity in this panel: rather than an updated (shorter) uptime, you will see multiple rows for each pod:

[Screenshot 2024-03-18 at 10 04 59: uptime panel showing multiple rows per pod]

To reproduce the problem, just run kubectl rollout restart statefulset foo and check the dashboard afterwards.

If you can fix this, I'm happy to merge.

mkuratczyk avatar Mar 18 '24 09:03 mkuratczyk

@mkuratczyk thank you for your time. I pushed another commit that should fix the problem for Kubernetes StatefulSets.

guoard avatar Mar 19 '24 11:03 guoard

I'm afraid it still doesn't work when node restarts happen (which is kind of the whole point). Looking at a cluster that went through multiple node restarts, I see this:

[Screenshot 2024-03-21 at 08 38 55: duplicated uptime rows after multiple node restarts]

mkuratczyk avatar Mar 21 '24 07:03 mkuratczyk

I conducted several tests on a 2-node k3s cluster with 5 instances of RabbitMQ, but I couldn't replicate the issue you described. However, I'm keen to assist further.

First, could you verify that the panel's Prometheus query matches the following:

rabbitmq_erlang_uptime_seconds * on(instance, job) group_left(rabbitmq_cluster) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}

If the query matches, it would be very helpful if you could share any additional details or steps that might help reproduce the issue: specific configuration, environment details, or anything else that could shed light on the problem. Thank you in advance for your help.
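For context, the query joins the two metrics on the Prometheus scrape target; here is an annotated copy of the same expression (comments only, the logic is unchanged):

# rabbitmq_erlang_uptime_seconds carries no RabbitMQ-specific labels of its own,
# so it is matched to rabbitmq_identity_info via the scrape-target labels
# (instance, job), and group_left copies rabbitmq_cluster onto the result.
rabbitmq_erlang_uptime_seconds
  * on (instance, job) group_left (rabbitmq_cluster)
  rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}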

guoard avatar Mar 24 '24 11:03 guoard

This is the manifest I used to run the RabbitMQ cluster:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: foo
spec:
  replicas: 5
  service:
    type: NodePort

guoard avatar Mar 24 '24 11:03 guoard

I can reproduce this even with a single node: just deploy it and then delete the pod to make it restart. It gets a new IP address and "becomes a new instance" (you can see the difference in the labels):

[Screenshot 2024-03-25 at 08 46 35: the restarted pod shows up as a new instance with different labels]
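The label change is also easy to see directly in Prometheus; a minimal sketch, assuming a cluster named foo as in the manifest above (the exact label values depend on the environment):

# Prometheus adds instance and job labels per scrape target; after the pod is
# recreated, the same rabbitmq_node appears under a new instance value, so the
# joined uptime series ends up with a different label set (i.e. an extra row).
rabbitmq_identity_info{rabbitmq_cluster="foo"}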

mkuratczyk avatar Mar 25 '24 07:03 mkuratczyk

Thank you for providing additional details.

I haven't run into this issue because my monitoring setup runs outside the Kubernetes cluster, with the instance label defined manually.

It appears difficult to correlate the rabbitmq_erlang_uptime_seconds metric with rabbitmq_identity_info without a unique, stable label on rabbitmq_identity_info; without one, the mapping seems infeasible.

If you agree with my assessment, please consider closing the PR.

guoard avatar Mar 27 '24 05:03 guoard

I think uptime would indeed be valuable on the dashboard, and I'm sure we can solve the query problem. I have converted this to a draft PR and will have a look at fixing it when I have more time.

mkuratczyk avatar Mar 27 '24 08:03 mkuratczyk

What are your thoughts on adopting the following approach?

max(max_over_time(QUERY[$__interval]))

I'm unsure of the exact implementation details for the query at the moment, but this method would let us track the maximum uptime within a specified interval.
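As a rough sketch only, and assuming rabbitmq_identity_info exposes a rabbitmq_node label to group on (and that a subquery over $__interval is acceptable for the panel), it could look something like this:

# Sketch: join uptime to the node identity as in the panel query above, then
# keep the largest sample per node over the interval so a restarted pod
# (with a new instance label) does not add an extra row.
max by (rabbitmq_node) (
  max_over_time(
    (
      rabbitmq_erlang_uptime_seconds
      * on (instance, job) group_left (rabbitmq_cluster, rabbitmq_node)
      rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}
    )[$__interval:]
  )
)

One caveat: until the pre-restart series goes stale, max would prefer its larger value, so something like last_over_time might turn out to be the better choice.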

guoard avatar Mar 28 '24 15:03 guoard

@mkuratczyk do you have an opinion on this approach? https://github.com/rabbitmq/rabbitmq-server/pull/10762#issuecomment-2025525365

michaelklishin avatar Apr 11 '24 22:04 michaelklishin