celery-exporter
Possibility of clearing metrics every X seconds (memory problem)
I am using version v0.9.2 with the variables CE_WORKER_TIMEOUT and CE_PURGE_OFFLINE_WORKER_METRICS modified; the time was changed to 20 seconds.
In my setup, every X minutes several nodes are started in batches in Kubernetes, with dozens of Celery pods consuming X queues. Prometheus scrapes the metrics from celery-exporter (9808/metrics) and stores them. The purge variables do not seem to work well in my setup: in the logs I see only 1 or 2 pods purged after many hours.
I would like to know whether a new parameter could be added to purge all /metrics every X seconds. Any tips for another solution are also welcome.
Thanks and congrats on the great project.
@adinhodovic
If your workers go offline (rotate), metrics should be cleaned up quickly. This works fine for us with up to ~100 pods. On new releases, all metrics get cleaned up quite quickly. We purge every 5 minutes, and a worker times out at 2.5 minutes. Are you not seeing the purge message often enough?
Maybe CE_GENERIC_HOSTNAME_TASK_SENT_METRIC=true will help with cardinality as well?
We don't have an option to clean all metrics at the moment.
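To illustrate the cardinality point, here is a minimal sketch in plain Python. It does not use the exporter's code; the `series_count` helper and the sample task and pod names are hypothetical. It only counts distinct label sets when the hostname label is kept per pod versus collapsed to a single generic value.

```python
def series_count(samples, generic_hostname=False):
    """Count distinct (task name, hostname) label sets.

    Hypothetical helper: each distinct label set would be one time series.
    """
    label_sets = set()
    for task_name, hostname in samples:
        # With the generic option, every pod name collapses to one label value.
        host = "generic" if generic_hostname else hostname
        label_sets.add((task_name, host))
    return len(label_sets)


# Assumed example data: one task sent from 50 short-lived pods.
samples = [("tasks.add", f"pod-{i}") for i in range(50)]

per_pod = series_count(samples)
collapsed = series_count(samples, generic_hostname=True)
```

With per-pod hostnames, the number of series grows with every pod that ever existed; with a generic hostname it stays at one per task name. This only helps for metrics that actually use the generic label.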
Hey,
I have the same problem on my side.
I tried to activate CE_GENERIC_HOSTNAME_TASK_SENT_METRIC=true, and some metrics now have their hostname set to generic, but there are still others that are labelled with the pod name. I also tried to customize CE_PURGE_OFFLINE_WORKER_METRICS and CE_WORKER_TIMEOUT, but there is no purge.
I tried to find out how the garbage collection works, and I think I partially found the cause:
- self.track_timed_out_workers() is called at every scrape.
- This method iterates over self.worker_last_seen to call self.purge_worker_metrics().
On my side, the problem is that self.worker_last_seen remains empty and never gets updated, so metrics are never purged.
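To make the suspected failure mode concrete, here is a simplified sketch of the logic as I understand it. This is not the exporter's actual code: the `WorkerTracker` class, `on_heartbeat`, `record_metric`, and the single `timeout` threshold are assumptions for illustration; only the names `track_timed_out_workers`, `worker_last_seen`, and `purge_worker_metrics` come from the source.

```python
import time


class WorkerTracker:
    """Hypothetical stand-in for the exporter's worker bookkeeping."""

    def __init__(self, timeout):
        self.timeout = timeout          # seconds without a heartbeat before purging
        self.worker_last_seen = {}      # hostname -> timestamp of last heartbeat
        self.metrics = {}               # hostname -> placeholder metric value
        self.purged = []                # hostnames purged so far, for inspection

    def on_heartbeat(self, hostname, now=None):
        # Assumed to be the only place where worker_last_seen gets filled.
        self.worker_last_seen[hostname] = time.time() if now is None else now

    def record_metric(self, hostname, value):
        # Task events can create per-hostname metrics without any heartbeat.
        self.metrics[hostname] = value

    def purge_worker_metrics(self, hostname):
        self.metrics.pop(hostname, None)
        self.worker_last_seen.pop(hostname, None)
        self.purged.append(hostname)

    def track_timed_out_workers(self, now=None):
        # Called on every scrape: only workers present in worker_last_seen
        # can ever be purged.
        now = time.time() if now is None else now
        for hostname, last_seen in list(self.worker_last_seen.items()):
            if now - last_seen > self.timeout:
                self.purge_worker_metrics(hostname)


tracker = WorkerTracker(timeout=20)
tracker.record_metric("pod-a", 1)       # metric exists, but no heartbeat was seen
tracker.record_metric("pod-b", 1)
tracker.on_heartbeat("pod-b", now=0)    # pod-b was seen once, then went away
tracker.track_timed_out_workers(now=100)
```

In this sketch pod-b is purged, while pod-a keeps its metrics forever because it never appeared in worker_last_seen. If heartbeats never reach the exporter, that would match what I observe.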
> If your workers go offline (rotate), metrics should be cleaned up quickly. This works fine for us with up to ~100 pods. On new releases, all metrics get cleaned up quite quickly. We purge every 5 minutes, and a worker times out at 2.5 minutes. Are you not seeing the purge message often enough?
> Maybe CE_GENERIC_HOSTNAME_TASK_SENT_METRIC=true will help with cardinality as well?
> We don't have an option to clean all metrics at the moment.
What do you mean by "go offline"? Is it a graceful disconnection made by the workers, or something like that? (Sorry for the question, but I know absolutely nothing about Celery.)