Prometheus endpoint stops responding after random amount of time

Open kuba-lilz opened this issue 2 years ago • 0 comments

Some time after startup, with no discernible pattern to the interval (anywhere between a few hours and two weeks), the Prometheus endpoint on my dramatiq process stops responding.

Looking at logs I can see that:

  • workers are healthy and continue to process tasks
  • connections to the Prometheus endpoint fail with "Connection refused while connecting to upstream" (as logged by the nginx reverse proxy)
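To tell a refused connection apart from a hung one without going through nginx, the port can be probed directly. The following is a minimal sketch using only the standard library; the function name `probe_port` is made up for illustration, and the host and port to pass in depend on how the endpoint is configured.

```python
import socket


def probe_port(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify one TCP connect attempt as 'open', 'refused', 'timeout' or 'error'."""
    try:
        # create_connection performs the full TCP handshake and raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Nothing is accepting connections on this port.
        return "refused"
    except socket.timeout:
        # No answer within `timeout` seconds.
        return "timeout"
    except OSError:
        # Any other network-level failure (unreachable host, DNS error, ...).
        return "error"
```

A "refused" result typically means no process is listening on the port any more, which would point at the serving process having exited rather than being stuck. A "timeout" would point the other way. This is only a diagnostic aid and does not by itself explain why the listener went away.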

  • OS: Debian 10 (buster)
  • Python version: 3.8.7
  • dramatiq version: 1.12.0
  • prometheus-client version: 0.12.0

Checklist

  • [x] Does your title concisely summarize the problem?
  • [ ] Did you include a minimal, reproducible example?
  • [x] What OS are you using?
  • [x] What version of Dramatiq are you using?
  • [x] What did you do?
  • [x] What did you expect would happen?
  • [x] What happened?

What OS are you using?

OS: Debian 10 (buster)

What version of Dramatiq are you using?

dramatiq version: 1.12.0

What did you do?

Started the workers with the dramatiq command, nothing unusual here.

What did you expect would happen?

The Prometheus endpoint should always be live. This is usually the case, but sometimes the endpoint goes away and never responds again, even though the workers are healthy and continue to process tasks. I thought this might be some sort of counter overflow problem, but it sometimes happens a few hours after starting the workers and sometimes doesn't happen for two weeks straight, with a constant workload during that period.

What happened?

The Prometheus endpoint doesn't respond at all. The client connecting to it reports:

*4944 upstream server temporarily disabled while connecting to upstream, request: "GET / HTTP/1.1"

When the Prometheus endpoint is healthy, I can see log messages such as "GET / HTTP/1.0" 200 from dramatiq.middleware.prometheus._metrics_handler, but there are no messages from the handler when the endpoint fails.

This could be a problem with dramatiq or prometheus. I thought I would first inquire here to see if maintainers might have an idea of what the problem might be, before trying to reach out to prometheus.

I'm also willing to add some additional logging to get more information, if that's possible. Maybe there is a logger I can hook into to get a better understanding of where the code might be breaking and why?
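One possible way to hook in, sketched with the standard `logging` module only: attach a DEBUG-level file handler to the parent of the `dramatiq.middleware.prometheus._metrics_handler` logger mentioned above. The helper name `attach_debug_handler` and the log file path are hypothetical. Whether this captures anything useful depends on where it is called and on how the dramatiq command configures logging in its processes, which is an assumption here.

```python
import logging


def attach_debug_handler(
    logger_name: str = "dramatiq.middleware.prometheus",
    path: str = "prometheus-debug.log",  # hypothetical file name
) -> logging.Handler:
    """Send DEBUG-and-above records from `logger_name` and its children to a file."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)

    handler = logging.FileHandler(path)
    # Include the process id so records from different processes can be told apart.
    handler.setFormatter(
        logging.Formatter("%(asctime)s pid=%(process)d %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return handler
```

Child loggers such as `dramatiq.middleware.prometheus._metrics_handler` propagate their records to this parent logger, so the last timestamp in the file would at least bound the moment the handler stopped serving requests.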

I'm using the Prometheus endpoint as a proxy for whether my workers are live, so having this endpoint go down even though the workers are healthy is troubling ^^;

kuba-lilz, Feb 08 '22 03:02