Can't start Prometheus middleware in prod container

Open dlip opened this issue 8 months ago • 8 comments

It works fine locally, but in prod _run_exposition_server is never called. after_process_boot is called though.

I have tried running with -p 1 -t 1 and --use-gevent but it didn't help.

dlip avatar Apr 29 '25 05:04 dlip

Same here @dlip...

It's similar to #297

During my tests, I was able to reproduce the issue locally by limiting my container CPU to the same amount as my production pod. (Using -p 1 -t 1 locally works, but it doesn't in Kubernetes. idk why)

In production, I use Kubernetes, while locally I use Docker to test the same image.

cpu: '2400m'
memory: '2048Mi'

I'm currently conducting further tests to better understand what might be happening in these scenarios. It seems like it could be a CPU-related problem during process forking, possibly a race condition.

PS: I use a custom Prometheus middleware, based on the original middleware, to alter the metric names. Additionally, before the recent addition of some actors, the application was serving the HTTP metrics server; now it doesn't.

guedesfelipe avatar May 06 '25 20:05 guedesfelipe

I've dug into the code a bit more:

The call from cli.py to canteen_get, which fetches the list of middleware to fork, returns an empty list because it calls the wait function, which finds canteen.initialized still set to False.

The confusing part is that canteen_try_init is called and sets cv.initialized = True. My guess is that something is wrong with cv.get_lock(), which is used inside the context manager, and the object isn't being modified correctly.

I can hack it to work by changing canteen_get to return ['dramatiq.middleware.prometheus:_run_exposition_server'] which I think proves this context has an issue
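For anyone following along, here is a toy model of the handshake described above. It is only a sketch: ToyCanteen, toy_try_init and toy_get are hypothetical names, it uses a plain object and a thread lock instead of shared memory, and it assumes the flag is flipped only after the guarded block exits. It is not dramatiq's actual code.

```python
import threading
from contextlib import contextmanager


class ToyCanteen:
    """Hypothetical stand-in for the shared structure; not dramatiq's code."""

    def __init__(self):
        self.lock = threading.Lock()
        self.initialized = False
        self.paths = []


@contextmanager
def toy_try_init(canteen):
    # Assumption: the flag flips only after the guarded block finishes.
    with canteen.lock:
        if canteen.initialized:
            yield False
            return
        yield True
        canteen.initialized = True


def toy_get(canteen):
    # Readers treat an unset flag as "no fork paths registered".
    return list(canteen.paths) if canteen.initialized else []


canteen = ToyCanteen()
with toy_try_init(canteen) as acquired:
    canteen.paths.append("pkg.module:fork_function")  # placeholder path
    seen_during_init = toy_get(canteen)  # flag is not set yet

seen_after_init = toy_get(canteen)
```

Under that assumption, a reader that checks while the guarded block is still running (or after it raised) sees an empty list even though the paths were already registered.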

dlip avatar May 08 '25 03:05 dlip

I think I found my problem @dlip, it's because of this line: https://github.com/Bogdanp/dramatiq/blob/master/dramatiq/cli.py#L522... Our projects take more than 30 seconds to start up completely. And this explains why when I increase the CPU of the pods in some circumstances it starts working again.

Maybe it would be interesting to pass a parameter for this timeout and not have it be something fixed, what do you think?
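To illustrate the timing problem, here is a small simulation. It is a sketch under assumptions: threads stand in for processes, simulate is a hypothetical helper, and the sleep stands in for slow application start-up. The reader waits up to a configurable timeout and falls back to an empty list if the writer is not done in time.

```python
import threading
import time


def simulate(startup_seconds, timeout):
    """Return the paths the reader sees after waiting at most `timeout` seconds."""
    ready = threading.Event()
    paths = []

    def slow_worker_boot():
        time.sleep(startup_seconds)  # stands in for slow application imports
        paths.append("pkg.module:fork_function")  # placeholder path
        ready.set()

    worker = threading.Thread(target=slow_worker_boot)
    worker.start()
    # Event.wait returns False when the timeout expires before the flag is set.
    result = list(paths) if ready.wait(timeout) else []
    worker.join()
    return result


too_short = simulate(startup_seconds=0.3, timeout=0.05)
long_enough = simulate(startup_seconds=0.05, timeout=2.0)
```

With a start-up slower than the timeout the reader gets nothing back, which would match the symptom of the exposition server never being forked. Making the timeout a parameter lets slow-booting deployments raise it.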

@karolinepauls @synweap15 @Bogdanp (Srry for the ping 🤍)

guedesfelipe avatar May 08 '25 14:05 guedesfelipe

Hey @guedesfelipe, thanks for the help with debugging this issue. I think it would be good to have the timeout value parametrized, with the default as it is currently - 30 seconds.

What do you think, @Bogdanp? I could take it on.

synweap15 avatar May 08 '25 14:05 synweap15

I can help develop and test if needed too

guedesfelipe avatar May 08 '25 14:05 guedesfelipe

@guedesfelipe Unfortunately that doesn't seem to be the issue for me; even setting it to 120s didn't work. I don't think my app takes 30s to start up, especially on a single thread/process.

I also tried setting multiprocessing.set_start_method to spawn, fork and forkserver without success

dlip avatar May 09 '25 01:05 dlip

What a pity, @dlip. Can you share your code to reproduce it so we can help you with this?

guedesfelipe avatar May 10 '25 13:05 guedesfelipe

@guedesfelipe unfortunately not, it's proprietary work code. I'll probably just use this ugly hack until I find more time to test it, sorry if it gives you nightmares 😆

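# Ugly hack: make canteen_get always return the Prometheus exposition server fork path.
DRAMATIQ_FILE="/usr/local/lib/python3.12/site-packages/dramatiq/canteen.py"
PATCH_LINE='return ["dramatiq.middleware.prometheus:_run_exposition_server"]'
# grep -F matches the line literally; without it, [...] is parsed as a bracket expression.
if [[ -f "${DRAMATIQ_FILE}" ]] && ! grep -qF "${PATCH_LINE}" "${DRAMATIQ_FILE}"; then
  # Insert the early return right after the function definition (GNU sed one-line append).
  sed -i "/def canteen_get(canteen, timeout=1):/a\\    ${PATCH_LINE}" "${DRAMATIQ_FILE}"
fi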

dlip avatar May 13 '25 00:05 dlip