baseplate.py
baseplate.py copied to clipboard
Reduce cardinality of http_server_active_requests
Closes #
💸 TL;DR
Set the multiprocess mode on http_server_active_requests
to "livesum" to reduce the cardinality to one metric per server.
📜 Details
Design Doc
Jira
🧪 Testing Steps / Validation
✅ Checks
- [ ] CI tests (if present) are passing
- [ ] Adheres to code style for repo
- [ ] Contributor License Agreement (CLA) completed if not a Reddit employee
Yup, that's fine, I'll make a proposal to update the spec.
Probably all of the active_requests
metrics. I don't think they're useful on a per-pid basis.
Hey can we actually get this implemented on all Gauge
metrics and try to expedite this into a new release? For services that are still deployed on legacy EC2 / puppet infrastructure not having a live*
multiprocessing mode is accumulating *.db
files and thousands of metrics due to autoreload on these servers that create new PIDs. This is causing scrape times to balloon up to 1.5-2 seconds and is causing event loop contention with the main application server.
@garretthoffman I'm not sure this will actually help the situation you're seeing. Even with livesum
, we still need to read all the *.db
files in order to produce the aggregation.
I haven't looked closely at the Python implementation, but when we did similar things in Ruby, we avoided using the real system PID value, but a worker number. This way when you roll worker processes, they recycle the same identifier. This avoided the infinite generating of new *.db
files every time a worker restarted.
I thought we had went with this implementation in baseplate.py.
@SuperQ Understood about the *.db
and well get in the other fix you committed. Just for some more information about 9600 out of 10000 metrics were from these Guage PIDs on some samples I checked so anything to reduce the cardinality on these would be great. Worker number seems extremely reasonable.
I have the baseplate spec change approved. Just waiting for it to be merged.
Are we going to make this change for the thrift server and client Gauges as well? Or do we want to handle that in a separate PR?
I can follow up with more, there are a lot of other Gauges that I need to go through.
@garretthoffman I've updated the PR to add livesome to most of the Gauge metrics.
@garretthoffman I've updated the PR to add livesome to most of the Gauge metrics.
Appreciate this. The two offending metrics for minsky though are thrift_server_active_requests
and cassandra_client_active_requests
which are not updated here. Is there a reason that these can't be updated?
Done, added those as well.