baseplate.py icon indicating copy to clipboard operation
baseplate.py copied to clipboard

Reduce cardinality of http_server_active_requests

Open SuperQ opened this issue 2 years ago • 2 comments

Closes #

💸 TL;DR

Set the multiprocess mode on http_server_active_requests to "livesum" to reduce the cardinality to one metric per server.

📜 Details

Design Doc

Jira

🧪 Testing Steps / Validation

✅ Checks

  • [ ] CI tests (if present) are passing
  • [ ] Adheres to code style for repo
  • [ ] Contributor License Agreement (CLA) completed if not a Reddit employee

SuperQ avatar Sep 21 '22 17:09 SuperQ

Yup, that's fine, I'll make a proposal to update the spec.

SuperQ avatar Sep 21 '22 20:09 SuperQ

Probably all of the active_requests metrics. I don't think they're useful on a per-pid basis.

SuperQ avatar Sep 21 '22 21:09 SuperQ

Hey can we actually get this implemented on all Gauge metrics and try to expedite this into a new release? For services that are still deployed on legacy EC2 / puppet infrastructure not having a live* multiprocessing mode is accumulating *.db files and thousands of metrics due to autoreload on these servers that create new PIDs. This is causing scrape times to balloon up to 1.5-2 seconds and is causing event loop contention with the main application server.

garretthoffman avatar Sep 28 '22 23:09 garretthoffman

@garretthoffman I'm not sure this will actually help the situation you're seeing. Even with livesum, we still need to read all the *.db files in order to produce the aggregation.

I haven't looked closely at the Python implementation, but when we did similar things in Ruby, we avoided using the real system PID value, but a worker number. This way when you roll worker processes, they recycle the same identifier. This avoided the infinite generating of new *.db files every time a worker restarted.

I thought we had went with this implementation in baseplate.py.

SuperQ avatar Sep 29 '22 07:09 SuperQ

@SuperQ Understood about the *.db and well get in the other fix you committed. Just for some more information about 9600 out of 10000 metrics were from these Guage PIDs on some samples I checked so anything to reduce the cardinality on these would be great. Worker number seems extremely reasonable.

garretthoffman avatar Sep 29 '22 13:09 garretthoffman

I have the baseplate spec change approved. Just waiting for it to be merged.

SuperQ avatar Sep 29 '22 15:09 SuperQ

Are we going to make this change for the thrift server and client Gauges as well? Or do we want to handle that in a separate PR?

garretthoffman avatar Sep 29 '22 16:09 garretthoffman

I can follow up with more, there are a lot of other Gauges that I need to go through.

SuperQ avatar Sep 29 '22 17:09 SuperQ

@garretthoffman I've updated the PR to add livesome to most of the Gauge metrics.

SuperQ avatar Oct 05 '22 12:10 SuperQ

@garretthoffman I've updated the PR to add livesome to most of the Gauge metrics.

Appreciate this. The two offending metrics for minsky though are thrift_server_active_requests and cassandra_client_active_requests which are not updated here. Is there a reason that these can't be updated?

garretthoffman avatar Oct 05 '22 13:10 garretthoffman

Done, added those as well.

SuperQ avatar Oct 05 '22 14:10 SuperQ