The application makes the `/metrics` endpoint slow with a large number of devices
Environment
- Python version: 3.11.7
- Nautobot version: 2.1.2
- nautobot-device-lifecycle-mgmt version: 2.0.3
Expected Behavior
I expect the page to load reasonably quickly, without blocking uwsgi for a significant amount of time.
Observed Behavior
I have Prometheus configured to scrape the Nautobot `/metrics` endpoint every minute. With ~1500 devices, the page takes about 30s to load. The request blocks uwsgi, which eventually makes the K8s liveness and readiness checks fail, and K8s restarts the pods.
I performed the analysis with the following piece of code (thx @Kircheneer):
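import cProfile
import pstats
from django.contrib.auth import get_user_model
from django.test import RequestFactory
from nautobot.core.views import nautobot_metrics_view

User = get_user_model()
# Build a fake GET request for the metrics endpoint.
factory = RequestFactory()
request = factory.get("/metrics")
request.user = User.objects.get(username="some-poor-fellow")
# Profile a single rendering of the metrics view.
with cProfile.Profile() as pr:
    response = nautobot_metrics_view(request)
# Print the collected stats, sorted by cumulative time.
stats = pstats.Stats(pr).sort_stats(pstats.SortKey.CUMULATIVE)
stats.print_stats()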
I noticed that this function is called 3 times, each taking ~12s:
3 0.042 0.014 36.174 12.058 /usr/local/lib/python3.11/site-packages/nautobot_device_lifecycle_mgmt/metrics.py:115(metrics_lcm_hw_end_of_support)
Steps to Reproduce
- Add 1500 devices to Nautobot
- Go to the `/metrics` endpoint.
Do we know why the function is called 3 times? I would expect the framework to call the function only once each time we fetch `/metrics`.
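One way to narrow this down might be to ask `pstats` who the callers are, via `print_callers` with a name restriction. The sketch below is self-contained and uses made-up stand-ins (`collect_metrics`, `registry_a`, `registry_b`, `render_view` are all hypothetical, not Nautobot or DLM code), so it only shows the technique, not the actual cause:

```python
import cProfile
import io
import pstats


def collect_metrics():
    """Hypothetical stand-in for metrics_lcm_hw_end_of_support."""
    return sum(range(1000))


def registry_a():
    # Hypothetical caller that invokes the collector once.
    collect_metrics()


def registry_b():
    # Hypothetical caller that invokes the collector twice.
    collect_metrics()
    collect_metrics()


def render_view():
    """Hypothetical stand-in for nautobot_metrics_view."""
    registry_a()
    registry_b()


with cProfile.Profile() as pr:
    render_view()

# Send the report to a buffer so it can be inspected or saved.
buffer = io.StringIO()
stats = pstats.Stats(pr, stream=buffer).sort_stats(pstats.SortKey.CUMULATIVE)
# Restrict the report to the function of interest and list its callers.
stats.print_callers("collect_metrics")
report = buffer.getvalue()
print(report)
```

Against the real profile from the snippet in the issue, the equivalent would presumably be `stats.print_callers("metrics_lcm_hw_end_of_support")`, which should list each call site along with its call count.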
@ubajze @Kircheneer I'm having trouble replicating this. I have a local instance with 10,000 devices, and this is what I get when running the profiling code:
3 0.0002541 8.47e-05 0.02821 0.009404 metrics.py:115(metrics_lcm_hw_end_of_support)
Can you tell me how many inventory items you have and how many HardwareLCM objects?
DLM metrics will be disabled by default in the new versions of DLM. Operators will be able to selectively enable metrics one-by-one, if desired.