
Stale metrics-server files in /var/vcap/store/director/metrics

Open • ybykov-a9s opened this issue 3 months ago • 1 comment

Hello everyone,

I was investigating why the http_server_request_duration_seconds_bucket metric, scraped from the BOSH metrics-server, is the most populated metric in the Prometheus TSDB for several BOSH directors. The Prometheus TSDB status showed that it is present millions of times.

Eventually I found that on a significant share of existing directors, the directory /var/vcap/store/director/metrics can contain something like the following (note the file dates):

-rw-r--r-- 1 vcap vcap 1048576 Aug 29 08:02 metric_http_server_exceptions_total___32.bin
-rw-r--r-- 1 vcap vcap 1048576 Aug 29 08:01 metric_http_server_exceptions_total___34.bin
-rw-r--r-- 1 vcap vcap 1048576 Mar  6  2023 metric_http_server_request_duration_seconds___19.bin
-rw-r--r-- 1 vcap vcap 1048576 Nov 22  2024 metric_http_server_request_duration_seconds___20.bin
-rw-r--r-- 1 vcap vcap 2097152 Dec 23  2024 metric_http_server_request_duration_seconds___21.bin
-rw-r--r-- 1 vcap vcap 2097152 Nov 22  2024 metric_http_server_request_duration_seconds___22.bin
-rw-r--r-- 1 vcap vcap 2097152 Dec 23  2024 metric_http_server_request_duration_seconds___23.bin
-rw-r--r-- 1 vcap vcap 1048576 Dec 23  2024 metric_http_server_request_duration_seconds___24.bin
-rw-r--r-- 1 vcap vcap 1048576 Nov 22  2024 metric_http_server_request_duration_seconds___25.bin
-rw-r--r-- 1 vcap vcap 1048576 Jul 31  2024 metric_http_server_request_duration_seconds___26.bin
-rw-r--r-- 1 vcap vcap 1048576 Sep  8  2022 metric_http_server_request_duration_seconds___27.bin
-rw-r--r-- 1 vcap vcap 1048576 Jul 19  2022 metric_http_server_request_duration_seconds___28.bin
-rw-r--r-- 1 vcap vcap 1048576 May 31  2023 metric_http_server_request_duration_seconds___29.bin
-rw-r--r-- 1 vcap vcap 1048576 Aug 26 13:05 metric_http_server_request_duration_seconds___31.bin
-rw-r--r-- 1 vcap vcap 1048576 Aug 29 08:06 metric_http_server_request_duration_seconds___32.bin
-rw-r--r-- 1 vcap vcap 1048576 Jun 25 14:27 metric_http_server_request_duration_seconds___33.bin
-rw-r--r-- 1 vcap vcap 1048576 Aug 29 08:06 metric_http_server_request_duration_seconds___34.bin
-rw-r--r-- 1 vcap vcap 1048576 Aug 26 13:06 metric_http_server_request_duration_seconds___35.bin
-rw-r--r-- 1 vcap vcap 1048576 Jul 30 13:32 metric_http_server_request_duration_seconds___36.bin
-rw-r--r-- 1 vcap vcap 1048576 Aug 26 13:06 metric_http_server_request_duration_seconds___37.bin
-rw-r--r-- 1 vcap vcap 1048576 Aug 29 08:06 metric_http_server_request_duration_seconds___38.bin
-rw-r--r-- 1 vcap vcap 1048576 Jun 23 18:31 metric_http_server_request_duration_seconds___39.bin
-rw-r--r-- 1 vcap vcap 1048576 Mar 25 15:06 metric_http_server_request_duration_seconds___40.bin
-rw-r--r-- 1 vcap vcap 1048576 May 27 10:49 metric_http_server_request_duration_seconds___41.bin
-rw-r--r-- 1 vcap vcap 1048576 Mar 25 15:06 metric_http_server_request_duration_seconds___43.bin
-rw-r--r-- 1 vcap vcap 1048576 Jan 31  2025 metric_http_server_request_duration_seconds___44.bin

When scraping a BOSH director's metrics-server endpoint, the response could easily be hundreds of megabytes in size, and it contained hundreds of thousands of old metrics.
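To see which metric dominates such a scrape, the response can be tallied by metric name. This is a minimal sketch that assumes the endpoint returns the standard Prometheus text exposition format; the function name, the label names, and the sample values are made up for illustration and are not taken from a real director.

```python
from collections import Counter


def count_samples_per_metric(exposition_text: str) -> Counter:
    """Count sample lines per metric name in Prometheus text exposition output."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


# Illustrative input only; a real scrape would be read from the endpoint.
sample = """\
# TYPE http_server_request_duration_seconds histogram
http_server_request_duration_seconds_bucket{le="0.1",path="/info"} 3
http_server_request_duration_seconds_bucket{le="+Inf",path="/info"} 5
http_server_request_duration_seconds_sum{path="/info"} 1.2
http_server_exceptions_total 0
"""
top = count_samples_per_metric(sample).most_common(1)
print(top)
```

Running it against a saved scrape (for example, the body of the scrape response written to a file) gives a quick per-metric count without going through Prometheus.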

Stopping the director, removing those files, and starting the director again helped, as described in issue 2332.

My question: does it make sense to check for an outdated metrics-server file store and clean it up on director start? The problem appears to be long-standing, and I see it on a significant share of deployed directors.
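As a rough illustration of what such a check could look like, here is a sketch that treats a file as stale when its modification time is older than a threshold. This is not the director's actual code: the function names, the `metric_*.bin` glob (taken from the listing above), and the 30-day threshold are all assumptions, and modification time may not be the right staleness signal. It defaults to a dry run and is meant to run only while the director is stopped.

```python
import os
import tempfile
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def find_stale_metric_files(store_dir, max_age_days, now=None):
    """Return metric_*.bin files whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(
        p for p in Path(store_dir).glob("metric_*.bin")
        if p.stat().st_mtime < cutoff
    )


def remove_stale_metric_files(store_dir, max_age_days, dry_run=True, now=None):
    """Delete stale files unless dry_run is set; return the stale paths."""
    stale = find_stale_metric_files(store_dir, max_age_days, now)
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale


# Demo in a temporary directory, so nothing on a real director is touched.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp, "metric_http_server_request_duration_seconds___19.bin")
    new = Path(tmp, "metric_http_server_request_duration_seconds___34.bin")
    old.touch()
    new.touch()
    # Backdate one file by a year to simulate a leftover from an old process.
    year_ago = time.time() - 365 * SECONDS_PER_DAY
    os.utime(old, (year_ago, year_ago))

    reported = [p.name for p in remove_stale_metric_files(tmp, 30)]
    old_still_exists = old.exists()  # dry run must not delete anything
    removed = [p.name for p in remove_stale_metric_files(tmp, 30, dry_run=False)]
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```

The dry-run default is deliberate: on a real store it lets you review the list of candidates before anything is deleted.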

This particular director is deployed on AWS and has bosh-deployment commit Stemcell: light-bosh-stemcell-1.894-aws-xen-hvm-ubuntu-jammy-go_agent.tgz

Expected behavior: the BOSH metrics-server removes obsolete files from its file store and doesn't expose metrics from obsolete files on its scrape endpoint.

ybykov-a9s, Aug 29 '25 12:08

Makes sense, open for contribution ...

a-hassanin, Sep 04 '25 15:09