Logical Disk free_bytes and size_bytes counter not updating instantly

Open JDA88 opened this issue 4 years ago • 12 comments

Hi,

OK, this is a weird one. I noticed that the value of the windows_logical_disk_size_bytes counter was not updating immediately after a drive extension. I then looked at the LogicalDisk > Free Megabytes Windows counter (you probably use the raw version of this one or one of its friends) and it is not updating either.

If you wait a bit (10-15 min), the counter (both in Windows and in the exporter) ends up updating. If you restart the exporter service, it looks like it updates instantly (I might have to test this one further).

Digging a little bit I found this old article https://www.ibm.com/support/pages/freembytes-not-matching-value-perfmon that gave me a hint.

From my understanding, some disk performance counters are considered costly to calculate and are not updated in real time. It looks like the drive free space and disk size are among them. The questions are:

  • Is this behavior intended for the exporter? (Is it the same on the node_exporter?)
  • Is there a way to do better (without altering the registry)? The API call returns the correct drive size instantly. (It might be overkill to call it every time, but updating the value once a minute with a cache should be enough.)

My main goal is to point out the current behavior so that it is at least documented; then, if there is a way to do better, great!

JDA88 avatar Sep 07 '21 07:09 JDA88

This is a very good find! I think at a minimum we should add this to the logical_disk collector documentation. Would you mind submitting a PR for the documentation?

I don't think changing the behavior of a particular collector to cache a result or return a value only every minute would be ideal, but perhaps there needs to be some discussion first.

breed808 avatar Sep 30 '21 20:09 breed808

> This is a very good find! I think at a minimum we should add this to the logical_disk collector documentation. Would you mind submitting a PR for the documentation?

Do you think I should add a comment directly in the counter description, or a warning at the end?

JDA88 avatar Sep 30 '21 20:09 JDA88

I think a brief note in the counter description, and a more detailed description in the collector documentation.

breed808 avatar Sep 30 '21 20:09 breed808

OK, forgive me if I didn't do it right, but I tried an edit of the doc here: https://github.com/prometheus-community/windows_exporter/pull/846

Oh OK, this DCO stuff again. Sorry, I don't understand why you can't make a simple modification like that directly from the website…

JDA88 avatar Sep 30 '21 20:09 JDA88

You might be able to add the DCO via the website. If you add a Signed-off-by at the bottom of the commit description it may pass.

E.G. Signed-off-by: Ben Reedy <[email protected]>

If not you'll have to use git commit -s --amend to amend the commit with the sign-off.

breed808 avatar Sep 30 '21 21:09 breed808

Now that the documentation is updated, the enhancement option is what remains. I know Win32 and .NET, but unfortunately I don't know anything about Go, so I can't really help there.

JDA88 avatar Sep 30 '21 22:09 JDA88

I can confirm that the free_bytes metric doesn't update in real time.

The graph below shows a clear 5 min step increase on this server, even though the scrape interval is 15 seconds and the data injection was done at a constant rate: image

The impact of this delay is different depending on the metric:

  • Delay on size_bytes: Means that an alert will stay active longer than it should > Impact low
  • Delay on free_bytes: Means that an alert might be delayed by 5-10 min > Impact very high on some of our servers

JDA88 avatar Oct 07 '21 13:10 JDA88

Hmm, are there any other sources we could use for these metrics, that update more frequently? While more recent metrics are preferable, increased resource usage and/or metric caching wouldn't be desirable.

breed808 avatar Oct 23 '21 06:10 breed808

I think most monitoring systems rely directly on API calls. I would suggest:

https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getdiskfreespaceexa
https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getvolumeinformationw

One of the benefits would also be the ability to get the volume name. As I don't know Go, I have no idea if a library exists to cover those, but I agree that a 1 min cache sounds reasonable to limit resource usage.

JDA88 avatar Oct 23 '21 06:10 JDA88

I've had a look and we're currently using the windows library for the service collector.

This library exposes GetDiskFreeSpaceEx and GetVolumeInformation functions which we can use. I'm not familiar with the win32 API, so I'll need to find usage examples or someone else can try to make use of the functions.

We also need to consider the caching implementation, which I don't believe has been implemented for this exporter.

breed808 avatar Oct 23 '21 07:10 breed808

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

github-actions[bot] avatar Nov 25 '23 02:11 github-actions[bot]

The 5 min delay in spotting issues has an impact on alert responsiveness.

JDA88 avatar Nov 25 '23 09:11 JDA88

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

github-actions[bot] avatar Feb 24 '24 02:02 github-actions[bot]