[[inputs.mem]] Incorrect "used" value since v1.36 combining used + shared

Open Daryes opened this issue 3 months ago • 9 comments

Relevant telegraf.conf

# Read metrics about memory usage
[[inputs.mem]]
  # no configuration

Logs from Telegraf

no errors

System info

Telegraf 1.36.2 (git: HEAD@8bdd0265)
Ubuntu 22.04.5 LTS
Linux 5.15.0-153-generic #163-Ubuntu SMP Thu Aug 7 16:37:18 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Docker

N/A

Steps to reproduce

  1. Update from Telegraf 1.35.4 to 1.36.2.
  2. Check the recorded data for the memory metric "used".
  3. See that any graph or value based on the memory metric "used" changes drastically.

Expected behavior

no change in the "used" metric after the update.

Actual behavior

After investigation, with v1.36.2, the metric "used" is now combining the "used" + "shared" values. Reverting to v1.35.4 will bring back the expected values.

I didn't see anything mentioning this change in the v1.36.* release notes, even though it has a major impact on monitoring the system memory.

Additional info

This is very noticeable on the graph below: the sudden increase in consumption happened when updating to v1.36.2 on 10/02 at 06h. The system shown has only 32 GB of RAM and a large chunk of shared memory used by virtual machines. When stacked, the new "used" metric is clearly a combination of both "used" and "shared", and goes over the maximum amount of RAM.

Image: left part is v1.35.4 up to 10/02 at 06h00, right part is v1.36.2.

No other process was restarted aside from Telegraf. All the other memory graphs that do not use the "used" metric stayed the same before and after the update, as expected.

Daryes avatar Oct 03 '25 09:10 Daryes

We experience the same problem. Updated from 1.36.1 to 1.36.2 on Debian 11.

JasperExonet avatar Oct 03 '25 12:10 JasperExonet

Root Cause: The gopsutil library upgrade from v3 to v4 (https://github.com/influxdata/telegraf/pull/16023) changed how memory "used" is calculated on Linux:

  • Old formula (v3): Used = Total - Free - Buffers - Cached
  • New formula (v4): Used = Total - Available
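To make the difference between the two formulas concrete, here is a minimal sketch. The numbers are purely illustrative (they are not taken from the reporter's system), the function names are hypothetical, and the way `available` is derived is a deliberate simplification that assumes shared memory is counted inside `cached` but is not reclaimable.

```python
# Illustrative values only, in bytes; not measured on any real system.
GIB = 1024 ** 3


def used_v3(total, free, buffers, cached):
    """Old formula quoted above: Used = Total - Free - Buffers - Cached."""
    return total - free - buffers - cached


def used_v4(total, available):
    """New formula quoted above: Used = Total - Available."""
    return total - available


total = 32 * GIB
free = 2 * GIB
buffers = 1 * GIB
cached = 19 * GIB   # assumption: this figure includes the shared memory below
shared = 12 * GIB

# Simplifying assumption: everything in free/buffers/cached is reclaimable
# except the shared memory.
available = free + buffers + cached - shared

old = used_v3(total, free, buffers, cached)
new = used_v4(total, available)

print(f"old used:   {old / GIB:.0f} GiB")
print(f"new used:   {new / GIB:.0f} GiB")
print(f"difference: {(new - old) / GIB:.0f} GiB")
```

Under these assumptions the new value exceeds the old one by exactly the shared amount, and stacking the new "used" on top of "shared" gives 34 GiB on a 32 GiB machine, which matches the kind of overshoot described in the report.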

Why it happened:

  • gopsutil v4 adopted the more accurate Linux kernel calculation via https://github.com/shirou/gopsutil/issues/1873
  • The old formula was less accurate because Cached includes memory that isn't actually freeable
  • This change was intentional to provide more accurate memory metrics

skartikey avatar Oct 03 '25 19:10 skartikey

@skartikey Your answer is misleading. Your first link (16023) is about the library upgrade in Telegraf one year ago. It is not related to this issue, as this problem only appeared in the last few weeks.

The second link seems to be the explanation: the change was made in gopsutil in late August, released with v4.25.8, and integrated in Telegraf 1.36.2.

Daryes avatar Oct 05 '25 08:10 Daryes

Any news about this? This change actually requires updating all the formulas in dashboards and alerts that use this metric: the new calculation is disastrous for monitoring servers running databases, virtual machines or other applications using a large chunk of shared memory.

Daryes avatar Oct 15 '25 09:10 Daryes

Also noticed this and can't find anything relevant in the changelog. Reverting to 1.36.1 fixes it for the moment.

onnos avatar Oct 17 '25 08:10 onnos

As mentioned earlier, the issue appears to be the gopsutil update. I was already graphing shared memory separately and considering it used before. Essentially to get the same used number as before you can do total - free - buffered - cached.

I do think this is "more correct" in that, for example, available_percent + used_percent is now actually 100% so I support the change. Just a bit unfortunate that it was snuck in on a minor point release. This really needs a big bold changelog entry to notify people.
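As a rough sketch of the workaround and the percentage observation above: the snippet below recomputes the old-style value from the other fields and checks that the two percentages sum to 100 under the new formula. The field names follow the ones used in this comment (`total`, `free`, `buffered`, `cached`, `available`, `used`); the sample values (in MiB) and the helper names are made up for illustration.

```python
def legacy_used(fields):
    """Recompute the old-style 'used': total - free - buffered - cached."""
    return fields["total"] - fields["free"] - fields["buffered"] - fields["cached"]


def percent(part, total):
    """Express part as a percentage of total."""
    return 100.0 * part / total


# Hypothetical sample of mem fields, in MiB.
sample = {
    "total": 32768,
    "free": 2048,
    "buffered": 1024,
    "cached": 19456,
    "available": 10240,
    "used": 32768 - 10240,  # new formula: total - available
}

old_style = legacy_used(sample)
used_pct = percent(sample["used"], sample["total"])
available_pct = percent(sample["available"], sample["total"])

print("old-style used (MiB):", old_style)
print("used_percent + available_percent:", used_pct + available_pct)
```

The same subtraction can be expressed in whatever query language a dashboard or alert rule uses; the point is only that the old value stays derivable from fields that did not change.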

onnos avatar Oct 17 '25 09:10 onnos

@srebhan, @skartikey Can we have someone look at this and have it handled properly in Telegraf?

Be it "more correct" or not, as telegraf relies completely on gopsutil, the change is unavoidable. Still, it is highly intrusive with system using shared memory a lot, making any formula for the memory usage unusable until updated.

Given that gopsutil has a new parameter "useOldMemCalc" available for 6 months, I would suggest:

  • setting this parameter to "true" in Telegraf v1.36 and making it accessible as a new configuration parameter for [inputs.mem]
  • documenting it, stating it is already deprecated, to be removed in ? months
  • for Telegraf v1.37.0:
    • add a big warning in the release notes about the change in the memory usage calculation.
    • switch the parameter to "false" by default, while still allowing it to be changed in the [inputs.mem] configuration
  • in ~5 months, remove the deprecated parameter in the next 1.*.0 version, and mention it in the release notes.

That should cover the situation.

Daryes avatar Nov 17 '25 18:11 Daryes

@Daryes I disagree here:

Still, it is highly intrusive with system using shared memory a lot, making any formula for the memory usage unusable until updated.

I think you're under the impression used + shared was at all an accurate measurement before. Shared was never part of used. You can check by seeing if your totals add up. Shared memory, as the name implies, is shared dynamically with buffers and cache. It's counterintuitive but you can make sense of it by checking the output of free -h and adding up totals. You'll see you can get to a total without counting shared, but including shared will go over total.

onnos avatar Nov 17 '25 19:11 onnos

@onnos There is what should be, and there is what was done in gopsutil. You should read issue 1873, linked earlier, to get a better understanding. Rogercoll explained the situation and the needs clearly in the first message of that issue.

That said, I think you're under the impression I'm advocating for changing back the measurement and keeping it as is. And because you think the new calculation is better, you refuse to listen to anything different, wanting everybody else to comply with your desire while being blind to the real problem behind it.

There was never any debate about whether one calculation or the other is better, I don't give a damn about it. The problem lies with how the change came to Telegraf: in a dependency upgrade in 1.36.2, which is a patch release, without any warning or information, while Telegraf follows semantic versioning, which carries meaning. Yet the change itself is a breaking one, and warrants being applied only in a .0 release, at least a minor one, if not a major one. With a "breaking" message in the notes.

That is the issue here.

What I care about is that formulas were created around the data from Telegraf to accurately calculate the memory usage, integrating the quirks of that data. And what I am advocating is that this change is rolled back or fixed in the 1.36.* series, then correctly introduced in 1.37.0 (or 1.38.0 or another .0 release, that's the Telegraf team's decision) with a visible mention in the release notes.

Such a change in a patch release without notice is bound to have repercussions, raising wrong alerts through the monitoring. I got burnt by this and locked updates to the stable 1.35. There is no need for other people to have to deal with the same situation when the only information available is buried in this issue. Mistakes happen, I've got no problem with that. But there have been 2 v1.36 patch releases since the introduction of this change without anything about it, which is worrisome.

Daryes avatar Nov 18 '25 11:11 Daryes