Error fetching data for metricset linux.pageinfo: error reading pagetypeinfo
Please post all questions and issues on https://discuss.elastic.co/c/beats before opening a GitHub issue. Your questions will reach a wider audience there, and if we confirm that there is a bug, then you can open a new issue.
For security vulnerabilities please only send reports to [email protected]. See https://www.elastic.co/community/security for more information.
For confirmed bugs, please report:
- Version: metricbeat-7.17.4-1.x86_64
- Operating System: CentOS 7
- Discuss Forum URL: There was an attempt, and it was also reported here, but the topic has since been auto-closed.
- Steps to Reproduce: Enable linux.pageinfo metricset
---
- module: linux
  period: 10s
  metricsets:
    - pageinfo
Error:
May 17 15:38:58 example.host metricbeat[12345]: 2022-05-17T15:38:58.783-0700 ERROR module/wrapper.go:259 Error fetching data for metricset linux.pageinfo: error reading pagetypeinfo: error parsing zone: : strconv.ParseInt: parsing "": invalid syntax
This appears to be due to the way pagetypeinfo displays counts over 100000, combined with the module parsing each column as a plain integer, as seen roughly here in the regex expression and the subsequent integer parsing.
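For illustration, here is a minimal sketch of the failure mode, assuming the column is captured with a digits-only pattern (the pattern below is hypothetical, not the metricset's actual regex):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

func main() {
	// A digits-only capture, similar in spirit to what the metricset expects.
	digits := regexp.MustCompile(`(\d*)`)

	// On a fragmented system, one pagetypeinfo column reads ">100000".
	field := ">100000"

	// \d* happily matches the empty string in front of '>', so the
	// capture group comes back as "" and ParseInt fails exactly as logged.
	m := digits.FindStringSubmatch(field)
	if _, err := strconv.ParseInt(m[1], 10, 64); err != nil {
		fmt.Println(err) // strconv.ParseInt: parsing "": invalid syntax
	}
}
```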
Example:
>100000 is reported by pagetypeinfo, but is more accurately represented by the total count of 213501, as seen below in buddyinfo.
$ cat /proc/pagetypeinfo
Page block order: 9
Pages per block: 512
Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
Node 0, zone DMA, type Unmovable 1 0 0 0 2 1 1 0 1 0 0
Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 1 3
Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type CMA 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type Unmovable 45 490 2125 1535 908 429 139 10 0 0 0
Node 0, zone DMA32, type Reclaimable 554 760 1801 1748 1184 671 239 17 3 0 0
Node 0, zone DMA32, type Movable 206 33222 18654 4021 946 186 80 34 6 0 0
Node 0, zone DMA32, type Reserve 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type CMA 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Unmovable 1687 31115 18021 11061 3504 614 31 1 0 0 0
Node 0, zone Normal, type Reclaimable 15289 16293 13341 10028 4203 909 71 13 3 0 0
Node 0, zone Normal, type Movable 9 >100000 88777 13664 2611 818 240 14 5 0 0
Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type CMA 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Number of blocks type Unmovable Reclaimable Movable Reserve CMA Isolate
Node 0, zone DMA 1 0 7 0 0 0
Node 0, zone DMA32 165 190 1173 0 0 0
Node 0, zone Normal 1093 724 4839 0 0 0
$ cat /proc/buddyinfo
Node 0, zone DMA 1 0 0 0 2 1 1 0 1 1 3
Node 0, zone DMA32 668 34432 22579 7304 3038 1286 458 61 9 0 0
Node 0, zone Normal 17254 213501 120140 34753 10318 2341 342
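(Working through the order-1 column in zone Normal: buddyinfo reports 213501 free pages in total, and pagetypeinfo accounts for 31115 Unmovable plus 16293 Reclaimable, so the truncated Movable count would be roughly 213501 - 31115 - 16293 = 166093, assuming the two snapshots were taken close enough together to be comparable.)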
I don't have an easy way to reproduce the memory fragmentation on a system, but once system memory becomes fragmented, the kernel code roughly here truncates the pagetypeinfo counts over 100k and just displays them as >100000, which causes the parsing error mentioned above.
Ideally, the regex for this should be corrected for the pagetypeinfo stats to at least handle the > indicator, and get proper totals from buddyinfo as mentioned by @Infraded.
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Same error from our setup:
# cat /proc/pagetypeinfo
Page block order: 9
Pages per block: 512
Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 0 0 0
Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 1 2
Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type Unmovable 475 434 197 179 68 32 16 12 10 0 0
Node 0, zone DMA32, type Movable 12157 6536 1695 225 79 49 379 88 28 1 0
Node 0, zone DMA32, type Reclaimable 191 92 537 141 12 1 1 1 1 0 0
Node 0, zone DMA32, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Unmovable 66 160 177 31 10 10 7 1 0 0 0
Node 0, zone Normal, type Movable 10 >100000 70473 3338 115 109 14 60 55 1 0
Node 0, zone Normal, type Reclaimable 994 603 162 16 1 0 0 0 0 0 0
Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Number of blocks type Unmovable Movable Reclaimable HighAtomic Isolate
Node 0, zone DMA 3 5 0 0 0
Node 0, zone DMA32 29 1405 94 0 0
Node 0, zone Normal 841 29689 702 0 0
The error reported in the logs:
2022-09-23T15:07:52.645-0500 ERROR module/wrapper.go:259 Error fetching data for metricset linux.pageinfo: error reading pagetypeinfo: error parsing zone: : strconv.ParseInt: parsing "": invalid syntax
2022-09-23T15:08:02.645-0500 ERROR module/wrapper.go:259 Error fetching data for metricset linux.pageinfo: error reading pagetypeinfo: error parsing zone: : strconv.ParseInt: parsing "": invalid syntax
2022-09-23T15:08:12.646-0500 ERROR module/wrapper.go:259 Error fetching data for metricset linux.pageinfo: error reading pagetypeinfo: error parsing zone: : strconv.ParseInt: parsing "": invalid syntax
I have the same issue. I have two identical machines with the same integration; one machine shows no error, and the other shows this error. The only noticeable difference is the > sign in pagetypeinfo on the failing machine. I don't know if it's the root cause.
cc @cmacknz
Apologies for the delay here; this got buried under a dozen GitHub notifications. The good news is that this is a fairly easy fix; I assume we'll just want to report the values as 100,000 and document that those values can be over a specified threshold.
Thanks @fearful-symmetry
If we can pull the total stats from buddyinfo when a value is over 100k, that would be great.
> Ideally, the regex for this should be corrected for the pagetypeinfo stats to at least handle the > indicator, and get proper totals from buddyinfo as mentioned by @Infraded.
This would be an issue for users wanting to do accurate math on this data, which seems to me like the purpose of gathering and aggregating it into ES. This is a limitation of pagetypeinfo, so a complete fix would be ~~to switch to~~ pulling full numbers from buddyinfo if possible.
EDIT: Actually, the complete fix would be to include both sets of data, since they represent two distinct categories of data. pagetypeinfo doesn't fully contain the counts of all types, and buddyinfo doesn't contain the breakdown of types for its counts.
> Apologies for the delay here; this got buried under a dozen GitHub notifications. The good news is that this is a fairly easy fix; I assume we'll just want to report the values as 100,000 and document that those values can be over a specified threshold.
I'm interested in this bug and have thought about it for a while now.
In brief: supplementing pagetypeinfo with buddyinfo data adds error into the system. The two cannot be captured fast enough for the numbers of one to be an adequate substitute for the other. Additionally, enough data points can be absent (>100000) in pagetypeinfo that attempts to fill the gaps with data from buddyinfo are not technically possible.
Regarding what to do next with this bug, I think:
- The regex that processes each line needs to match (>?\d*) instead; then, when filling out the nodes[nodeLevel].zones.Normal map, everything that can be strconv'd back into an int is converted back. This will fix strconv.ParseInt: parsing "": invalid syntax (see the sketch after this list).
- We're left with what to do about the >100000 items that couldn't be converted back. You'll see below that, without making numbers up, we cannot actually produce a real number for these data points. I think what @fearful-symmetry said earlier is the correct approach: report it as 100,000 and note in the documentation that this is a kernel limitation. I would go further and say that referencing the buddyinfo data in a separate chart is a great supplement.
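A minimal sketch of that first item, using a hypothetical row pattern and helper (the real metricset code in beats is structured differently): accept an optional > in each count column, then strip it before converting, which clamps >100000 to 100000 as proposed above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Hypothetical row pattern: each count column accepts an optional '>'.
var row = regexp.MustCompile(`type\s+(\w+)((?:\s+>?\d+)+)`)

// parseCount strips a leading '>' so ">100000" is clamped to 100000.
func parseCount(field string) (int64, error) {
	return strconv.ParseInt(strings.TrimPrefix(field, ">"), 10, 64)
}

func main() {
	line := "Node    0, zone   Normal, type      Movable      9 >100000  88777  13664"
	m := row.FindStringSubmatch(line)
	if m == nil {
		return
	}
	fmt.Println("type:", m[1])
	for order, f := range strings.Fields(m[2]) {
		n, _ := parseCount(f) // ">100000" becomes 100000
		fmt.Printf("order %d: %d\n", order, n)
	}
}
```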
The chance of the calls to grab the pagetypeinfo and buddyinfo data being out of sync is basically a promise. On a busy system you can cat those files out in a tight loop and watch the numbers in any column rise and fall quickly by tens of thousands. There's no guarantee that the data in buddyinfo will actually give you a real count for what you're missing in pagetypeinfo.
I would say that on a production system you are fabricating inaccurate numbers by doing this math.
On a system right now (24 CPUs, 2 NUMA nodes, 64 GB memory, load of 20+) I could not read the two data sets quickly enough that the numbers in one set could actually be added up to match the numbers in the other.
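A rough way to observe this drift, sketched in Go (Linux only, and /proc/pagetypeinfo is typically readable by root only): read the two files back to back a few times and compare the snapshots.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// Read both files back to back; on a busy system the counts move by
	// thousands of pages between consecutive samples, so a buddyinfo
	// total can't back-fill a >100000 cell captured an instant earlier.
	for i := 0; i < 3; i++ {
		pt, err := os.ReadFile("/proc/pagetypeinfo")
		if err != nil {
			fmt.Println(err)
			return
		}
		bd, err := os.ReadFile("/proc/buddyinfo")
		if err != nil {
			fmt.Println(err)
			return
		}
		fmt.Printf("--- sample %d ---\n%s%s", i, pt, bd)
		time.Sleep(100 * time.Millisecond)
	}
}
```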
The comment in the kernel code linked in the OP said it best:
* [...] Anyway this is a
* debugging tool so knowing there is a handful
* of pages of this order should be more than
* sufficient.
*/
pagetypeinfo isn't meant to replace a proper heap analysis. Knowing that there are a lot of pages in that order is enough signal.
This is not as likely to happen, but if two fields are >100000 then you won't have enough data to substitute in the correct values.
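For example (with made-up numbers): if buddyinfo reported 250000 free pages at some order while both the Movable and Unmovable rows showed >100000 for that order, any split consistent with the total, whether 100001 + 149999 or 120000 + 130000, fits the data equally well, so the individual counts are unrecoverable.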
So, I have a fix here: https://github.com/elastic/beats/pull/39985
I ended up siding with the comment here: https://github.com/elastic/beats/issues/32026#issuecomment-2070884877
In general, trying to "correct" the >100000 metrics with data from buddyinfo is nearly guaranteed to produce weird data.