Error fetching data for metricset linux.pageinfo: error reading pagetypeinfo
Please post all questions and issues on https://discuss.elastic.co/c/beats before opening a GitHub issue. Your questions will reach a wider audience there, and if we confirm that there is a bug, then you can open a new issue.
For security vulnerabilities please only send reports to [email protected]. See https://www.elastic.co/community/security for more information.
For confirmed bugs, please report:
- Version: metricbeat-7.17.4-1.x86_64
- Operating System: CentOS 7
- Discuss Forum URL: There was an attempt, and it was also reported here, but the topic has since been auto-closed.
- Steps to Reproduce: Enable linux.pageinfo metricset
---
- module: linux
  period: 10s
  metricsets:
    - pageinfo
Error:
May 17 15:38:58 example.host metricbeat[12345]: 2022-05-17T15:38:58.783-0700 ERROR module/wrapper.go:259 Error fetching data for metricset linux.pageinfo: error reading pagetypeinfo: error parsing zone: : strconv.ParseInt: parsing "": invalid syntax
This appears to be due to the way pagetypeinfo displays counts over 100000, combined with the module parsing each column as a plain integer, as seen roughly here in the regex expression and the subsequent integer parsing.
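For illustration, here is a minimal sketch of the failure mode, assuming the column is captured with a digits-only pattern (the pattern below is hypothetical, not the metricset's actual regex):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

func main() {
	// A digits-only capture, similar in spirit to what the metricset expects.
	digits := regexp.MustCompile(`(\d*)`)

	// On a fragmented system, one pagetypeinfo column reads ">100000".
	field := ">100000"

	// \d* happily matches the empty string in front of '>', so the
	// capture group comes back as "" and ParseInt fails exactly as logged.
	m := digits.FindStringSubmatch(field)
	if _, err := strconv.ParseInt(m[1], 10, 64); err != nil {
		fmt.Println(err) // strconv.ParseInt: parsing "": invalid syntax
	}
}
```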
Example:
>100000 is reported by pagetypeinfo, but is more accurately represented by the total count of 213501, as seen below in buddyinfo.
$ cat /proc/pagetypeinfo
Page block order: 9
Pages per block: 512
Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
Node 0, zone DMA, type Unmovable 1 0 0 0 2 1 1 0 1 0 0
Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 1 3
Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type CMA 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type Unmovable 45 490 2125 1535 908 429 139 10 0 0 0
Node 0, zone DMA32, type Reclaimable 554 760 1801 1748 1184 671 239 17 3 0 0
Node 0, zone DMA32, type Movable 206 33222 18654 4021 946 186 80 34 6 0 0
Node 0, zone DMA32, type Reserve 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type CMA 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Unmovable 1687 31115 18021 11061 3504 614 31 1 0 0 0
Node 0, zone Normal, type Reclaimable 15289 16293 13341 10028 4203 909 71 13 3 0 0
Node 0, zone Normal, type Movable 9 >100000 88777 13664 2611 818 240 14 5 0 0
Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type CMA 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Number of blocks type Unmovable Reclaimable Movable Reserve CMA Isolate
Node 0, zone DMA 1 0 7 0 0 0
Node 0, zone DMA32 165 190 1173 0 0 0
Node 0, zone Normal 1093 724 4839 0 0 0
$ cat /proc/buddyinfo
Node 0, zone DMA 1 0 0 0 2 1 1 0 1 1 3
Node 0, zone DMA32 668 34432 22579 7304 3038 1286 458 61 9 0 0
Node 0, zone Normal 17254 213501 120140 34753 10318 2341 342
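(Working through the order-1 column in zone Normal: buddyinfo reports 213501 free pages in total, and pagetypeinfo accounts for 31115 Unmovable plus 16293 Reclaimable, so the truncated Movable count would be roughly 213501 - 31115 - 16293 = 166093, assuming the two snapshots were taken close enough together to be comparable.)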
I don't have an easy way to reproduce the memory fragmentation on a system, but once system memory becomes fragmented, the kernel code roughly here truncates the pagetypeinfo counts over 100k and just displays them as >100000, which causes the parsing error mentioned above.
Ideally, the regex for this should be corrected for the pagetypeinfo stats to at least handle the > indicator, and get proper totals from buddyinfo as mentioned by @Infraded.
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Same error from our setup:
# cat /proc/pagetypeinfo
Page block order: 9
Pages per block: 512
Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 0 0 0
Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 1 2
Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type Unmovable 475 434 197 179 68 32 16 12 10 0 0
Node 0, zone DMA32, type Movable 12157 6536 1695 225 79 49 379 88 28 1 0
Node 0, zone DMA32, type Reclaimable 191 92 537 141 12 1 1 1 1 0 0
Node 0, zone DMA32, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Unmovable 66 160 177 31 10 10 7 1 0 0 0
Node 0, zone Normal, type Movable 10 >100000 70473 3338 115 109 14 60 55 1 0
Node 0, zone Normal, type Reclaimable 994 603 162 16 1 0 0 0 0 0 0
Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Number of blocks type Unmovable Movable Reclaimable HighAtomic Isolate
Node 0, zone DMA 3 5 0 0 0
Node 0, zone DMA32 29 1405 94 0 0
Node 0, zone Normal 841 29689 702 0 0
The error reported in the logs:
2022-09-23T15:07:52.645-0500 ERROR module/wrapper.go:259 Error fetching data for metricset linux.pageinfo: error reading pagetypeinfo: error parsing zone: : strconv.ParseInt: parsing "": invalid syntax
2022-09-23T15:08:02.645-0500 ERROR module/wrapper.go:259 Error fetching data for metricset linux.pageinfo: error reading pagetypeinfo: error parsing zone: : strconv.ParseInt: parsing "": invalid syntax
2022-09-23T15:08:12.646-0500 ERROR module/wrapper.go:259 Error fetching data for metricset linux.pageinfo: error reading pagetypeinfo: error parsing zone: : strconv.ParseInt: parsing "": invalid syntax
I have the same issue. I have two identical machines with the same integration; one machine shows no error, and the other shows this error. The only noticeable difference is the > sign in pagetypeinfo on the failing machine. I don't know if it's the root cause.
cc @cmacknz
Apologies for the delay here; this got buried under a dozen GitHub notifications. The good news is that this is a fairly easy fix; I assume we'll just want to report the values as 100,000 and document that those values can be over a specified threshold.
Thanks @fearful-symmetry
If we can pull the total stats from buddyinfo when a value is over 100k, that would be great.
> Ideally, the regex for this should be corrected for the pagetypeinfo stats to at least handle the > indicator, and get proper totals from buddyinfo as mentioned by @Infraded.
This would be an issue for users wanting to do accurate math on this data, which seems to me like the purpose of gathering and aggregating it into ES. This is a limitation of pagetypeinfo, so a complete fix would be ~~to switch to~~ pulling full numbers from buddyinfo if possible.
EDIT: Actually, the complete fix would be to include both sets of data, since they represent two distinct categories of data. pagetypeinfo doesn't fully contain the counts of all types, and buddyinfo doesn't contain the breakdown of types for its counts.
> Apologies for the delay here; this got buried under a dozen GitHub notifications. The good news is that this is a fairly easy fix; I assume we'll just want to report the values as 100,000 and document that those values can be over a specified threshold.
I'm interested in this bug and have thought about it for a while now.
In brief: supplementing pagetypeinfo with buddyinfo data adds error into the system. The two cannot be captured fast enough for the numbers of one to be an adequate substitute for the other. Additionally, enough data points can be absent (>100000) in pagetypeinfo that attempts to fill the gaps with data from buddyinfo are not technically possible.
Regarding what to do next with this bug, I think:
- The regex that processes each line needs to match (>?\d*) instead; then, when filling out the nodes[nodeLevel].zones.Normal map, everything that can be strconv'd back into an int is converted back. This will fix strconv.ParseInt: parsing "": invalid syntax (see the sketch after this list).
- We're left with what to do about the >100000 items that couldn't be converted back. You'll see below that, without making numbers up, we cannot actually produce a real number for these data points. I think what @fearful-symmetry said earlier is the correct approach: report it as 100,000 and note in the documentation that this is a kernel limitation. I would go further and say that referencing the buddyinfo data in a separate chart is a great supplement.
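A minimal sketch of that first item, using a hypothetical row pattern and helper (the real metricset code in beats is structured differently): accept an optional > in each count column, then strip it before converting, which clamps >100000 to 100000 as proposed above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Hypothetical row pattern: each count column accepts an optional '>'.
var row = regexp.MustCompile(`type\s+(\w+)((?:\s+>?\d+)+)`)

// parseCount strips a leading '>' so ">100000" is clamped to 100000.
func parseCount(field string) (int64, error) {
	return strconv.ParseInt(strings.TrimPrefix(field, ">"), 10, 64)
}

func main() {
	line := "Node    0, zone   Normal, type      Movable      9 >100000  88777  13664"
	m := row.FindStringSubmatch(line)
	if m == nil {
		return
	}
	fmt.Println("type:", m[1])
	for order, f := range strings.Fields(m[2]) {
		n, _ := parseCount(f) // ">100000" becomes 100000
		fmt.Printf("order %d: %d\n", order, n)
	}
}
```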
The chance of the calls to grab the pagetypeinfo and buddyinfo data being out of sync is basically a promise. On a busy system you can cat those files out in a tight loop and watch the numbers in any column rise and fall quickly by tens of thousands. There's no guarantee that the data in buddyinfo will actually give you a real count for what you're missing in pagetypeinfo.
I would say that on a production system you are fabricating inaccurate numbers by doing this math.
On a system right now (24 CPUs, 2 NUMA nodes, 64 GB memory, load of 20+) I could not read the two data sets quickly enough that the numbers in one set could actually be added up to match the numbers in the other.
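A rough way to observe this drift, sketched in Go (Linux only, and /proc/pagetypeinfo is typically readable by root only): read the two files back to back a few times and compare the snapshots.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// Read both files back to back; on a busy system the counts move by
	// thousands of pages between consecutive samples, so a buddyinfo
	// total can't back-fill a >100000 cell captured an instant earlier.
	for i := 0; i < 3; i++ {
		pt, err := os.ReadFile("/proc/pagetypeinfo")
		if err != nil {
			fmt.Println(err)
			return
		}
		bd, err := os.ReadFile("/proc/buddyinfo")
		if err != nil {
			fmt.Println(err)
			return
		}
		fmt.Printf("--- sample %d ---\n%s%s", i, pt, bd)
		time.Sleep(100 * time.Millisecond)
	}
}
```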
The comment in the kernel code linked in the OP said it best:
* [...] Anyway this is a
* debugging tool so knowing there is a handful
* of pages of this order should be more than
* sufficient.
*/
pagetypeinfo isn't meant to replace a proper heap analysis. Knowing that there are a lot of pages in that order is enough signal.
This is not as likely to happen, but if two fields are >100000 then you won't have enough data to substitute in the correct values.
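For example (with made-up numbers): if buddyinfo reported 250000 free pages at some order while both the Movable and Unmovable rows showed >100000 for that order, any split consistent with the total, whether 100001 + 149999 or 120000 + 130000, fits the data equally well, so the individual counts are unrecoverable.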
So, I have a fix here: https://github.com/elastic/beats/pull/39985
I ended up siding with the comment here: https://github.com/elastic/beats/issues/32026#issuecomment-2070884877
In general, trying to "correct" the >100000 metrics with data from buddyinfo is nearly guaranteed to produce weird data.