nhc running on KNL node

Open jmcdonal opened this issue 7 years ago • 2 comments

Hi,

I'm running into an issue with nhc on a Knights Landing (KNL) node. The hardware is an HPE XL260.

uname -a

Linux cn3102 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/redhat-release

CentOS Linux release 7.3.1611 (Core)

The node has 256 GB of memory, but NHC's check_hw_mem_free check is timing out:

nhc -d -v

DEBUG: Debugging activated via -d option.
DEBUG: Verbose mode activated via -v option.
[0] - DEBUG: NHC process 74923 is session leader.
[1502985678] - ERROR: nhc: Health check failed: Script timed out while executing "check_hw_mem_free 500mb".


cat /proc/meminfo

MemTotal: 280321744 kB
MemFree: 273449596 kB
MemAvailable: 274450436 kB
Buffers: 2224 kB
Cached: 2446640 kB
SwapCached: 28 kB
Active: 1170856 kB
Inactive: 1600764 kB
Active(anon): 482984 kB
Inactive(anon): 152932 kB
Active(file): 687872 kB
Inactive(file): 1447832 kB
Unevictable: 111348 kB
Mlocked: 111348 kB
SwapTotal: 4194300 kB
SwapFree: 4194272 kB
Dirty: 8 kB
Writeback: 0 kB
AnonPages: 434260 kB
Mapped: 104612 kB
Shmem: 304576 kB
Slab: 879420 kB
SReclaimable: 261204 kB
SUnreclaim: 618216 kB
KernelStack: 46880 kB
PageTables: 4808 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 144355172 kB
Committed_AS: 984060 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 1295136 kB
VmallocChunk: 34358355964 kB
HardwareCorrupted: 0 kB
AnonHugePages: 241664 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 707068 kB
DirectMap2M: 10692608 kB
DirectMap1G: 275775488 kB

Is there a fix for this, or a way I can debug it?

Thanks, Jeff

jmcdonal, Aug 17 '17 16:08

check_hw_mem_free calls nhc_hw_gather_data, which scans through /proc/cpuinfo. nhc_hw_gather_data uses the bash builtin read, which calls lseek with a negative offset on its file descriptor after each newline it finds, repositioning the file to just after that newline. Because /proc/cpuinfo is a seq_file and not a regular file, each lseek requires a complete rebuild of the cpuinfo contents up to the desired offset. For short cpuinfo files this doesn't really matter, but our KNL nodes have 272 processors, resulting in just over 7k lines. As a result, the cpuinfo scan takes about 40 seconds on our KNL nodes, so any check that relies on nhc_hw_gather_data fails under a 30 second timeout.
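One way to check whether a node is affected is to time a line-by-line scan of /proc/cpuinfo and compare it with the same scan over a plain-file copy. The sketch below is an illustration only: count_lines_with_read and time_scan are hypothetical helper names, not NHC functions, and the timing has one-second resolution.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; these are not part of NHC.

# Count lines with the `read` builtin, one line at a time, which is the
# access pattern described above.
count_lines_with_read() {
    local file="$1" line n=0
    # The `|| [ -n "$line" ]` part also counts a final line that has no
    # trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
    done < "$file"
    echo "$n"
}

# Print how many whole seconds a line-by-line scan of a file takes.
time_scan() {
    local file="$1" start end lines
    start=$(date +%s)
    lines=$(count_lines_with_read "$file")
    end=$(date +%s)
    echo "$file: $lines lines in $((end - start))s"
}
```

Running time_scan /proc/cpuinfo, and then time_scan on a copy made with cat /proc/cpuinfo > /tmp/cpuinfo.copy, should show whether the direct scan is the slow part. The actual numbers depend on the core count and the kernel.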

I was able to fix this by modifying nhc_hw_gather_data to copy /proc/cpuinfo to a temporary file and then scan that, reducing the check time to around 3 seconds.
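A minimal sketch of that approach follows. It is not the actual change to nhc_hw_gather_data, which is not shown in this thread; count_processors_from_snapshot is a hypothetical function that only counts processor entries, to show the copy-then-scan pattern.

```shell
#!/bin/bash
# Hypothetical sketch; NOT the actual nhc_hw_gather_data code or patch.

# Count "processor" entries by scanning a plain-file snapshot of a
# cpuinfo-style file instead of reading the seq_file line by line.
count_processors_from_snapshot() {
    local src="${1:-/proc/cpuinfo}" snapshot line count=0
    snapshot=$(mktemp) || return 1
    # One sequential pass over the source; every later read and seek
    # operates on the regular-file copy.
    cat "$src" > "$snapshot" || { rm -f "$snapshot"; return 1; }
    while IFS= read -r line; do
        case "$line" in
            processor*:*) count=$((count + 1)) ;;
        esac
    done < "$snapshot"
    rm -f "$snapshot"
    echo "$count"
}
```

Called with no argument it reads /proc/cpuinfo; passing a path lets the same function be exercised against a saved cpuinfo file from another node.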

Thanks, Matt Mix

mattmix, Aug 24 '17 16:08

See also issue #30 and pull request #47

NateCrawford, Sep 21 '17 21:09

Based on testing and feedback, #121 has addressed this issue sufficiently to warrant its closure; however, if your own testing or deployment experience differs, please do reopen this one, or open a new one, at your discretion! 😃

mej, Apr 10 '23 21:04