
Optimize nhc_hw_gather_data for gathering CPU information on KNL nodes

wilshire461 opened this issue 7 years ago • 8 comments

While the code is perfectly functional in its current state, the function nhc_hw_gather_data can take upwards of 40-60 seconds on multithreaded KNL nodes. It would be nice to optimize this portion of the check so that we aren't forced to wait tens of seconds to grab CPU information.

If it matters, we are using lbnl-nhc-1.4.2-1 and have tried the -f flag, which doesn't appear to help.

[root@sknl0705 ~]# /usr/sbin/nhc -f
ERROR: nhc: Health check failed: Script timed out while executing "check_hw_cpuinfo 1 68 272".

wilshire461 avatar Apr 03 '17 21:04 wilshire461

After thinking about this a bit more, maybe a whole new set of checks could be created instead, specifically targeting KNL, that would also check for cluster mode, memory mode, etc.
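
For instance, such a check might infer both modes from the NUMA topology: flat-mode MCDRAM shows up as CPU-less NUMA nodes, and the SNC-2/SNC-4 cluster modes split the CPUs across 2 or 4 nodes. A rough, untested sketch of that idea (hypothetical code, not part of NHC; check_knl_mode is just a made-up name):

#!/bin/bash
# Hypothetical sketch: infer KNL cluster/memory mode from the NUMA layout.
function check_knl_mode() {
    local NODE CPULIST CPU_NODES=0 MCDRAM_NODES=0
    for NODE in /sys/devices/system/node/node[0-9]* ; do
        CPULIST=$(< "$NODE/cpulist")
        if [[ -n "$CPULIST" ]]; then
            : $((CPU_NODES++))        # NUMA node with CPUs attached
        else
            : $((MCDRAM_NODES++))     # CPU-less node => flat-mode MCDRAM
        fi
    done
    echo "NUMA nodes with CPUs: $CPU_NODES (1 suggests all-to-all/quadrant, 2 SNC-2, 4 SNC-4)"
    echo "CPU-less (MCDRAM) nodes: $MCDRAM_NODES (0 suggests cache mode, >0 flat mode)"
}

check_knl_mode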

wilshire461 avatar Apr 04 '17 01:04 wilshire461

With respect to the original issue, I'm definitely open to suggestions on how to optimize nhc_hw_gather_data(), either on Knights Landing or just in general. I'm not sure what's taking so long, or why, though... so we'd need to figure that out first. Running nhc -x and watching the progress of the bash trace mode is a good way to see where it's spending all its time, if you want to give that a try. I realize /proc/cpuinfo is quite lengthy on the KNLs, but I still have a hard time understanding why it would take that long!
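
For example (just a suggestion, and assuming gawk is available; this isn't part of NHC), since the trace output lands in the NHC log by default, you can timestamp each line as it arrives to see where the time goes:

# In one window, run "nhc -x"; in another, timestamp the trace lines as they
# are appended to the default log file. gawk's strftime() is assumed here.
tail -f /var/log/nhc.log | awk '{ print strftime("%H:%M:%S"), $0 }'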

And for what it's worth, the -f option doesn't mean the checks run in parallel (though that's a feature that's on my NHC roadmap!) but rather that they are initially backgrounded and then waited on by NHC; this alters the way signal delivery is handled within bash but is not an optimization or parallelization technique. :-)
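
Here's a toy illustration of that signal-delivery difference (not NHC's actual code): bash defers a trap while a foreground command is running, but a trap interrupts the wait builtin immediately, so a backgrounded-and-waited-on check sees the watchdog timeout right away:

#!/bin/bash
# Toy example only. 142 = 128 + SIGALRM, the status bash reports when a
# script is terminated by an alarm-based timeout.
trap 'echo "watchdog fired"; exit 142' ALRM

( sleep 2; kill -ALRM $$ ) &    # stand-in for a watchdog timer
sleep 60 &                      # stand-in for a slow check, backgrounded
wait $!                         # ALRM interrupts wait after ~2s and the trap
                                # runs; run "sleep 60" in the foreground
                                # instead and the trap is deferred until it
                                # finishes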

As far as the KNL platform goes, we at LANL are investigating the possibility of using NHC on our Cray systems, including Trinity, which is over half KNLs. This will likely afford me the opportunity to add KNL-specific functionality. In the meantime, however, contributions are certainly welcome as always!

mej avatar Apr 05 '17 02:04 mej

Fairly certain it's the piece of code that grabs information about all the processors, as I mentioned in the issue title. We broke that portion of the code out, and it was routinely taking 40 seconds on a KNL node with 272 virtual cores.

Here is the output of nhc -x on one of the nodes:

[root@sknl0701 ~]# nhc -x
+ dbg 'BASH tracing active.'
+ local PREFIX=
+ [[ '' == \1 ]]
+ getopts :D:ac:de:fhl:n:qt:vx OPTION
+ shift 1
+ [[ ! -z '' ]]
+ return 0
+ nhcmain_load_sysconfig
+ [[ -f /etc/sysconfig/nhc ]]
+ nhcmain_finalize_env
+ CONFDIR=/etc/nhc
+ CONFFILE=/etc/nhc/nhc.conf
+ INCDIR=/etc/nhc/scripts
+ HELPERDIR=/usr/libexec/nhc
+ ONLINE_NODE=/usr/libexec/nhc/node-mark-online
+ OFFLINE_NODE=/usr/libexec/nhc/node-mark-offline
+ LOGFILE='>>/var/log/nhc.log 2>&1'
+ RESULTFILE=/var/run/nhc/nhc.status
+ DEBUG=0
+ TS=0
+ SILENT=0
+ VERBOSE=0
+ MARK_OFFLINE=1
+ DETACHED_MODE=0
+ DETACHED_MODE_FAIL_NODATA=0
+ TIMEOUT=30
+ NHC_CHECK_ALL=0
+ NHC_CHECK_FORKED=0
+ export NHC_SID=0
+ NHC_SID=0
+ kill -s 0 -- -269885
+ [[ 0 -eq 0 ]]
+ dbg 'NHC process 269885 is session leader.'
+ local PREFIX=
+ [[ 0 == \1 ]]
+ NHC_SID=-269885
+ [[ >>/var/log/nhc.log 2>&1 != \>\>\/\v\a\r\/\l\o\g\/\n\h\c\.\l\o\g\ \2\>\&\1 ]]
+ [[ >>/var/log/nhc.log 2>&1 == \- ]]
+ [[ -z '' ]]
+ nhcmain_find_rm
+ local DIR
+ local -a DIRLIST
+ [[ -d /var/spool/torque ]]
+ [[ -n '' ]]
+ type -a -p -f -P pbsnodes
+ type -a -p -f -P scontrol
+ NHC_RM=slurm
+ return 0
+ [[ slurm == \s\g\e ]]
+ [[ 0 -ne 0 ]]
+ [[ -n '' ]]
+ [[ 0 -eq 1 ]]
+ export NAME CONFDIR CONFFILE INCDIR HELPERDIR ONLINE_NODE OFFLINE_NODE LOGFILE DEBUG TS SILENT TIMEOUT NHC_RM
+ [[ -n '' ]]
+ nhcmain_redirect_output
+ [[ -n >>/var/log/nhc.log 2>&1 ]]
+ exec
ERROR:  nhc:  Health check failed:  Script timed out while executing "check_hw_cpuinfo 1 68 272".
+++ [[ 0 == \1 ]]
+++ exit 142

wilshire461 avatar Apr 05 '17 20:04 wilshire461

"I still have a hard time understanding why it would take that long!"

After reading more carefully through the code, I started to wonder whether this behavior was consistent with the time it takes to collect cpuinfo on a normal CPU, and it turns out that on a regular Xeon the check is quite quick!

2x Intel(R) Xeon(R) CPU E7-4830 v3 @ 2.10GHz

[root@smem0101 ~]# ./cpu_count.sh 
01:42:34 Finished Proc 0
...
01:42:34 Finished Proc 27
01:42:35 Finished Proc 28
...
01:42:35 Finished Proc 47

real	0m0.467s
user	0m0.118s
sys	0m0.353s

Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz

[root@sknl0701 ~]# ./cpu_count.sh
01:44:42 Finished Proc 0
...
01:44:42 Finished Proc 21
01:44:43 Finished Proc 22
...
01:44:43 Finished Proc 40
01:44:44 Finished Proc 41
...
01:44:44 Finished Proc 55
01:44:45 Finished Proc 56
...
01:44:45 Finished Proc 68
01:44:46 Finished Proc 69
...
01:44:46 Finished Proc 78
01:44:47 Finished Proc 79
...
01:44:47 Finished Proc 88
01:44:48 Finished Proc 89
...
01:44:48 Finished Proc 97
01:44:49 Finished Proc 98
...
01:44:49 Finished Proc 105
01:44:50 Finished Proc 106
...
01:44:50 Finished Proc 113
01:44:51 Finished Proc 114
...
01:44:51 Finished Proc 120
01:44:52 Finished Proc 121
...
01:44:52 Finished Proc 127
01:44:53 Finished Proc 128
...
01:44:53 Finished Proc 134
01:44:54 Finished Proc 135
...
01:44:54 Finished Proc 140
01:44:55 Finished Proc 141
...
01:44:55 Finished Proc 147
01:44:56 Finished Proc 148
...
01:44:56 Finished Proc 152
01:44:57 Finished Proc 153
...
01:44:57 Finished Proc 158
01:44:58 Finished Proc 159
...
01:44:58 Finished Proc 164
01:44:59 Finished Proc 165
01:44:59 Finished Proc 166
01:44:59 Finished Proc 167
01:44:59 Finished Proc 168
01:44:59 Finished Proc 169
01:45:00 Finished Proc 170
01:45:00 Finished Proc 171
01:45:00 Finished Proc 172
01:45:00 Finished Proc 173
01:45:00 Finished Proc 174
01:45:01 Finished Proc 175
01:45:01 Finished Proc 176
01:45:01 Finished Proc 177
01:45:01 Finished Proc 178
01:45:01 Finished Proc 179
01:45:02 Finished Proc 180
01:45:02 Finished Proc 181
01:45:02 Finished Proc 182
01:45:02 Finished Proc 183
01:45:02 Finished Proc 184
01:45:03 Finished Proc 185
01:45:03 Finished Proc 186
01:45:03 Finished Proc 187
01:45:03 Finished Proc 188
01:45:03 Finished Proc 189
01:45:04 Finished Proc 190
01:45:04 Finished Proc 191
01:45:04 Finished Proc 192
01:45:04 Finished Proc 193
01:45:04 Finished Proc 194
01:45:05 Finished Proc 195
01:45:05 Finished Proc 196
01:45:05 Finished Proc 197
01:45:05 Finished Proc 198
01:45:06 Finished Proc 199
01:45:06 Finished Proc 200
01:45:06 Finished Proc 201
01:45:06 Finished Proc 202
01:45:06 Finished Proc 203
01:45:07 Finished Proc 204
01:45:07 Finished Proc 205
01:45:07 Finished Proc 206
01:45:07 Finished Proc 207
01:45:08 Finished Proc 208
01:45:08 Finished Proc 209
01:45:08 Finished Proc 210
01:45:08 Finished Proc 211
01:45:08 Finished Proc 212
01:45:09 Finished Proc 213
01:45:09 Finished Proc 214
01:45:09 Finished Proc 215
01:45:09 Finished Proc 216
01:45:10 Finished Proc 217
01:45:10 Finished Proc 218
01:45:10 Finished Proc 219
01:45:10 Finished Proc 220
01:45:10 Finished Proc 221
01:45:11 Finished Proc 222
01:45:11 Finished Proc 223
01:45:11 Finished Proc 224
01:45:11 Finished Proc 225
01:45:12 Finished Proc 226
01:45:12 Finished Proc 227
01:45:12 Finished Proc 228
01:45:12 Finished Proc 229
01:45:13 Finished Proc 230
01:45:13 Finished Proc 231
01:45:13 Finished Proc 232
01:45:13 Finished Proc 233
01:45:14 Finished Proc 234
01:45:14 Finished Proc 235
01:45:14 Finished Proc 236
01:45:14 Finished Proc 237
01:45:15 Finished Proc 238
01:45:15 Finished Proc 239
01:45:15 Finished Proc 240
01:45:15 Finished Proc 241
01:45:16 Finished Proc 242
01:45:16 Finished Proc 243
01:45:16 Finished Proc 244
01:45:16 Finished Proc 245
01:45:17 Finished Proc 246
01:45:17 Finished Proc 247
01:45:17 Finished Proc 248
01:45:17 Finished Proc 249
01:45:18 Finished Proc 250
01:45:18 Finished Proc 251
01:45:18 Finished Proc 252
01:45:18 Finished Proc 253
01:45:19 Finished Proc 254
01:45:19 Finished Proc 255
01:45:19 Finished Proc 256
01:45:19 Finished Proc 257
01:45:20 Finished Proc 258
01:45:20 Finished Proc 259
01:45:20 Finished Proc 260
01:45:21 Finished Proc 261
01:45:21 Finished Proc 262
01:45:21 Finished Proc 263
01:45:21 Finished Proc 264
01:45:22 Finished Proc 265
01:45:22 Finished Proc 266
01:45:22 Finished Proc 267
01:45:23 Finished Proc 268
01:45:23 Finished Proc 269
01:45:23 Finished Proc 270
01:45:23 Finished Proc 271

real	0m41.821s
user	0m3.350s
sys	0m37.407s

So it appears to work normally for roughly the first 60 processors before performance drops off.

Here is the code I pulled out and sprinkled with timestamps:

#!/bin/bash
# Extracted from nhc_hw_gather_data() with a per-processor timestamp added;
# bash is required for the arrays, [[ ]], and the function keyword below.

function nhc_hw_gather_data() {
    local LINE CORES SIBLINGS MHZ PROCESSOR PHYS_ID PORT INDEX DEV
    local -a FIELD PHYS_IDS

    # Gather CPU info
    PROCESSOR=-1
    if [[ -e /proc/cpuinfo ]]; then
        while read -a FIELD ; do
            if [[ "${FIELD[0]} ${FIELD[1]}" = "processor :" ]]; then
                : $((PROCESSOR++))
            elif [[ "${FIELD[0]} ${FIELD[1]} ${FIELD[2]}" = "physical id :" ]]; then
                PHYS_ID=${FIELD[3]}
                if [[ -z "${PHYS_IDS[$PHYS_ID]}" ]]; then
                    PHYS_IDS[$PHYS_ID]=1
                    : $((HW_SOCKETS++))
                fi
            elif [[ "${FIELD[0]} ${FIELD[1]}" = "siblings :" ]]; then
                SIBLINGS=${FIELD[2]}
            elif [[ "${FIELD[0]} ${FIELD[1]} ${FIELD[2]}" = "cpu cores :" ]]; then
                CORES=${FIELD[3]}
            elif [[ "${FIELD[0]} ${FIELD[1]} ${FIELD[2]}" = "cpu MHz :" ]]; then
                MHZ="${FIELD[3]}"
                MHZ="${MHZ/%.*}"
                echo "$(/bin/date "+%T") Finished Proc $PROCESSOR"
            fi
        done < /proc/cpuinfo
    fi
}

time nhc_hw_gather_data

wilshire461 avatar Apr 07 '17 07:04 wilshire461

The solution we are currently going with at our site is to dump the first and last CPU entries from /proc/cpuinfo into a file and then read that in, so that the if/while block only has to go through 52 lines of CPU information.
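
Roughly like this (a sketch of the idea, not our exact script; the cache path is arbitrary):

# Keep only the first and last processor stanzas (blank-line-separated
# records) of /proc/cpuinfo, then point the gather loop at the cache file.
awk -v RS='' -v ORS='\n\n' 'NR == 1 { first = $0 } { last = $0 }
    END { print first; print last }' /proc/cpuinfo > /var/tmp/cpuinfo.cache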

A cleaner solution might involve just querying lscpu (though this adds an external dependency 👎 )
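
Something along these lines, for example (untested sketch; the field names assume typical lscpu output):

# Pull the socket/core/thread counts from lscpu instead of walking cpuinfo.
read SOCKETS CORES_PER_SOCKET THREADS < <(
    lscpu | awk '/^Socket\(s\):/          { s = $2 }
                 /^Core\(s\) per socket:/ { c = $4 }
                 /^CPU\(s\):/             { t = $2 }
                 END { print s, c, t }'
)
echo "sockets=$SOCKETS cores=$((SOCKETS * CORES_PER_SOCKET)) threads=$THREADS"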

wilshire461 avatar Apr 11 '17 19:04 wilshire461

On the 64-core KNL, the /proc/cpuinfo file is about 257K, which should only take about 4 seconds to go through if it were a regular file (I tried this with a copy of cpuinfo in /tmp). I think reading the /proc filesystem directly with that access pattern is a very unusual corner case and not optimized.

A couple of [bash] replacements for the "done < /proc/cpuinfo" line:

done < <(cat /proc/cpuinfo)

takes about 4 seconds, and

done < <(/usr/bin/grep "processor\|physical id\|siblings\|cpu cores\|cpu MHz" /proc/cpuinfo)

takes about 1.5 seconds.

I suspect the speedup is from only reading from /proc once.

Edit: added pull request #47

NateCrawford avatar Sep 21 '17 21:09 NateCrawford

Hi,

I just ran into this on an SGI UV300 with 720 threads; the patch I am using is below (sorry, not a git person). It drops an nhc run from over 2 minutes to around 17 seconds for us.

--- lbnl_hw.nhc.orig	2017-10-29 10:06:07.753153308 -0700
+++ lbnl_hw.nhc	2017-10-29 10:08:56.430120437 -0700
@@ -29,6 +29,7 @@
 
     # Gather CPU info
     PROCESSOR=-1
+    CPUINFO=$(</proc/cpuinfo)
     if [[ -e /proc/cpuinfo ]]; then
         while read -a FIELD ; do
             if [[ "${FIELD[0]} ${FIELD[1]}" = "processor :" ]]; then
@@ -47,7 +48,7 @@
                 MHZ="${FIELD[3]}"
                 MHZ="${MHZ/%.*}"
             fi
-        done < /proc/cpuinfo
+        done < <(echo "$CPUINFO")
     fi
     if [[ $PROCESSOR -ge 0 && $HW_SOCKETS -eq 0 ]]; then
         HW_SOCKETS=$((PROCESSOR+1))

griznog avatar Oct 29 '17 17:10 griznog

I see this on moderately busy dual-socket EPYC nodes with 48 cores/socket, no hyperthreading. I can demonstrate it with:

time while read -a LINE; do : ; done </proc/cpuinfo

real	0m44.246s
user	0m0.429s
sys	0m41.745s

compared to:

time while read -a LINE; do : ; done < <(/usr/bin/cat /proc/cpuinfo)

real	0m1.374s
user	0m0.474s
sys	0m0.949s

There was a thread long ago about /proc/cpuinfo being slow to read line by line, spending a lot of time in the kernel, which the "sys" figures in the two cases above seem to confirm. This is on RHEL 7 / CentOS 7.

azric avatar Sep 16 '21 14:09 azric

Based on testing and feedback, #121 has addressed this issue sufficiently to warrant its closure; however, if your own testing or deployment experience(s) differ, please do reopen this one, or a new one, at your discretion! 😃

mej avatar Apr 10 '23 21:04 mej