cpufeature: Floating point exception when calling cpufeature.print_features(..) in RHEL 7.9

Hi,

The following command raises a floating point exception on RHEL 7.9 (x86-64, Intel Xeon):

python3 -c "import cpufeature; cpufeature.print_features()"

I noticed from https://github.com/robbmcleod/cpufeature/issues/11 that you added support for running under valgrind to catch this kind of issue and were looking for feedback, so I thought I would share what I have found so far.

Using 'run_valgrind.sh' I didn't see any information about the core dump, and it printed the CPU features:

[buildbot@pdx-rheld7-lv01 stu]$ sh run_valgrind.bash
=== CPU FEATURES ===
VendorId : GenuineIntel
num_virtual_cores : 4
num_physical_cores : 4
num_threads_per_core : 1
num_cpus : 0
cache_line_size : 64
cache_L1_size : 32768
cache_L2_size : 262144
cache_L3_size : 6291456
OS_x64 : True
OS_AVX : True
OS_AVX512 : False
MMX : True
x64 : True
ABM : False
RDRAND : False
BMI1 : False
BMI2 : False
ADX : False
PREFETCHWT1 : False
MPX : False
SSE : True
SSE2 : True
SSE3 : True
SSSE3 : True
SSE4.1 : True
SSE4.2 : True
SSE4.a : False
AES : True
SHA : False
AVX : True
XOP : False
FMA3 : False
FMA4 : False
AVX2 : False
AVX512f : False
AVX512pf : False
AVX512er : False
AVX512cd : False
AVX512vl : False
AVX512bw : False
AVX512dq : False
AVX512ifma : False
AVX512vbmi : False
AVX512vbmi2 : False
AVX512vnni : False

I then tried running under gdb, which gave the following:

[buildbot@pdx-rheld7-lv01 stu]$ ./gdb python3
'import sitecustomize' failed; use -v for traceback
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /nfs/software/lib/Linux-x86_64/python-3.8.10/bin/python3.8...done.
(gdb) run
Starting program: /software/lib/Linux-x86_64/python-3.8.10/bin/python3
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Detaching after fork from child process 3060]
Python 3.8.10 (default, Jun 2 2021, 17:11:25)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cpufeature
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /home/buildbot/build-099/internal/lib/python3.8/site-packages/cpufeature/extension.cpython-38-x86_64-linux-gnu.so]

Program received signal SIGFPE, Arithmetic exception.
0x00007fffeef2e35a in detect_cores () from /home/buildbot/build-099/internal/lib/python3.8/site-packages/cpufeature/extension.cpython-38-x86_64-linux-gnu.so
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 glibc-2.17-326.el7_9.x86_64 libuuid-2.23.2-65.el7_9.1.x86_64 nss-softokn-freebl-3.79.0-4.el7_9.x86_64 xz-libs-5.2.2-2.el7_9.x86_64 zlib-1.2.7-21.el7_9.x86_64
(gdb) quit

Let me know what information you need to debug the issue.
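For anyone reproducing this from a script or a CI job: a SIGFPE inside the extension module kills the whole interpreter, so the failure cannot be caught with try/except. A minimal sketch, assuming a POSIX system, is to run the reproducer in a child process and inspect how it exited. The helper name `run_isolated` is hypothetical and not part of cpufeature.

```python
import signal
import subprocess
import sys


def run_isolated(code):
    """Run `code` in a child interpreter.

    Returns (returncode, signal_name), where signal_name is None unless
    the child was terminated by a signal.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code means "killed by that signal".
        return proc.returncode, signal.Signals(-proc.returncode).name
    return proc.returncode, None


if __name__ == "__main__":
    # The reproducer from this report. If cpufeature is not installed,
    # this simply shows a non-zero return code and no signal name.
    print(run_isolated("import cpufeature; cpufeature.print_features()"))
```

On an affected machine the second element of the result would be expected to read `SIGFPE`; the parent process keeps running either way.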

Jul 11 '23 15:07 kartlee

Sounds like some sort of ABI mismatch. Which wheel are you using? We have both manylinux and musllinux wheels:

https://pypi.org/project/cpufeature/#files

Please try downloading the source and compiling it yourself to see if you have the same problem.

Jul 11 '23 15:07 robbmcleod

Hi Rob,

The missing debug information seems to have been caused by the gdb version. I have now tried gdb 13.1 and get proper output with a line number:

[buildbot@pdx-rheld7-lv01 stu]$ gdb python3
GNU gdb (GDB) 13.1
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
https://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...
(gdb) run
Starting program: /nfs/software/lib/Linux-x86_64/python-3.8.10/bin/python3
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Detaching after fork from child process 8140]
Python 3.8.10 (default, Jun 2 2021, 17:11:25)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cpufeature

Program received signal SIGFPE, Arithmetic exception.
0x00007fffef67835a in detect_cores () at cpufeature/cpu_x86.c:119
119     cpufeature/cpu_x86.c: No such file or directory.

FYI, the version of cpufeature I am using is 0.2.0.

-Karthik

Jul 11 '23 16:07 kartlee

Can you try 0.2.2 and give me the line number? I don't really have the bandwidth to support older versions.

Jul 11 '23 17:07 robbmcleod

Here is the result from 0.2.2:

[buildbot@pdx-rheld7-lv01 stu]$ pip show cpufeature
Name: cpufeature
Version: 0.2.2
Summary: Python CPU Feature Detection
Home-page: http://github.com/robbmcleod/cpufeature
Author: Robert A. McLeod
Author-email: [email protected]
License: https://creativecommons.org/publicdomain/zero/1.0/legalcode
Location: /scr/buildbot/stu/tmp
Requires:
Required-by:
[buildbot@pdx-rheld7-lv01 stu]$ gdb python3
GNU gdb (GDB) 13.1
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...
(gdb) run
Starting program: /nfs/software/lib/Linux-x86_64/python-3.8.10/bin/python3
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Python 3.8.10 (default, Jun  2 2021, 17:11:25)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cpufeature

Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7fed35a in detect_cores () at cpufeature/cpu_x86.c:119
119     cpufeature/cpu_x86.c: No such file or directory.

Jul 11 '23 17:07 kartlee

Perhaps it's a division-by-zero error. You could include an output statement to check if procPerCore is zero.

        procPerCore = info1[1] & 0xFFFF;
        logicalProc = info2[1] & 0xFFFF;
        printf("procPerCore: %d, logiacalProc: %d\n", procPerCore, logicalProc);
        physicalProc = logicalProc / procPerCore;

Or use the debugger to inspect the values. Since the Xeon is a bit old (it apparently reports the number of CPUs as zero as well), I suspect procPerCore is zero. I can put in a ternary expression to force it to one if it's <= 0.

Edit: Actually, logicalProc may also be zero, and 0/0 is also an error. That matters as well for the number-of-CPUs calculation on line 152. It would be weird, though, for the CPU not to report a logical processor count.
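The guard described above can be modelled outside the extension. The following is a minimal Python sketch of the same arithmetic, not the actual C fix: the function name `physical_cores` is hypothetical, the masks mirror the snippet above, and the "force to one if <= 0" rule is the one proposed in this comment. In C on x86, the unguarded integer division traps with SIGFPE; the Python equivalent would raise ZeroDivisionError.

```python
def physical_cores(ebx_level0, ebx_level1):
    """Model of the division in detect_cores(), with a zero guard.

    ebx_level0 and ebx_level1 stand in for info1[1] and info2[1].
    """
    proc_per_core = ebx_level0 & 0xFFFF  # mirrors procPerCore
    logical_proc = ebx_level1 & 0xFFFF   # mirrors logicalProc
    if proc_per_core <= 0:
        # Assumed guard: treat a missing value as one thread per core.
        proc_per_core = 1
    return logical_proc // proc_per_core


print(physical_cores(0, 0))  # 0: nonsense input, but no crash
print(physical_cores(2, 8))  # 4
```

With both inputs zero the result is still not meaningful, so a caller would need to treat a zero core count as "unknown" rather than as a real value.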

Jul 11 '23 18:07 robbmcleod

Hey Rob,

I enabled the printf at https://github.com/robbmcleod/cpufeature/blob/master/cpufeature/cpu_x86.c#L114, added the one you asked for, and got the following:

x0B,0x00: Processors: 198384, 133120, -1629998589, 529267711
x0B,0x01: Cores:      0, 0, 0, 0
procPerCore: 0, logiacalProc: 0

I have also attached the lscpu output for your records:

[buildbot@pdx-rheld7-lv01 stu]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) Gold 6258R CPU @ 2.70GHz
Stepping:              0
CPU MHz:               2693.671
BogoMIPS:              5387.34
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              39424K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc arch_perfmon nopl tsc_reliable nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx hypervisor lahf_lm rsb_ctxsw arat

Jul 11 '23 19:07 kartlee

That's strange. Your processor is from 2020, so it should work fine (I thought it was much older). I may have to dig into the Intel manual again to see if there are any changes, but I honestly can't see Intel breaking backward compatibility. Maybe they did, though, and we have to provide a different hex ID to the calls:

        cpuid(info1, 0x0B, 0x00);
        cpuid(info2, 0x0B, 0x01);

lscpu seems to be reporting some wrong values too. The datasheet says you have 28 cores (56 virtual), is that what you expect?

https://ark.intel.com/content/www/us/en/ark/products/199350/intel-xeon-gold-6258r-processor-38-5m-cache-2-70-ghz.html

Is there any virtualisation or containerization happening here in the environment you are running Python from?

I just pushed a change that protects against the divide-by-zero error. You can build that version and check the output to see whether any other parts of the CPU spec are erroneous.

Jul 11 '23 20:07 robbmcleod

The 'Hypervisor vendor' for the machine where the exception happens shows as 'VMware', so I assume it is a virtual machine. We have another machine with the same processor type and VMware virtualization, and cpufeature works fine there. Here is the lscpu output of the machine that works:

[rajagopa@pdx-rhel8-lv01 ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           2
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) Gold 6258R CPU @ 2.70GHz
Stepping:            0
CPU MHz:             2693.671
BogoMIPS:            5387.34
Hypervisor vendor:   VMware
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            39424K
NUMA node0 CPU(s):   0,1
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid xsaveopt arat md_clear flush_l1d arch_capabilities

I will try the fix, and provide you feedback soon.
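The two `Flags:` lines can also be compared mechanically. The sketch below uses the flag lists exactly as pasted in the two lscpu outputs in this thread; the variable names are only illustrative. Among the flags present only on the working machine are `xtopology` and `x2apic`. On Linux, `xtopology` indicates that the kernel found the CPUID extended topology leaf (0x0B) usable, so its absence on the failing VM would be consistent with the all-zero values printed earlier. This is an inference rather than a confirmed cause, and the two machines appear to run different kernels (judging by the host names), which accounts for some of the other differences.

```python
# Flags from the machine where the exception happens.
FAILING_FLAGS = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 "
    "clflush mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc "
    "arch_perfmon nopl tsc_reliable nonstop_tsc eagerfpu pni pclmulqdq ssse3 "
    "fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx hypervisor lahf_lm "
    "rsb_ctxsw arat"
)

# Flags from the machine where cpufeature works.
WORKING_FLAGS = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 "
    "clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc "
    "arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq "
    "ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer "
    "aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single pti ssbd "
    "ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid xsaveopt "
    "arat md_clear flush_l1d arch_capabilities"
)

failing = set(FAILING_FLAGS.split())
working = set(WORKING_FLAGS.split())

only_working = working - failing
only_failing = failing - working

print("only on working machine:", sorted(only_working))
print("only on failing machine:", sorted(only_failing))
```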

Jul 12 '23 17:07 kartlee

I've found it's impossible to know in advance if cpuid is implemented properly by a virtual machine or not. If the VM is returning nonsense on these calls, then cpufeature is going to return nonsense as well. I think I will put a disclaimer in the documentation.
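Since the guest's CPUID results cannot be trusted up front, one option for callers is to sanity-check the reported counts after the fact. The following is a minimal sketch under stated assumptions: `suspicious_counts` is a hypothetical helper, the key names are taken from the feature listing printed earlier in this thread, and passing the package's own feature dictionary to it is assumed rather than verified here.

```python
COUNT_KEYS = (
    "num_virtual_cores",
    "num_physical_cores",
    "num_threads_per_core",
    "num_cpus",
)


def suspicious_counts(features):
    """Return the count fields that are present but not positive."""
    return [
        key for key in COUNT_KEYS
        if key in features and features[key] <= 0
    ]


# Values as printed by the valgrind run earlier in this thread.
reported = {
    "num_virtual_cores": 4,
    "num_physical_cores": 4,
    "num_threads_per_core": 1,
    "num_cpus": 0,
}

print(suspicious_counts(reported))  # ['num_cpus']
```

A non-empty result does not say which value is correct; it only flags that the topology data from this environment should not be relied on.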

Jul 14 '23 18:07 robbmcleod

I still see the crash with the fix checked in.

Program received signal SIGFPE, Arithmetic exception.
detect_cores () at cpufeature/cpu_x86.c:155
155     cpufeature/cpu_x86.c: No such file or directory.

You might want to protect this line too: https://github.com/robbmcleod/cpufeature/blob/master/cpufeature/cpu_x86.c#L155

Thanks for adding the warning about the virtualization issue.

-Karthik

Jul 18 '23 15:07 kartlee

I have now fixed the other potential div/0 errors. I thought I had already fixed them; I probably forgot to save my changes before committing.

Jul 22 '23 16:07 robbmcleod
