Corrfunc
Corrfunc copied to clipboard
Build failure - related to avx512?
General information
- Corrfunc version: 2.3.3 or master
- platform: Linux
- installation method (pip/source/other?): pip or make
Issue description
Compile error
Expected behavior
Actual behavior
What have you tried so far?
$ git clone https://github.com/manodeep/Corrfunc/
Cloning into 'Corrfunc'...
[etc]
$ cd Corrfunc
$ make
mkdir -p lib
mkdir -p bin
mkdir -p include
make -C theory
make[1]: Entering directory '/tmp/Corrfunc/theory'
-------COMPILE SETTINGS------------
MAKE = ["make"]
CC = ["gcc"]
OPT = ["-DPERIODIC -DENABLE_MIN_SEP_OPT -DCOPY_PARTICLES -DUSE_OMP"]
CFLAGS = [" -DVERSION=\"2.3.3\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3 -ftree-vectorize -funroll-loops -fprefetch-loop-arrays --param simultaneous-prefetches=4 -fopenmp -funroll-loops -march=native -fno-strict-aliasing -Wformat=2 -Wpacked -Wnested-externs -Wpointer-arith -Wredundant-decls -Wfloat-equal -Wcast-qual -Wcast-align -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wstrict-prototypes -Wno-unused-local-typedefs "]
CLINK = [" -lrt -fopenmp -lm"]
PYTHON = ["python"]
GSL_CFLAGS = ["-I/cm/shared/apps/gsl/gsl-2.5/include"]
GSL_LINK = ["-L/cm/shared/apps/gsl/gsl-2.5/lib -lgsl -lgslcblas -lm -Xlinker -rpath -Xlinker /cm/shared/apps/gsl/gsl-2.5/lib"]
PYTHON_CFLAGS = ["-isystem/cm/shared/apps/conda-environments/python36/include/python3.6m -isystem /cm/shared/apps/conda-environments/python36/lib/python3.6/site-packages/numpy/core/include/numpy/"]
-------END OF COMPILE SETTINGS------------
make -C DD
make[2]: Entering directory '/tmp/Corrfunc/theory/DD'
gcc -DPERIODIC -DENABLE_MIN_SEP_OPT -DCOPY_PARTICLES -DUSE_OMP -DVERSION=\"2.3.3\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3 -ftree-vectorize -funroll-loops -fprefetch-loop-arrays --param simultaneous-prefetches=4 -fopenmp -funroll-loops -march=native -fno-strict-aliasing -Wformat=2 -Wpacked -Wnested-externs -Wpointer-arith -Wredundant-decls -Wfloat-equal -Wcast-qual -Wcast-align -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wstrict-prototypes -Wno-unused-local-typedefs -I../../io -I../../utils -c DD.c -o DD.o
gcc -DVERSION=\"2.3.3\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3 -ftree-vectorize -funroll-loops -fprefetch-loop-arrays --param simultaneous-prefetches=4 -fopenmp -funroll-loops -march=native -fno-strict-aliasing -Wformat=2 -Wpacked -Wnested-externs -Wpointer-arith -Wredundant-decls -Wfloat-equal -Wcast-qual -Wcast-align -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wstrict-prototypes -Wno-unused-local-typedefs -I../../io -I../../utils -c ../../io/ftread.c -o ../../io/ftread.o
gcc -DVERSION=\"2.3.3\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3 -ftree-vectorize -funroll-loops -fprefetch-loop-arrays --param simultaneous-prefetches=4 -fopenmp -funroll-loops -march=native -fno-strict-aliasing -Wformat=2 -Wpacked -Wnested-externs -Wpointer-arith -Wredundant-decls -Wfloat-equal -Wcast-qual -Wcast-align -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wstrict-prototypes -Wno-unused-local-typedefs -I../../io -I../../utils -c ../../io/io.c -o ../../io/io.o
sed -e "/DOUBLE_PREC/!s/DOUBLE/double/g" ../../utils/weight_defs.h.src >> ../../utils/weight_defs_double.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/double/g" countpairs_impl.h.src >> countpairs_impl_double.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/float/g" ../../utils/weight_defs.h.src >> ../../utils/weight_defs_float.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/float/g" countpairs_impl.h.src >> countpairs_impl_float.h
gcc -DVERSION=\"2.3.3\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3 -ftree-vectorize -funroll-loops -fprefetch-loop-arrays --param simultaneous-prefetches=4 -fopenmp -funroll-loops -march=native -fno-strict-aliasing -Wformat=2 -Wpacked -Wnested-externs -Wpointer-arith -Wredundant-decls -Wfloat-equal -Wcast-qual -Wcast-align -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wstrict-prototypes -Wno-unused-local-typedefs -I../../io -I../../utils -c countpairs.c -o countpairs.o
sed -e "/DOUBLE_PREC/!s/DOUBLE/float/g" countpairs_kernels.c.src >> countpairs_kernels_float.c
sed -e "/DOUBLE_PREC/!s/DOUBLE/double/g" countpairs_kernels.c.src >> countpairs_kernels_double.c
sed -e "/DOUBLE_PREC/!s/DOUBLE/float/g" ../../utils/gridlink_impl.h.src >> ../../utils/gridlink_impl_float.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/double/g" ../../utils/gridlink_impl.h.src >> ../../utils/gridlink_impl_double.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/float/g" ../../utils/gridlink_utils.h.src >> ../../utils/gridlink_utils_float.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/double/g" ../../utils/gridlink_utils.h.src >> ../../utils/gridlink_utils_double.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/float/g" ../../utils/cellarray.h.src >> ../../utils/cellarray_float.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/double/g" ../../utils/cellarray.h.src >> ../../utils/cellarray_double.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/float/g" ../../utils/cell_pair.h.src >> ../../utils/cell_pair_float.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/double/g" ../../utils/cell_pair.h.src >> ../../utils/cell_pair_double.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/double/g" ../../utils/weight_functions.h.src >> ../../utils/weight_functions_double.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/float/g" ../../utils/weight_functions.h.src >> ../../utils/weight_functions_float.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/double/g" countpairs_impl.c.src >> countpairs_impl_double.c
gcc -DDOUBLE_PREC -DVERSION=\"2.3.3\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3 -ftree-vectorize -funroll-loops -fprefetch-loop-arrays --param simultaneous-prefetches=4 -fopenmp -funroll-loops -march=native -fno-strict-aliasing -Wformat=2 -Wpacked -Wnested-externs -Wpointer-arith -Wredundant-decls -Wfloat-equal -Wcast-qual -Wcast-align -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wstrict-prototypes -Wno-unused-local-typedefs -I../../io -I../../utils -c countpairs_impl_double.c -o countpairs_impl_double.o
In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/immintrin.h:43:0,
from ../../utils/avx512_calls.h:16,
from ../../utils/weight_functions_double.h:12,
from countpairs_kernels_double.c:23,
from countpairs_impl_double.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx2intrin.h: In function ‘countpairs_avx512_intrinsics_double’:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx2intrin.h:973:20: error: the last argument must be an 8-bit immediate
return (__m256i) __builtin_ia32_pblendd256 ((__v8si)__X,
^
../../rules.mk:61: recipe for target 'countpairs_impl_double.o' failed
make[2]: *** [countpairs_impl_double.o] Error 1
System information:
$ head -n 2 /etc/os-release
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
$ make -v
GNU Make 4.1
...
$ gcc --version
gcc (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010
...
$ pkg-config --modversion gsl
2.5
$ python --version
Python 3.6.0 :: Continuum Analytics, Inc.
$ which python
/cm/shared/apps/conda-environments/python36/bin/python
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
stepping : 4
microcode : 0x2000064
cpu MHz : 2683.513
cache size : 14080 KB
physical id : 0
siblings : 20
core id : 0
cpu cores : 10
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips : 4400.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
... [x40]
Oh, but we do have gcc v9 on this machine, and that seems to work.
Thanks for the report! Glad you have a workaround, but we should get gcc 5 to work anyway. I don't have a gcc 5 environment immediately available, but let me see if I can drum one up... Feel free to leave the issue open in the meantime.
For the record,
$ gcc --version gcc (Ubuntu 9.2.1-17ubuntu1~16.04) 9.2.1 20191102 $ pip install --no-cache-dir Corrfunc
succeeds.
Or update the docs to note this requirement :) Thanks!
Probably related to #183
@dstndstn Could you share the output of gcc -march=native -dM -E - < /dev/null | grep AVX512
with your gcc 5 compiler?
My best guess is that the CPU supports AVX512VL but not the gcc 5 compiler, so we're falling back to the AVX2 version of AVX512_BLEND_INTS_WITH_MASK
. But when we do so, we're trying to use a non-immediate mask operand, which, according to that compiler error, is unsupported.
On the other hand, the gcc 5 release notes suggest it should support AVX512VL. So the preprocessor output will be interesting!
$ gcc -march=native -dM -E - < /dev/null | grep AVX512
#define __AVX512F__ 1
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
$ gcc --version
gcc (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010
Thanks. So indeed, despite the CPU (and supposedly the compiler??) supporting AVX512VL, the compiler is not trying to use it.
@manodeep Maybe there's a way to detect this scenario manually and pass -mavx512vl
to the compiler in addition to -march=native
. But regardless, I think the AVX2 int blend fallback will not work as written, because it uses non-immediate operands. So we may need to remove that fallback in favor of non-vectorized code, or refactor the part of the kernel that uses it.
Interesting! Looks like both gcc5.5.0
and gcc6.4.0
do not enable AVX512VL with -march=native
[~ @farnarkle2] ml --force purge && ml gcc/5.5.0
[~ @farnarkle2] gcc -march=native -dM -E - < /dev/null | grep AVX512
#define __AVX512F__ 1
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
[~ @farnarkle2] ml --force purge && ml gcc/6.4.0
[~ @farnarkle2] gcc -march=native -dM -E - < /dev/null | grep AVX512
#define __AVX512F__ 1
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
[~ @farnarkle2] ml --force purge && ml gcc/7.3.0
[~ @farnarkle2] gcc -march=native -dM -E - < /dev/null | grep AVX512
#define __AVX512F__ 1
#define __AVX512BW__ 1
#define __AVX512VL__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
@dstndstn If you add CFLAGS=-mavx512vl
at the very top of common.mk
, does the code then compile with gcc5.5
?
@lgarrison Perhaps we should print out an info message warning users on AVX512-systems with gcc < 7.3
.
@manodeep If we can detect the scenario, it seems we should just fix it ourselves my adding the -mavx512vl
flag. The flag is mentioned in the release notes: https://gcc.gnu.org/gcc-5/changes.html
But not in the corresponding doc page: https://gcc.gnu.org/onlinedocs/gcc-5.5.0/gcc/x86-Options.html#x86-Options
My guess it is it's supported, just not automatically added with the -march=native
flag. Does gcc 5 allow it for you on farnakle?
Regardless, I think the AVX2 fallback has a real bug.
If I do (with gcc 5.5.0)
export CFLAGS=-mavx512vl
make
it gets further but dies with
sed -e "/DOUBLE_PREC/!s/DOUBLE/float/g" countpairs_rp_pi_impl.c.src >> countpairs_rp_pi_impl_float.c
gcc -DNDOUBLE_PREC -mavx512vl -DVERSION=\"2.3.3\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3 -ftree-vectorize -funroll-loops -fprefetch-loop-arrays --param simultaneous-prefetches=4 -fopenmp -funroll-loops -march=native -fno-strict-aliasing -Wformat=2 -Wpacked -Wnested-externs -Wpointer-arith -Wredundant-decls -Wfloat-equal -Wcast-qual -Wcast-align -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wstrict-prototypes -Wno-unused-local-typedefs -DVERSION=\"2.3.3\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3 -ftree-vectorize -funroll-loops -fprefetch-loop-arrays --param simultaneous-prefetches=4 -fopenmp -funroll-loops -march=native -fno-strict-aliasing -Wformat=2 -Wpacked -Wnested-externs -Wpointer-arith -Wredundant-decls -Wfloat-equal -Wcast-qual -Wcast-align -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wstrict-prototypes -Wno-unused-local-typedefs -I../../io -I../../utils -c countpairs_rp_pi_impl_float.c -o countpairs_rp_pi_impl_float.o
In file included from ../../utils/weight_functions_float.h:12:0,
from countpairs_rp_pi_kernels_float.c:23,
from countpairs_rp_pi_impl_float.c:22:
countpairs_rp_pi_kernels_float.c: In function ‘countpairs_rp_pi_avx512_intrinsics_float’:
../../utils/avx512_calls.h:175:46: warning: implicit declaration of function ‘_mm512_abs_ps’ [-Wimplicit-function-declaration]
#define AVX512_ABS_FLOAT(X) _mm512_abs_ps(X)
^
countpairs_rp_pi_kernels_float.c:210:23: note: in expansion of macro ‘AVX512_ABS_FLOAT’
m_zdiff = AVX512_ABS_FLOAT(m_zdiff);//now take the absolute value
^
In file included from countpairs_rp_pi_impl_float.c:22:0:
countpairs_rp_pi_kernels_float.c:210:13: warning: nested extern declaration of ‘_mm512_abs_ps’ [-Wnested-externs]
m_zdiff = AVX512_ABS_FLOAT(m_zdiff);//now take the absolute value
^
countpairs_rp_pi_kernels_float.c:210:21: error: incompatible types when assigning to type ‘__m512 {aka __vector(16) float}’ from type ‘int’
m_zdiff = AVX512_ABS_FLOAT(m_zdiff);//now take the absolute value
^
../../rules.mk:64: recipe for target 'countpairs_rp_pi_impl_float.o' failed
Thanks. I think that one is probably this bug: https://stackoverflow.com/questions/51290930/why-does-gcc-not-provide-the-avx512f-intrinsic-for-absolute-value-of-floats?rq=1
Looks like we can work around this one as well. That's two workarounds for gcc 5... but gcc 5.5.0 is only 3 years old, not sure it makes sense to drop support!
gcc
is terrible when it comes to intrinsics. For instance, we already have a non-standard AVX_ABS_FLOAT
in avx_calls.h
here. It is trivial to do the same for AVX512_ABS_FLOAT
or better yet, directly take the implementation from the gcc patch listed in the link above.
I am wondering if we should report these two bugs (missing intrinsics, missing __AVX512VL__
under -march=native
on SKX cpus) to gcc
.
@lgarrison I agree with you - we should be able to support compilers that are only 3 years old.
After looking through gcc docs (https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#x86-Options), apparently -march=skylake
does NOT add any AVX512 instructions. For that, we need -march=skylake-avx512
; and that target architecture is not available on gcc5 (only gcc6 or higher). Since we are seeing quite a few of the AVX512 isa's enabled by -march=native
, I am uncertain what -march-native
is translating to on a SKX cpu (something in between skylake
and skylake-avx512
).
I think there's 3 things to do here:
- Detect if the CPU and compiler support
-mavx512vl
but__AVX512VL__
is not present in the preprocessor output for-march=native
. If so, add-mavx512vl
toCFLAGS
. - Provide
_mm512_abs_ps()
if it is not present (needs more thought, how do we detect if this is defined? Does the system library define a macro? Or should we always just use our own implementation?) - Remove the AVX2 fallback, If I'm reading the error messages correctly, I don't think it will work under any circumstances.
@manodeep, does this all sound right to you?
This will require a bit of thinking so that the complexity does not go up (and that we are not implementing configure
like capabilities into Corrfunc
). The original reason to only target AVX512F
was to make sure that Corrfunc runs on both SKX+ and KNL+ hardware. However, the code currently fails to compile on KNL for this same issue (#183)
For the abs_ps
, icc/clang/gcc >= 7.3
always provide the intrinsics (I think), and therefore we could simply use a custom one for lower gcc versions.
The AVX2 definitely is incorrect and needs to be replaced with _mm512_mask_blend_pd
+ some no-op casts from _m512i -> _m512d -> _m512i
to keep the compiler happy. (Related, I am not sure we need _m256i
- we could just use _m512i
and the remaining 4 bins are all identically set to 0. We might just need to grab the lower half of the zmm
register or however the array indexing works...). The upside could be that we fix the KNL build failure and manage to run Corrfunc on that