intgemm
intgemm copied to clipboard
Intgemm is not compiling with icc (>16).
While compiling intgemm with one of the latest versions of the ICC (icpc (ICC) 19.1.3.304) I got the following result:
benchmarks/../intgemm/callbacks/implementations.inl(47): error #3632: "target" attribute on special function is not supported
CPU_ATTR void operator()(vi input, const OutputBufferInfo& info) {
^
as far as I know the attribute ((target ("sse2/sse3/avx2"))) that is used here is a "gnu extension" to the c/c++ programming language and it is not working with ICC compiler.
Does 65276ad59ab9cd5b2bc623c2411f481f79aa7c5c fix this?
@sidkashyap-at-Intel Can you setup CI for icc?
Does 65276ad fix this?
I cherry-picked the intgemm, compiled it and got the following output: compile.log.
Sigh there's a bug in the Intel compiler in which constructors with target attributes compile but fail to link. Minimal example:
/* Compiles then fails to link */
class Foo {
public:
__attribute__ ((target ("avx2"))) Foo() {
}
};
int main() {
Foo a;
}
Command:
kheafield@var:~$ . /opt/intel/compilers_and_libraries_2020.4.304/linux/bin/compilervars.sh intel64
kheafield@var:~$ icpc bug.cc
/tmp/icpcqF2ZaF.o: In function `main':
bug.cc:(.text+0x30): undefined reference to `Foo::Foo()'
Whereas a regular method compiles and links fine:
/* Works */
class Foo {
public:
Foo() {}
__attribute__ ((target ("avx2"))) void Bar() {
}
};
int main() {
Foo a;
a.Bar();
}
@akarbown it looks like you work for Intel? Can you help with icc bugs? We collaborate with @sidkashyap-at-Intel.
@XapaJIaMnu says he tried to report this to Intel two years ago via @sidkashyap-at-Intel but couldn't figure out how to raise it to the team.
This issue was indeed discussed with the team; @akarbown, will ping you offline to take this forward.
@kpu: Yes, will setup a CI for icc.
Sigh there's a bug in the Intel compiler in which constructors with target attributes compile but fail to link. Minimal example:
/* Compiles then fails to link */ class Foo { public: __attribute__ ((target ("avx2"))) Foo() { } }; int main() { Foo a; }
Command:
kheafield@var:~$ . /opt/intel/compilers_and_libraries_2020.4.304/linux/bin/compilervars.sh intel64 kheafield@var:~$ icpc bug.cc /tmp/icpcqF2ZaF.o: In function `main': bug.cc:(.text+0x30): undefined reference to `Foo::Foo()'
Whereas a regular method compiles and links fine:
/* Works */ class Foo { public: Foo() {} __attribute__ ((target ("avx2"))) void Bar() { } }; int main() { Foo a; a.Bar(); }
@akarbown it looks like you work for Intel? Can you help with icc bugs? We collaborate with @sidkashyap-at-Intel.
It compiles when it looks as follows:
/* Compiles then fails to link */
class Foo {
public:
__attribute__ ((target ("avx2"))) Foo() {}
__attribute__ ((target ("default"))) Foo() {}
// Foo() {
// }
};
int main() {
Foo a;
}
I get to know that it is because the compiler creates a resolver function something like this:
Func foo-resolver
If (target == avx2) return & foo2
Else return &foo1
It seems to be pretty interesting that for a regular function it works without the attribute default.
Target attributes pre-date function multiversioning. Function multiversioning is a feature that should only be activated if there are two function declarations with the same signature that differ only in target attributes. The bug in the Intel compiler is that it activates function multiversioning for constructors when there is only one version.
We wanted to use function multi-versioning in intgemm but it's actually quite bad for performance due to adding the CPUID dispatch to all the functions, even in contexts where the target architecture is already known: https://hannes.hauswedell.net/post/2017/12/09/fmv/ . So we do target attributes inside the code then a call-by-function-pointer on the external APIs.
I tried to find documentation on icc. To quote page 2212 of https://software.intel.com/content/dam/develop/external/us/en/documents/cpp_compiler_classic.pdf :
On Linux*, in addition to the Intel-defined attributes cpu_specific and cpu_dispatch , C++ compilations with GNU Compiler Collection (GCC*) compatibility 4.8 or higher support creation of multiple function versions using the target attribute. For more information see the GCC documentation on "Function Multiversioning".
So it would seem the defined behavior for icc is to match whatever gcc does. And the gcc documentation https://gcc.gnu.org/wiki/FunctionMultiVersioning states
Determine if two function decls with the same signature are versions.
It further defines this as
Two function decls with the same signature are versions if and only if both are tagged with the function attribute "target" and the target attribute strings differ.
I don't see anything about a constructor being treated differently. And gcc empirically handles a single constructor with a target attribute correctly.
Moreover, icc is moving to a clang-based version of target attributes, which is what we are doing: https://software.intel.com/content/www/us/en/develop/articles/porting-guide-for-icc-users-to-dpcpp-or-icx.html
I tried #pragma intel optimization_parameter target_arch
and that might work but it has to appear before template
whereas __attribute__
has to appear after template
. So it's not a drop-in substitute.
But anyway try 2647d6c ; I just removed the target attributes entirely from icc. Possibly less efficient via icc but deservedly so.
Thanks for the temporary solution! I'll try to poke a compiler team in that case.
@akarbown: Have forwarded the meeting minutes with the MKL where we discussed these issues, let us sync offline.
I've created jira issue internally. If you have any more questions connected with that issue I think that @mibintc could give you more information or direct you to the proper person.
When the icc compiler sees a target attribute, it thinks you are using gnu function multiversioning. Since your program didn't provide a "default" function, the multiversion definition wasn't complete. The icc compiler really doesn't support the other meaning of target attribute. As I understand it from reading the gcc documentation, __attribute((target("avx")) which is not gnu function multiversioning means that the compiler is entitled to generate avx2 instructions for the function body. If the code was executed on a cpu that didn't support avx2 then you would get an instruction fault. I believe that you could get the effect of avx2 instructions by separating out Foo()'s definition into a separate compilation unit and choose target architecture avx2 using the command line option. Since icc doesn't support the other meaning of target attribute, we could ignore it, I don't know if that would be sufficient to meet your needs. I haven't studied your code base, I just looked at the small example provided.
When the icc compiler sees a target attribute, it thinks you are using gnu function multiversioning. Since your program didn't provide a "default" function, the multiversion definition wasn't complete.
If that is your policy then this should be an error. But it compiles fine.
/* Works */
class Foo {
public:
Foo() {}
__attribute__ ((target ("avx2"))) void Bar() {
}
};
int main() {
Foo a;
a.Bar();
}
The icc compiler really doesn't support the other meaning of target attribute.
Ok. Did I miss some documentation?
As I understand it from reading the gcc documentation, __attribute((target("avx")) which is not gnu function multiversioning means that the compiler is entitled to generate avx2 instructions for the function body. If the code was executed on a cpu that didn't support avx2 then you would get an instruction fault.
Correct. The goal here is to avoid CPUID dispatch for every little internal function call and only do it once at the top-level interface. If every function call had multiversioning, it would be very slow without a compiler that realizes it's already in a achitecture-dependent function and can directly call / inline the next architecture-dependent function.
I believe that you could get the effect of avx2 instructions by separating out Foo()'s definition into a separate compilation unit and choose target architecture avx2 using the command line option.
We have user-defined C++ template arguments. Separate compilation units for each architecture don't really work for that, because it would require our users to separately compile all their code separately for each architecture. Suppose you had some architecture-dependent optimization for std::vector.
Since icc doesn't support the other meaning of target attribute, we could ignore it, I don't know if that would be sufficient to meet your needs. I haven't studied your code base, I just looked at the small example provided.
It appears that icc allows any intrinsic inside any function, regardless of architecture consistency. Therefore this would work for us and be compatible with gcc's behavior, albeit with the possibility of missed optimizations.
When the icc compiler sees a target attribute, it thinks you are using gnu function multiversioning. Since your program didn't provide a "default" function, the multiversion definition wasn't complete.
If that is your policy then this should be an error. But it compiles fine.
/* Works */ class Foo { public: Foo() {} __attribute__ ((target ("avx2"))) void Bar() { } }; int main() { Foo a; a.Bar(); }
The icc compiler really doesn't support the other meaning of target attribute.
Ok. Did I miss some documentation?
As I understand it from reading the gcc documentation, __attribute((target("avx")) which is not gnu function multiversioning means that the compiler is entitled to generate avx2 instructions for the function body. If the code was executed on a cpu that didn't support avx2 then you would get an instruction fault.
Correct. The goal here is to avoid CPUID dispatch for every little internal function call and only do it once at the top-level interface. If every function call had multiversioning, it would be very slow without a compiler that realizes it's already in a achitecture-dependent function and can directly call / inline the next architecture-dependent function. On Linux, with gnu function multiversioning, the resolver function runs when the program is loaded and the call site is patched with the resolved address: the overhead is incurred at load time.
I believe that you could get the effect of avx2 instructions by separating out Foo()'s definition into a separate compilation unit and choose target architecture avx2 using the command line option.
We have user-defined C++ template arguments. Separate compilation units for each architecture don't really work for that, because it would require our users to separately compile all their code separately for each architecture. Suppose you had some architecture-dependent optimization for std::vector.
Since icc doesn't support the other meaning of target attribute, we could ignore it, I don't know if that would be sufficient to meet your needs. I haven't studied your code base, I just looked at the small example provided.
It appears that icc allows any intrinsic inside any function, regardless of architecture consistency. Therefore this would work for us and be compatible with gcc's behavior, albeit with the possibility of missed optimizations. Yes it's true, icc allows any intrinsic regardless of architecture consistency. Thanks for the followup.
@mibintc we intended to make use of function multiversioning initially, but it appears compiler support for it is extremely patchy. Here's a list of bugs against clang, gcc and icc:
from GCC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90129 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90260 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57378 (especially annoying since it's more than 5 years old) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929 (already fixed)
from clang: https://bugs.llvm.org/show_bug.cgi?id=41482 https://bugs.llvm.org/show_bug.cgi?id=41613 https://bugs.llvm.org/show_bug.cgi?id=41614 https://bugs.llvm.org/show_bug.cgi?id=41386
For ICC (inline code snippets, with examples).
-
__attribute__((target("avx512dq"))) (or avx512bw) not recognized
(works on clang/gcc). Workaround is to use avx512f as the target. I am not sure if this means that avx512f permits the emission of every single possible AVX512 instruction or that it forbits avx512bw/dq unless intrinsics are used specifically, but either way it's not the correct behaviour.
#include <xmmintrin.h>
#include <emmintrin.h>
#include <immintrin.h>
//clang and gcc expect the precise instruction set name
//(eg avx512bw and avx512f)/ icc only works with avx512f
__attribute__((target("avx512bw"))) static inline __m512 max_ps(__m512 first, __m512 second) {
return _mm512_max_ps(first, second);
}
// Technically __AVX512DQ__
__attribute__((target("avx512dq"))) static inline __m512 and_ps(__m512 first, __m512 second) {
return _mm512_and_ps(first, second);
}
int main()
{
__m512 first, second, res1, res2;
res1 = max_ps(first, second);
res2 = and_ps(first, second);
return 2;
}
-
"target" attribute on special function is not supported for function operator()
(works on clang/gcc) For whatever reasonsoperator()
is not allowed its target attribute. Other class functions (or class constructors) are allowed to use target attribute.
class foo {
public:
int a;
foo() {
a = 1;
}
__attribute__ ((target("avx2"))) int operator()() {
return a+2;
}
};
int main() {
foo a = foo();
return a();
}
- function multiversioning doesn't support templated functions We would like to template some functions that are generic between architecture versions, except for the register width. (this one doesn't work on clang/gcc) and is a major functionality that we need should we want to use function multiversioning
template <class T>
__attribute__ ((target("avx2"))) T foo(T num) {
return num + 1;
}
template <class T>
__attribute__ ((target("default"))) T foo(T num) {
return num + 2;
}
int __attribute__ ((target("avx2"))) main() {
return foo(1);
}
-
__attribute__ ((target("avx2")))
on a main() gives undefined reference to main (works on clang/gcc)
__attribute__ ((target("avx2"))) int foo(int num) {
return num + 1;
}
__attribute__ ((target("default"))) double foo(double num) {
return num + 2;
}
int __attribute__ ((target("avx2"))) main() {
return foo(1);
}
I think markdown indentation ate your post. Can you space it out more?
While it's nice that call sites can be patched at load time, function multiversioning still adds an optimization barrier and prevents inlining. See https://godbolt.org/z/E9o7hP . Input code:
class Specialized {
public:
__attribute__((target("avx2"))) Specialized() {}
__attribute__((target("default"))) Specialized() {}
};
void __attribute__((target("avx2"))) Inefficient() {
Specialized f;
}
void __attribute__((target("default"))) Inefficient() {
Specialized f;
}
class NotSpecialized {
public:
NotSpecialized() {}
};
void __attribute__((target("avx2"))) OptimizesProperly() {
NotSpecialized f;
}
void __attribute__((target("default"))) OptimizesProperly() {
NotSpecialized f;
}
We can see in the output that there's code in Inefficient().avx2
with a whole call. Whereas without the function multiversioning, NotSpecialized::NotSpecialized
inlines to nothing.
Inefficient().avx2:
push rsi #8.52
lea rdi, QWORD PTR [rsp] #9.17
call Specialized::Specialized() [complete object constructor] #9.17
pop rcx #10.1
ret #10.1
Inefficient():
push rsi #12.55
lea rdi, QWORD PTR [rsp] #13.17
call Specialized::Specialized() [complete object constructor] #13.17
pop rcx #14.1
ret #14.1
OptimizesProperly().avx2:
ret #23.1
OptimizesProperly():
ret #27.1
This might be because the compiler doesn't know if I've made a third avx512f
version of Specialized::Specialized
Due to the performance overhead, we've abandoned function multiversioning in favor of separate namespaces for each architecture that call within the same namespace using functions with a single target attribute defined. The compilers are then able to properly inline functions with the same target attribute into each other and we only do dispatch outside.
@mibintc I really appreciate you stopping by!
BTW I asked the technical support team to help out. I downloaded intgemm and skimmed the source for target attribute. It looks like you are using conditional compilation to avoid using target attribute for MSVC. You could lump the __INTEL_COMPILER into that, thereby avoiding the use of the target attribute, since the target attribute isn't adding anything in the way that you are using it and only causing pain. Even if we fix this issue in icc, (i.e. by more or less ignoring the target attribute) due to the release cadence there would be a frustrating wait. I also want to draw your attention to the next generation Intel C++ compiler (macro __INTEL_LLVM_COMPILER) which is clang based and available for installation without cost. In contrast to icc, the icx compiler does pay attention to target attribute and it does force you to match the target to the intrinsics being used. icx/icpx is in this package: https://software.intel.com/content/www/us/en/develop/tools/oneapi.html
We are actually compiling with the (old) Intel compiler as of 5 January https://github.com/kpu/intgemm/commit/2647d6c129ccb1cef486628685bb80a85158459a (as noted above) and you can see the CI here https://github.com/kpu/intgemm/runs/1919056224 . It works by removing all target attributes from the Intel compiler, like it does for MSVC.
In theory the lack of target attributes prevents the compiler from generating its own instructions. In practice, we don't leave much room for the compiler to do that. Though it might happen for some loads where we just use * instead of an intrinsic.
We're aware of the llvm-based compiler and already support clang-type behavior.
By the way though why doesn't the Intel compiler support target("avx512bw")
? Seems very off-brand for Intel.
@anoopmad could you please help @kpu with the compiler specific questions that have been raised in this thread? AFAIK, the issue doesn't seem to appear while compiling with icx.