Intgemm is not compiling with icc (>16).

While compiling intgemm with one of the latest versions of the ICC (icpc (ICC) 19.1.3.304) I got the following result:

benchmarks/../intgemm/callbacks/implementations.inl(47): error #3632: "target" attribute on special function is not supported
    CPU_ATTR void operator()(vi input, const OutputBufferInfo& info) {
                  ^

As far as I know, the attribute __attribute__ ((target ("sse2/sse3/avx2"))) that is used here is a GNU extension to the C/C++ programming language, and it does not work with the ICC compiler.

Jan 04 '21 18:01 akarbown

Does 65276ad59ab9cd5b2bc623c2411f481f79aa7c5c fix this?

Jan 04 '21 20:01 kpu

@sidkashyap-at-Intel Can you set up CI for icc?

Jan 04 '21 20:01 kpu

Does 65276ad fix this?

I cherry-picked the intgemm, compiled it and got the following output: compile.log.

Jan 04 '21 22:01 akarbown

Sigh, there's a bug in the Intel compiler: constructors with target attributes compile but fail to link. Minimal example:

/* Compiles then fails to link */
class Foo {
  public:
  __attribute__ ((target ("avx2"))) Foo() {
  }
};

int main() {
  Foo a;
}

Command:

kheafield@var:~$ . /opt/intel/compilers_and_libraries_2020.4.304/linux/bin/compilervars.sh intel64
kheafield@var:~$ icpc bug.cc 
/tmp/icpcqF2ZaF.o: In function `main':
bug.cc:(.text+0x30): undefined reference to `Foo::Foo()'

Whereas a regular method compiles and links fine:

/* Works */
class Foo {
  public:
    Foo() {}
    __attribute__ ((target ("avx2"))) void Bar() {
    }
};

int main() {
  Foo a;
  a.Bar();
}

@akarbown it looks like you work for Intel? Can you help with icc bugs? We collaborate with @sidkashyap-at-Intel.

Jan 05 '21 00:01 kpu

@XapaJIaMnu says he tried to report this to Intel two years ago via @sidkashyap-at-Intel but couldn't figure out how to raise it to the team.

Jan 05 '21 00:01 kpu

This issue was indeed discussed with the team; @akarbown, will ping you offline to take this forward.

@kpu: Yes, will set up CI for icc.

Jan 05 '21 00:01 sidkashyap-at-Intel

The constructor example compiles and links when a "default" version is added:

/* Compiles and links: a "default" version is present */
class Foo {
  public:
  __attribute__ ((target ("avx2"))) Foo() {}
  __attribute__ ((target ("default"))) Foo() {}
//  Foo() {
//  }
};

int main() {
  Foo a;
}

I learned that this is because the compiler creates a resolver function, something like this:

Func foo-resolver
  If (target == avx2) return &foo2
  Else return &foo1

Interestingly, a regular function works without a "default" version.

Jan 05 '21 18:01 akarbown

Target attributes pre-date function multiversioning. Function multiversioning is a feature that should only be activated if there are two function declarations with the same signature that differ only in target attributes. The bug in the Intel compiler is that it activates function multiversioning for constructors when there is only one version.

We wanted to use function multi-versioning in intgemm but it's actually quite bad for performance due to adding the CPUID dispatch to all the functions, even in contexts where the target architecture is already known: https://hannes.hauswedell.net/post/2017/12/09/fmv/ . So we do target attributes inside the code then a call-by-function-pointer on the external APIs.
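A minimal sketch of the pattern described above (target attributes on the internal code, one call-by-function-pointer at the external API), assuming GCC or clang on x86. The names `SumAvx2`, `SumBaseline`, `ChooseSum` and `Sum` are hypothetical and are not intgemm's actual API.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical kernels for illustration; not intgemm's real functions.
// GCC/clang may emit AVX2 instructions in this body, so it must only be
// reached on a CPU that supports AVX2.
__attribute__((target("avx2")))
std::int32_t SumAvx2(const std::int32_t* data, std::size_t n) {
  std::int32_t total = 0;
  for (std::size_t i = 0; i < n; ++i) total += data[i];
  return total;
}

// Fallback compiled for the default target of the translation unit.
std::int32_t SumBaseline(const std::int32_t* data, std::size_t n) {
  std::int32_t total = 0;
  for (std::size_t i = 0; i < n; ++i) total += data[i];
  return total;
}

using SumFn = std::int32_t (*)(const std::int32_t*, std::size_t);

// The CPU check runs once, here, instead of at every internal call.
inline SumFn ChooseSum() {
  __builtin_cpu_init();
  return __builtin_cpu_supports("avx2") ? SumAvx2 : SumBaseline;
}

// External entry point: a function pointer resolved during static
// initialization and then called like a normal function.
static const SumFn Sum = ChooseSum();
```

Callers use `Sum(data, n)` and pay for one indirect call; nothing inside the kernels needs to ask which CPU it is running on.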

Jan 05 '21 18:01 kpu

I tried to find documentation on icc. To quote page 2212 of https://software.intel.com/content/dam/develop/external/us/en/documents/cpp_compiler_classic.pdf :

On Linux*, in addition to the Intel-defined attributes cpu_specific and cpu_dispatch , C++ compilations with GNU Compiler Collection (GCC*) compatibility 4.8 or higher support creation of multiple function versions using the target attribute. For more information see the GCC documentation on "Function Multiversioning".

So it would seem the defined behavior for icc is to match whatever gcc does. And the gcc documentation https://gcc.gnu.org/wiki/FunctionMultiVersioning states

Determine if two function decls with the same signature are versions.

It further defines this as

Two function decls with the same signature are versions if and only if both are tagged with the function attribute "target" and the target attribute strings differ.

I don't see anything about a constructor being treated differently. And gcc empirically handles a single constructor with a target attribute correctly.
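To make the quoted rule concrete, here is a small sketch, assuming GCC or clang on x86 Linux (function multiversioning relies on ifunc support). `WhichVersion` is a hypothetical name; `Foo` mirrors the minimal example earlier in this thread.

```cpp
// Same signature, both tagged with "target", and the strings differ:
// by the rule quoted above, these two are versions of one function and
// the compiler generates a dispatcher for them.
__attribute__((target("default"))) int WhichVersion() { return 0; }
__attribute__((target("avx2"))) int WhichVersion() { return 2; }

// Only one declaration exists, so there is nothing to version against.
// Under the gcc rule this is an ordinary constructor whose body may be
// compiled with AVX2 enabled.
class Foo {
 public:
  __attribute__((target("avx2"))) Foo() {}
};
```

Calling `WhichVersion()` returns 2 on a CPU with AVX2 and 0 otherwise; `Foo` needs no "default" counterpart under this reading.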

Moreover, icc is moving to a clang-based version of target attributes, which is what we are doing: https://software.intel.com/content/www/us/en/develop/articles/porting-guide-for-icc-users-to-dpcpp-or-icx.html

I tried #pragma intel optimization_parameter target_arch and that might work but it has to appear before template whereas __attribute__ has to appear after template. So it's not a drop-in substitute.

But anyway try 2647d6c ; I just removed the target attributes entirely from icc. Possibly less efficient via icc but deservedly so.
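A rough sketch of what such a workaround can look like, assuming the attribute is hidden behind a macro. The macro name `EXAMPLE_TARGET_AVX2` and the function `AddOne` are hypothetical; the real change is in the commit above and may be structured differently.

```cpp
// Hypothetical macro for illustration only.
#if defined(_MSC_VER) || defined(__INTEL_COMPILER)
// MSVC and the classic Intel compiler: emit no target attribute at all.
#define EXAMPLE_TARGET_AVX2
#else
// GCC and clang: allow AVX2 code generation inside the function body.
#define EXAMPLE_TARGET_AVX2 __attribute__((target("avx2")))
#endif

// The same source line compiles everywhere; only GCC/clang see the attribute.
EXAMPLE_TARGET_AVX2 int AddOne(int x) { return x + 1; }
```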

Jan 05 '21 22:01 kpu

Thanks for the temporary solution! I'll try to poke the compiler team in that case.

Jan 07 '21 09:01 akarbown

@akarbown: Have forwarded the meeting minutes with the MKL where we discussed these issues, let us sync offline.

Jan 07 '21 14:01 sidkashyap-at-Intel

I've created a Jira issue internally. If you have any more questions connected with that issue, I think that @mibintc could give you more information or direct you to the proper person.

Feb 16 '21 19:02 akarbown

When the icc compiler sees a target attribute, it thinks you are using gnu function multiversioning. Since your program didn't provide a "default" function, the multiversion definition wasn't complete. The icc compiler really doesn't support the other meaning of the target attribute. As I understand it from reading the gcc documentation, __attribute__((target("avx2"))) without gnu function multiversioning means that the compiler is entitled to generate avx2 instructions for the function body. If the code were executed on a cpu that didn't support avx2, you would get an instruction fault. I believe that you could get the effect of avx2 instructions by separating out Foo()'s definition into a separate compilation unit and choosing target architecture avx2 using the command line option. Since icc doesn't support the other meaning of the target attribute, we could ignore it; I don't know if that would be sufficient to meet your needs. I haven't studied your code base; I just looked at the small example provided.

Feb 16 '21 20:02 mibintc

When the icc compiler sees a target attribute, it thinks you are using gnu function multiversioning. Since your program didn't provide a "default" function, the multiversion definition wasn't complete.

If that is your policy then this should be an error. But it compiles fine.

/* Works */
class Foo {
  public:
    Foo() {}
    __attribute__ ((target ("avx2"))) void Bar() {
    }
};

int main() {
  Foo a;
  a.Bar();
}

The icc compiler really doesn't support the other meaning of target attribute.

Ok. Did I miss some documentation?

As I understand it from reading the gcc documentation, __attribute__((target("avx2"))) without gnu function multiversioning means that the compiler is entitled to generate avx2 instructions for the function body. If the code were executed on a cpu that didn't support avx2, you would get an instruction fault.

Correct. The goal here is to avoid CPUID dispatch for every little internal function call and only do it once at the top-level interface. If every function call had multiversioning, it would be very slow without a compiler that realizes it's already in an architecture-dependent function and can directly call / inline the next architecture-dependent function.

I believe that you could get the effect of avx2 instructions by separating out Foo()'s definition into a separate compilation unit and choose target architecture avx2 using the command line option.

We have user-defined C++ template arguments. Separate compilation units for each architecture don't really work for that, because it would require our users to compile all their code separately for each architecture. Suppose you had some architecture-dependent optimization for std::vector.

Since icc doesn't support the other meaning of target attribute, we could ignore it, I don't know if that would be sufficient to meet your needs. I haven't studied your code base, I just looked at the small example provided.

It appears that icc allows any intrinsic inside any function, regardless of architecture consistency. Therefore this would work for us and be compatible with gcc's behavior, albeit with the possibility of missed optimizations.
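For contrast, GCC and clang check that intrinsics match the target of the enclosing function. A small sketch, assuming GCC or clang on x86; `SumOfDoubled` is a hypothetical name used only for illustration.

```cpp
#include <immintrin.h>

// With the attribute, GCC and clang accept AVX/AVX2 intrinsics here even when
// the file is compiled without -mavx2. Without the attribute (and without
// -mavx2) they are expected to reject the intrinsic calls as a target mismatch.
__attribute__((target("avx2")))
int SumOfDoubled(int value) {
  __m256i v = _mm256_set1_epi32(value);      // eight copies of value
  __m256i doubled = _mm256_add_epi32(v, v);  // AVX2 integer add
  alignas(32) int out[8];
  _mm256_store_si256(reinterpret_cast<__m256i*>(out), doubled);
  int total = 0;
  for (int x : out) total += x;
  return total;
}
```

The function must only be called after a runtime check such as `__builtin_cpu_supports("avx2")`, since the attribute permits AVX2 instructions in the body.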

Feb 16 '21 21:02 kpu

The goal here is to avoid CPUID dispatch for every little internal function call and only do it once at the top-level interface.

On Linux, with gnu function multiversioning, the resolver function runs when the program is loaded and the call site is patched with the resolved address: the overhead is incurred at load time.

It appears that icc allows any intrinsic inside any function, regardless of architecture consistency.

Yes it's true, icc allows any intrinsic regardless of architecture consistency. Thanks for the followup.

Feb 16 '21 22:02 mibintc

@mibintc we intended to make use of function multiversioning initially, but it appears compiler support for it is extremely patchy. Here's a list of bugs against clang, gcc and icc:

from GCC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90129 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90260 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57378 (especially annoying since it's more than 5 years old) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929 (already fixed)

from clang: https://bugs.llvm.org/show_bug.cgi?id=41482 https://bugs.llvm.org/show_bug.cgi?id=41613 https://bugs.llvm.org/show_bug.cgi?id=41614 https://bugs.llvm.org/show_bug.cgi?id=41386

For ICC (inline code snippets, with examples).

__attribute__((target("avx512dq"))) (or avx512bw) not recognized (works on clang/gcc).

Workaround is to use avx512f as the target. I am not sure if this means that avx512f permits the emission of every single possible AVX512 instruction or that it forbids avx512bw/dq unless intrinsics are used specifically, but either way it's not the correct behaviour.

#include <xmmintrin.h>
#include <emmintrin.h>
#include <immintrin.h>

// clang and gcc expect the precise instruction set name
// (e.g. avx512bw and avx512f); icc only works with avx512f

__attribute__((target("avx512bw"))) static inline __m512 max_ps(__m512 first, __m512 second) {
  return _mm512_max_ps(first, second);
}
// Technically __AVX512DQ__
__attribute__((target("avx512dq"))) static inline __m512 and_ps(__m512 first, __m512 second) {
  return _mm512_and_ps(first, second);
}


int main()
{
    __m512 first, second, res1, res2;
    res1 = max_ps(first, second);
    res2 = and_ps(first, second);
    return 2;
}

"target" attribute on special function is not supported for function operator() (works on clang/gcc).

For whatever reason, operator() is not allowed a target attribute. Other member functions (or class constructors) are allowed to use the target attribute.

class foo {
public:
    int a;
    foo() {
        a = 1;
    }
    
    __attribute__ ((target("avx2"))) int operator()() {
        return a+2;
    }
    
    
};

int main() {
    foo a = foo();
    return a();
}

Function multiversioning doesn't support templated functions.

We would like to template some functions that are generic between architecture versions, except for the register width. This one doesn't work on clang/gcc either, and it is major functionality that we would need should we want to use function multiversioning.

template <class T>
__attribute__ ((target("avx2"))) T foo(T num) {
    return num + 1;
}

template <class T>
__attribute__ ((target("default"))) T foo(T num) {
    return num + 2;
}

int __attribute__ ((target("avx2"))) main() {
    
    return foo(1);
}

__attribute__ ((target("avx2"))) on a main() gives undefined reference to main (works on clang/gcc)

__attribute__ ((target("avx2"))) int foo(int num) {
    return num + 1;
}


__attribute__ ((target("default"))) double foo(double num) {
    return num + 2;
}

int __attribute__ ((target("avx2"))) main() {
    
    return foo(1);
}

Feb 16 '21 23:02 XapaJIaMnu

I think markdown indentation ate your post. Can you space it out more?

While it's nice that call sites can be patched at load time, function multiversioning still adds an optimization barrier and prevents inlining. See https://godbolt.org/z/E9o7hP . Input code:

class Specialized {
  public:
    __attribute__((target("avx2"))) Specialized() {}
    __attribute__((target("default"))) Specialized() {}
};
void __attribute__((target("avx2"))) Inefficient() {
    Specialized f;
}
void __attribute__((target("default"))) Inefficient() {
    Specialized f;
}

class NotSpecialized {
  public:
    NotSpecialized() {}
};
void __attribute__((target("avx2"))) OptimizesProperly() {
    NotSpecialized f;
}
void __attribute__((target("default"))) OptimizesProperly() {
    NotSpecialized f;
}

We can see in the output that there's code in Inefficient().avx2 with a whole call. Whereas without the function multiversioning, NotSpecialized::NotSpecialized inlines to nothing.

Inefficient().avx2:
        push      rsi                                           #8.52
        lea       rdi, QWORD PTR [rsp]                          #9.17
        call      Specialized::Specialized() [complete object constructor]                          #9.17
        pop       rcx                                           #10.1
        ret                                                     #10.1
Inefficient():
        push      rsi                                           #12.55
        lea       rdi, QWORD PTR [rsp]                          #13.17
        call      Specialized::Specialized() [complete object constructor]                          #13.17
        pop       rcx                                           #14.1
        ret                                                     #14.1
OptimizesProperly().avx2:
        ret                                                     #23.1
OptimizesProperly():
        ret                                                     #27.1

This might be because the compiler doesn't know if I've made a third avx512f version of Specialized::Specialized.

Due to the performance overhead, we've abandoned function multiversioning in favor of separate namespaces for each architecture. Functions call within the same namespace and each carries a single target attribute, so the compilers are able to properly inline functions with the same target attribute into each other, and we only do dispatch outside.
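A minimal sketch of that layout, assuming GCC or clang on x86. The namespaces and the functions `Twice`, `Kernel` and `Run` are hypothetical and only illustrate the shape; they are not intgemm's actual code.

```cpp
// One namespace per architecture; every function inside carries the same
// single target attribute, and none of them is multiversioned.
namespace avx2 {
__attribute__((target("avx2"))) inline int Twice(int x) { return 2 * x; }
// Caller and callee share a target, so the compiler is free to inline Twice.
__attribute__((target("avx2"))) inline int Kernel(int x) { return Twice(x) + 1; }
}  // namespace avx2

namespace sse2 {
__attribute__((target("sse2"))) inline int Twice(int x) { return 2 * x; }
__attribute__((target("sse2"))) inline int Kernel(int x) { return Twice(x) + 1; }
}  // namespace sse2

// Dispatch happens only here, outside the architecture-specific namespaces.
inline int Run(int x) {
  __builtin_cpu_init();
  return __builtin_cpu_supports("avx2") ? avx2::Kernel(x) : sse2::Kernel(x);
}
```

Inside `avx2::Kernel` the call to `avx2::Twice` is a plain call within one target, so there is no resolver between them.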

Feb 16 '21 23:02 kpu

@mibintc I really appreciate you stopping by!

Feb 17 '21 22:02 kpu

BTW I asked the technical support team to help out. I downloaded intgemm and skimmed the source for the target attribute. It looks like you are using conditional compilation to avoid using the target attribute for MSVC. You could lump __INTEL_COMPILER into that, thereby avoiding the use of the target attribute, since the target attribute isn't adding anything in the way that you are using it and is only causing pain. Even if we fix this issue in icc (i.e. by more or less ignoring the target attribute), due to the release cadence there would be a frustrating wait.

I also want to draw your attention to the next generation Intel C++ compiler (macro __INTEL_LLVM_COMPILER), which is clang based and available for installation without cost. In contrast to icc, the icx compiler does pay attention to the target attribute and it does force you to match the target to the intrinsics being used. icx/icpx is in this package: https://software.intel.com/content/www/us/en/develop/tools/oneapi.html

Feb 18 '21 12:02 mibintc

We are actually compiling with the (old) Intel compiler as of 5 January https://github.com/kpu/intgemm/commit/2647d6c129ccb1cef486628685bb80a85158459a (as noted above) and you can see the CI here https://github.com/kpu/intgemm/runs/1919056224 . It works by removing all target attributes from the Intel compiler, like it does for MSVC.

In theory the lack of target attributes prevents the compiler from generating its own instructions. In practice, we don't leave much room for the compiler to do that. Though it might happen for some loads where we just use * instead of an intrinsic.

We're aware of the llvm-based compiler and already support clang-type behavior.

By the way, why doesn't the Intel compiler support target("avx512bw")? That seems very off-brand for Intel.

Feb 18 '21 14:02 kpu

@anoopmad could you please help @kpu with the compiler specific questions that have been raised in this thread? AFAIK, the issue doesn't seem to appear while compiling with icx.

May 07 '21 14:05 akarbown
