
Clarifying naming of compute kernels prior to submission of PR

Open FCLC opened this issue 3 years ago • 10 comments

I'm currently implementing the FP16 kernels as a follow up to #3754 and #2767

I want to be 100% clear on OpenBLAS nomenclature prior to submitting my PR.

Baseline assumptions

Where name is XYgemm_kernel_NxM_UARCH.c

and

- X defines the returned data type
- Y defines the submitted (input) data type
- N defines the column dimension of the kernel, each column being the columnar access offset of a given register/vector
- M defines the in-register vector width (so for FP16 in a 512-bit vector lane, 32 entries), and also serves as the row dimension of the kernel

X and Y can be any of the following:

- s = single precision
- d = double precision
- b = bf16
- h = fp16
- c = complex single
- z = complex double
- null / no prefix

Also, for the sake of reference: ge is general, m is matrix, v is vector. E.g. sdgemv would be an operation that takes in double-precision floats in a matrix and a vector and returns results in single-precision format.

UARCH is either the first microarchitecture the compute kernel targets, or, if a kernel already exists but a newer UARCH has come out with extensions whose additional instructions can further optimize the routine, the name of that newer UARCH. In some cases the lake/bridge suffix is dropped (sandy instead of Sandy Bridge). Sometimes the consumer/workstation name is used instead of the architecture name (skylakex instead of skylake).

This is assumed from the following:

The project has 6 implementations of a x86_64 single precision (fp32) 16x4 kernel.

3 in c with inline asm, 3 in pure asm.

Looking only at the ASM versions we have: sgemm...sandy.S, where AVX(1) instructions were introduced; sgemm...haswell.S, where AVX2 instructions were introduced; and sgemm...skylakex.S, where AVX512 instructions were introduced.

Next:

If X is the same as Y, then Y is dropped and the name is shortened to Xgemm...; if N is the same as M, both are kept for the sake of clarity.

Something I'm unclear about is if N and M are expected to set the upper bound of a kernel, or if they're expected to always be a given size.

For example, is the 16x4 kernel also expected to deal with 8x2? Or is each GEMM blocking expected to be its own file/implementation?

From looking at the /kernel/x86_64/ folder I believe it's the latter, but I wanted to confirm.

Next:

Specific to my PR

Following this nomenclature, I believe my first two kernels will be a 32x4 fp16->fp16 AVX512 kernel using the new ISA, as well as a kernel leveraging AVX(1) and the F16C ISA extension to provide legacy support.

Kernel1: legacy

F16C and AVX were first introduced together on the Ivy Bridge microarchitecture; the kernel computes in 32-bit vector lanes, then converts the results back to fp16. AVX1 allows for YMM registers if we stick to floating-point mode, which we're doing, so all good on that front.

As such we can use YMM registers, but need to treat them in 32-bit increments. This would mean at most an Nx8 kernel, with a naming convention of: hgemm_kernel_NxM_ivy(bridge?).c

Where M could be 1, 2, 4 or 8 (but probably only 8), and N probably only 4, 8 and 16.

A potential issue is that the F16C spec defines that values be converted to fp32 before computation, computed, then converted back to fp16. The kernel would therefore only be applicable for testing/development purposes before sending work to modern systems capable of utilizing the AVX512 implementation.

Kernel 2: The modern implementation (read: the fast one)

In the case of the AVX512 implementation things are a little different. The first implementation in market was unofficial/unsanctioned via Alder Lake (see #3490), over a year in market prior to Sapphire Rapids, on the performance cores (named Golden Cove). It supports a proper FP16 implementation, therefore supporting 32 values in the 512-bit ZMM registers.

The compromise I've come to is to name the AVX512 versions:

`hgemm_kernel_NxM_goldencove.c`

Where the implementations will be M=4,8,16 and N = 16, 32.

Once all of these are done it may be viable/workable to create SH/HS as well as the FP16 complex kernels but that's for later.

FCLC avatar Sep 06 '22 17:09 FCLC

Correct on the naming scheme, and with each MxN blocked GEMM kernel implementation in its own file (and matching GEMM_UNROLL_M, GEMM_UNROLL_N parameters in respective cpu section of param.h), but only one of them required/active. (Where OpenBLAS has several for a particular cpu, this reflects development history)

Legacy implementation would best be in plain C "generic" kernel; if the standard prescribes moving to/from fp32 registers anyway, it is probably possible to add this into the existing gemm_kernel_2x2 (and icopy/ocopy helpers) like is done for bfloat16.

Not sure if it would be worth it to add an AVX kernel, but if you do, the "logical" name would be sandy(bridge), as Ivy Bridge was just a die shrink of the same (and consequently does not exist as an OpenBLAS TARGET).

For the AVX512 kernel, you could use the "spr" suffix introduced by guowangy's SBGEMM kernel for SapphireRapids. (Again, AlderLake does not exist as a TARGET in its own right, and Intel's "now you see 'em - now you don't" approach to AVX512 availability on that platform does not help either.)

martin-frbg avatar Sep 06 '22 18:09 martin-frbg

Correct on the naming scheme, and with each MxN blocked GEMM kernel implementation in its own file (and matching GEMM_UNROLL_M, GEMM_UNROLL_N parameters in respective cpu section of param.h), but only one of them required/active. (Where OpenBLAS has several for a particular cpu, this reflects development history)

To clarify, this means that for skylake, kernels (implied 1), 2 and 3 are compiled, but only 3 is used as it is the fastest most modern implementation? As such, if in a year from now a new implementation of the hgemm... were to come along, it would be named hgemm...2 instead of supplanting/replacing the current file?

Legacy implementation would best be in plain C "generic" kernel, if standard prescribes moving to/from fp32 registers anyway it is probably possible to add this into the existing gemm_kernel_2x2 (and icopy/ocopy helpers) like is done for bfloat16.

This may not be possible without mandating gcc-12/clang-15, since those are the first compilers to officially support the _Float16 type natively per the IEEE 754 fp16 spec.

see https://releases.llvm.org/15.0.0/tools/clang/docs/ReleaseNotes.html#x86-support-in-clang and https://www.gnu.org/software/gcc/gcc-12/changes.html (under the IA-32/x86-64 target).

In mixed C+asm we can get away with faking the types to a certain extent, but it won't be the cleanest thing in the world.

In pure asm we can probably keep it pretty clean.

Not sure if it would be worth it to add an AVX kernel, but if you do the "logical" name would be sandy(bridge) as Ivy Bridge was just a die shrink of the same (and consequently does not exist as an OpenBLAS TARGET)

Unfortunately Sandy Bridge cannot be the target, as it lacks the F16C (convert to 16-bit float) instructions; they were introduced with Ivy Bridge. They work on YMM and XMM registers in AVX1 but were not updated for AVX2 (since YMM floating point was already supported in AVX1), therefore we could name it after Haswell, but that's inaccurate, and Sandy Bridge would be unsupported.

See https://en.wikipedia.org/wiki/F16C#CPUs_with_F16C for more information on supported chips

Perhaps name it ivy for accuracy but link it to Haswell targets and up?

For the AVX512 kernel, you could use the "spr" suffix introduced by guowangy's SBGEMM kernel for SapphireRapids. (Again, AlderLake does not exist as a TARGET in its own right and Intel's "now you see 'em - now you don't" approach to AVX512 availability on that platform does not help either)

Yeah, I remember those discussions in the PR for #3527 and in #3490 amongst others.

(speaking of which, might be time to break out spr once these are in and #3754 is reviewed)

FCLC avatar Sep 06 '22 20:09 FCLC

To clarify, this means that for skylake, kernels (implied 1), 2 and 3 are compiled, but only 3 is used as it is the fastest most modern implementation?

No, only the one named in KERNEL.SKYLAKEX is compiled - these KERNEL.cpuname files contain definitions in Makefile syntax and are included from the Makefiles in kernel/ (and the cmake build files contain a simple parser that extracts the same information from them). There is no strict standard regarding "evolution" of kernels - for small changes, it is not unusual to make them in place. When introducing a new one with "better" blocking however, it is just more convenient to keep the old kernel around instead of deleting it - it may support some corner case overlooked in the new one, or it may still be appropriate for some related cpu.
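A hypothetical fragment illustrating that mechanism (the fp16 slot name and file name are made up; the sgemm line follows the existing XYgemm_kernel_NxM_UARCH convention mentioned earlier in the thread):

```make
# Hypothetical excerpt of a KERNEL.cpuname file (Makefile syntax): each slot
# names exactly one source file, so only that implementation gets compiled.
SGEMMKERNEL  = sgemm_kernel_16x4_skylakex.S
HGEMMKERNEL  = hgemm_kernel_32x4_spr.c   # hypothetical fp16 slot and file
```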

Regarding the "old hardware" kernel, it is a bit unfortunate if it cannot be provided for all (x86_64) architectures (think DYNAMIC_ARCH builds). But given that Haswell is nine years old already, your "ivy" kernel will probably be sufficient.

martin-frbg avatar Sep 07 '22 13:09 martin-frbg

No, only the one named in KERNEL.SKYLAKEX is compiled - these KERNEL.cpuname files contain definitions in Makefile syntax and are included from the Makefiles in kernel/ (and the cmake build files contain a simple parser that extracts the same information from them). There is no strict standard regarding "evolution" of kernels - for small changes, it is not unusual to make them in place. When introducing a new one with "better" blocking however, it is just more convenient to keep the old kernel around instead of deleting it - it may support some corner case overlooked in the new one, or it may still be appropriate for some related cpu.

Ok that makes more sense, thank you.

Regarding the "old hardware" kernel, it is a bit unfortunate if it cannot be provided for all (x86_64) architectures (think DYNAMIC_ARCH builds). But given that Haswell is nine years old already, your "ivy" kernel will probably be sufficient.

For the sake of (exhaustive) clarity, there is a way it can be done using SSE2, but it requires multiple rounding operations per set of vectors (on the order of five or six per SIMD lane, so slow as to be pointless); it uses the XMM registers in FP32 mode and has to do individual bit manipulations per entry.

AKA it is basically useless.

Technically it can be used with gcc-12/clang-15+ using the _Float16 keyword, but this unfortunately falls back on the SSE2 implementation for all chips that do not support AVX512 FP16, including those that could shortcut using F16C.

TL;DR:

If we enforce gcc-12/clang-15+ we can have a portable C-based FP16 BLAS kernel that works on everything that supports SSE2 and up. This implementation will be very slow, but it will be something. It does seem weird that someone with a decade-plus-old CPU would be running the latest and greatest compiler and have this need, but edge cases do exist.

Ancient hardware, leading edge SW

For all chips since Ivy Bridge (and therefore compilers released around that time, ~2012) we can use AVX+F16C to get OK but not amazing performance characteristics. It will, however, work on pretty much anything from the last ten years. Ivy Bridge+

old to new hardware old to new software

For any AVX512 FP16 enabled chips and OpenBLAS installs using clang-14+ or gcc-12+ we can use an AVX512 FP16 kernel, which is wicked fast. SPR+/Golden Cove AVX512+

bleeding/leading edge hardware and software

FCLC avatar Sep 07 '22 14:09 FCLC

Hey @martin-frbg, trying to get ahead of things: is there a convention for what we'd call a half-float complex-domain kernel?

Currently we have zgemm for double complex and cgemm for single-precision complex.

hgemm for half precision normals has been decided on.

I'd like to propose y for fp16 complex, which would take the form of ygemm, ygemv, ydot etc.

Searching through various archives (arXiv, Google, DuckDuckGo, GitHub and so on) only yielded a single result for the word ygemm being used, specifically https://github.com/nwchemgit/nwchem/issues/90

I've tagged the assigned dev on another service and am hoping for a reply. I am also reaching out to friend/contacts in HPC and posting publicly to see if a name does already exist, and if so what it is.

This can serve as a temp until such a time as an official name is chosen.

FCLC avatar Sep 10 '22 19:09 FCLC

None that I know of (and IIRC #2767 did not help in this regard but I think either w or y would be available - again IIRC nobody opposed my half-serious proposal of v for bf16 complex). Not sure if the "ygemm" in that nwchem ticket was more than a typo (e.g. Dvorak keyboards have D and Y rather close together and the code snippet replied to had dgemm encapsulated as nwdgemm )

martin-frbg avatar Sep 10 '22 20:09 martin-frbg

None that I know of (and IIRC #2767 did not help in this regard but I think either w or y would be available - again IIRC nobody opposed my half-serious proposal of v for bf16 complex). Not sure if the "ygemm" in that nwchem ticket was more than a typo (e.g. Dvorak keyboards have D and Y rather close together and the code snippet replied to had dgemm encapsulated as nwdgemm )

Alright, in that case I'll adopt ygemm for complex half floats.

Using v for bfloat complex is fine, but the x86 ISA doesn't have any instructions specific to bfloat like it does for doubles, singles and half floats. Don't think NEON/SVE/SME on the Arm side does either.

What's the timeline I should be working towards, so there's time to review the kernels before the next release?

FCLC avatar Sep 10 '22 21:09 FCLC

No date set for a next release yet; possibly late November to keep to some half-decent schedule, but it may drag into next year depending on spare time. OTOH if all goes well your kernels should not be able to break anything not of your own doing anyway, and the interface codes and Makefile stuff can be copy-pasted from what was put in place for bfloat16.

martin-frbg avatar Sep 10 '22 22:09 martin-frbg

No date set for a next release yet; possibly late November to keep to some half-decent schedule, but it may drag into next year depending on spare time. OTOH if all goes well your kernels should not be able to break anything not of your own doing anyway, and the interface codes and Makefile stuff can be copy-pasted from what was put in place for bfloat16.

You should get an email from me addressed to the folks who maintain the BLAS spec.

There might be a slight snafu to deal with re: the H prefix. It seems the spec currently makes use of that prefix for Hermitian variants of matrix multiplies. Technically this is a sub-prefix, always following the primary prefix (see https://netlib.org/blas/#_blas_routines for more), but it could prove problematic.

FCLC avatar Sep 11 '22 19:09 FCLC

If you review #2767 I think you will find that future function naming is expected to be much more verbose than was possible in the early days of Fortran 66, and there did not seem to be much interest in old-style names going forward. So I guess anything that does not directly conflict with an established name should be acceptable (and the level of demand for FP16 seems to be open to debate as well, while BFLOAT16 seems to have large parts of the AI community behind it)

martin-frbg avatar Sep 11 '22 21:09 martin-frbg