Add support for float16 (half-precision floats) and related operations such as hgemm()
I am using BLIS for neural networks on embedded platforms (mostly ARMv8a), and I would like to reap the potential memory savings as well as possibly some speedups from running with half-precision floats. Are there any plans to support these in BLIS?
@jacobgorm Thanks for the suggestion. This is something that is in our medium-range plans. Of course, as you probably already know, the complicating factor is that there is no standard C language support for a `float16` datatype, so any solution would necessarily not be portable. (In principle, we can add `float16` operations, but it would take a non-trivial amount of work. Also, we would need to design things so that the user could disable the system-specific `float16` support if it were not available.)
Some useful information from https://github.com/mpi-forum/mpi-issues/issues/65:
- Half-precision floating-point format on Wikipedia.
- ISO/IEC JTC 1/SC 22/WG14 N1945 (ISO C proposal)
- ISO/IEC JTC1 SC22 WG14 N2017 (ISO C++ proposal)
- GCC documentation for Half-Precision Floating Point and Additional Floating Types (e.g. `_Float16`)
- Clang/LLVM `_Float16` support for C/C++ commit
- Intel® Half-Precision Floating-Point Format Conversion Instructions
- Performance Benefits of Half Precision Floats
@jeffhammond Thank you for taking the time to rustle up these links, Jeff. This will surely prove very useful.
I recommend that BLIS not support float16 but rather bfloat16. The latest research in machine learning suggests that float16 is inferior to bfloat16 for training because of the software and processing overhead required to handle the limited numerical range of its 5-bit exponent.
In any case, implementing both float16 and bfloat16 on hardware that doesn't have native support is relatively easy. In both cases, you do the compute in float32. For float16, you can use the AVX conversion instructions (`vcvtph2ps`/`vcvtps2ph`) to convert between float16 storage and float32, and then do the compute as you would for float32 (the latency is 4-7 cycles in the documentation I've found online). For bfloat16, the conversion is trivial: you just copy the bfloat16 bits into the upper half of a float32 register and proceed as before.
It might be possible to reuse the float32 microkernel.
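To make the emulation concrete, here is a minimal sketch of the two conversions described above, assuming an x86 target with the F16C extension for the IEEE-half case; the function names are illustrative and not part of BLIS.

```c
#include <stdint.h>
#include <string.h>
#include <immintrin.h>  /* F16C intrinsics; compile with -mf16c on x86 */

/* bfloat16 -> float32: the 16 stored bits become the upper half of the
 * 32-bit word, and the lower 16 mantissa bits are zeroed. */
static inline float bf16_to_f32(uint16_t b)
{
    uint32_t u = (uint32_t)b << 16;
    float f;
    memcpy(&f, &u, sizeof(f));
    return f;
}

/* IEEE float16 -> float32 for 8 elements at a time via vcvtph2ps;
 * the reverse direction (for storing results) uses vcvtps2ph. */
static inline __m256 f16x8_to_f32x8(const uint16_t *src)
{
    __m128i h = _mm_loadu_si128((const __m128i *)src);
    return _mm256_cvtph_ps(h);
}
```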
Google recommends bfloat16 with TensorFlow, and it is relatively straightforward to see why an 8-bit exponent like float32's is a better use of bits than the 5-bit exponent used by IEEE float16.
Intel's public statement on bfloat16 is:
Over time, Intel will be extending bfloat16 support across our AI product lines, including Intel Xeon processors and Intel FPGAs. This is part of a cohesive and comprehensive strategy to bring leading AI training capabilities to our silicon portfolio.
Disclaimer: I work for Intel.
Additional references:
- https://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats
- https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=vcvtps2ph
- http://developer.amd.com/wordpress/media/2012/03/47414_15h_sw_opt_guide.pdf
@jeffhammond Once again, this was very helpful Jeff. Thank you.
I had never even heard of `bfloat16` before today. I can see why it would be preferable (especially for ML/AI applications) given the trade-off between exponent and mantissa.
Yes, `bfloat16` is all the rage for inference right now for deciding which bucket to put something in. It's also worth mentioning the 8-bit integer quantization approach taken by https://github.com/google/gemmlowp.
Disclaimer: I sit next to the author at work.
int8 and int16 are usually employed for inference, although I'm aware of some efforts to use them in training. Not sure whether it's worth the software pain, though.
ARMv8.2 defined instructions for FP16 (IEEE format) computation. These are natively supported in Cortex-A55 and Cortex-A75 cores (e.g., in the Snapdragon 845), with the same per-instruction throughput as FP32 and therefore 2x the FLOPS.
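For reference, a minimal sketch of what native FP16 arithmetic looks like with NEON intrinsics on an ARMv8.2-A core, assuming the compiler is invoked with something like `-march=armv8.2-a+fp16`; the function name is illustrative and not BLIS code.

```c
#include <arm_neon.h>

#ifdef __ARM_FEATURE_FP16_VECTOR_ARITHMETIC
/* acc += a * b on 8 half-precision lanes per instruction (FMLA, vector FP16),
 * i.e. twice as many lanes per vector as FP32. */
static inline float16x8_t fp16_fma8(float16x8_t acc, float16x8_t a, float16x8_t b)
{
    return vfmaq_f16(acc, a, b);
}
#endif
```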
hi again. Are you guys still considering adding half-precision support to BLIS? FWIW there does seem to be a bit of a hole in the market for a portable LA library that supports this. I know of FBGEMM from Facebook, but it is x86-only and uses a scary JIT, and last I tested the ARM Compute Library's GEMM it was really slow compared to BLIS. CLBlast is nice, but only works with OpenCL.
https://arxiv.org/pdf/1904.06376.pdf ("Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations") is relevant reading for anyone following this thread.
@jacobgorm I have spoken to @dnparikh and @fgvanzee about this on a number of occasions and I am confident that this is a priority for them.
@fgvanzee I'd like to recant my prior comment in https://github.com/flame/blis/issues/234#issuecomment-405753540. For quantum chemistry, float16 might end up being more interesting. We are still studying this but it is ideal to have both for our experiments.
Intel published the BF16 ISA in the April 2019 update (319433-036) of the Intel® Architecture Instruction Set Extensions and Future Features Programming Reference.
There is an unofficial synopsis on AnandTech for those who don't want to search the 149-page PDF.
> @fgvanzee I'd like to recant my prior comment in #234 (comment). For quantum chemistry, float16 might end up being more interesting. We are still studying this but it is ideal to have both for our experiments.
I'm trying to imagine what could have changed (what observations you could have made) that would flip the polarity on this issue. (You need those extra three bits of mantissa after all?)
Jacob,
Investigating bfloat16 is on our priority list. We are waiting for word on funding from a sponsor, which may bump it higher.
Robert
> I'm trying to imagine what could have changed (what observations you could have made) that would flip the polarity on this issue. (You need those extra three bits of mantissa after all?)
We don't need the exponent bits, so why not use them for the mantissa?
> We don't need the exponent bits, so why not use them for the mantissa?
Touché. Anyhow, I'm less concerned with what people want than I am with whether there is basic support for the datatype in either the compiler or the ISA (or both).
Clang now has experimental `_Float16` support, but only on ARM: https://clang.llvm.org/docs/LanguageExtensions.html.
Sounds like ARM should sponsor this effort, so we can bump it up on our priority list! :-).
Thank you for sharing.
@jacobgorm https://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point also says
> `__fp16` is supported on every target, as it is purely a storage format; see below.
and
> `__fp16` is a storage and interchange format only. This means that values of `__fp16` are immediately promoted to (at least) `float` when used in arithmetic operations...
I would argue that BLIS should use a `typedef` to support either format as input data.
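As a rough illustration of the typedef idea (the `blis_half` name and the feature checks here are hypothetical, not an actual BLIS interface), the storage type could be selected at compile time:

```c
#include <stdint.h>

#if defined(__FLT16_MANT_DIG__)        /* compiler provides _Float16 (arithmetic type) */
typedef _Float16 blis_half;
#elif defined(__ARM_FP16_FORMAT_IEEE)  /* compiler provides __fp16 (storage-only type) */
typedef __fp16 blis_half;
#else
typedef uint16_t blis_half;            /* raw 16-bit storage; convert explicitly */
#endif
```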
@jeffhammond the advantage to the library developer of having `_Float16` in the compiler is that it does not promote to float, which should make initial development easier. I agree that the external interface could just as well be `__fp16`.
@jacobgorm Yes, of course, but since I work for Intel, I have an interest in implementing something that is not restricted to ARM architectures 😃 In any case, since BLIS is going to do all the important math explicitly in the microkernel, compiler promotion shouldn't be a major issue.
> In any case, since BLIS is going to do all the important math explicitly in the microkernel, compiler promotion shouldn't be a major issue.
Let's all remember that BLIS allows the user to do more than level-3 operations! My goal is for full operation support for float16 (or bfloat16), even if the implementation is sub-optimal. So the issues around float16 and the compiler are very much important to me (even if efficiency is not).
So far as I'm aware, there isn't a standardized calling convention for `_Float16` on Intel, or at least if there is, my version of clang doesn't have it yet. As such we can't pass data by value, which makes things a little messy (and using `__fp16` would imply we worked in `__fp16` rather than `_Float16`).
I also wanted to request reduced-precision support. I think it would be valuable to add both IEEE 754's FP16 and bfloat16, as the former has major issues for training ML.
P.S.: There is also a new TF32 format from Nvidia: https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/
@amirgholami BLIS doesn't support GPUs, but TF32 is just a form of 19-bit floating-point with 32b data. In the absence of hardware support, there is no upside versus SGEMM. In the presence of hardware support, the implementation is going to be the same as SGEMM but with a different microkernel, except for the loss of accuracy in the results, of course.
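For anyone curious what that amounts to in software, here is an illustrative (non-BLIS) sketch that emulates TF32 precision by simply truncating a float32's mantissa to 10 bits; real hardware rounds rather than truncates, so this is only an approximation.

```c
#include <stdint.h>
#include <string.h>

/* Keep the sign bit, the 8 exponent bits, and the top 10 mantissa bits;
 * zero the remaining 13 mantissa bits. */
static inline float truncate_to_tf32(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof(u));
    u &= 0xFFFFE000u;
    memcpy(&x, &u, sizeof(x));
    return x;
}
```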
Hey @jeffhammond
Yes, I am aware that TF32 is supported on the Ampere architecture. I mentioned it as evidence that there is still a lot of active research on low-precision arithmetic. On that note, I should also mention MSFP8 and MSFP11, which are from Microsoft and are used in their Brainwave FPGA project.
Aside from the above formats, which are relatively new, there are a lot of different LA algorithms that have already incorporated FP16 or bfloat16 (for example, as preconditioners), and it would be great if BLIS supported them.
P.S.: Regarding hardware support, Intel Cooper Lake, which was announced last month, has support for bfloat16 arithmetic.
The amd/blis fork adds an aocl_gemm addon, which adds bf16 support to gemm on BF16-capable CPUs and a set of functions for s8/u8 gemm on VNNI-capable CPUs. It also adds support for ReLU/GeLU/Downscale/CLIP post-ops.
Merging the amd/blis changes is discussed in #770.