blis icon indicating copy to clipboard operation
blis copied to clipboard

Armv6vfp config and and Arm32vfp sgemm kernel

Open kali opened this issue 6 years ago • 32 comments

Hey folks,

This is a minimum viable configuration for an optimized sgemm on armv6 with vfpv2, aka vfp. Devices like raspberry pi and pi0. On a very unscientific benchmark on a pi0, it's about one order of magnitude faster than the generic implementation, and about 15% faster than Openblas.

The kernel comes from the armv7 kernel set, it turns out it only needs vfp to work. I think the armv7 configurations wer not using it, actually, but the intrinsic neon one instead (neon is faster than vfp). I think the other .S kernels in armv7 would need at least vfpv3 so I have only hijacked the sgemm one. We may consider giving them a separate configuration...

By the way, I'm on a quest for a reasonably good sgemm on a wide variety of arm things, with cpuid-like detection, so there should be followups. Tell me what you think.

kali avatar Sep 13 '18 14:09 kali

I realize there are a few issues with the kernel actually. First of all it does not deal with the zero beta case. Also, I gathered from other conversations that the preferred approach for avoiding the ABI complexity is to use inline asm (or intrinsics), so I'll try and convert this kernel to an inline form...

So let's say this is a "do-not-merge-yet PR", but I'm still interested in knowing whet you think (or other pitfalls I have not seen).

kali avatar Sep 13 '18 16:09 kali

@kali Thanks for taking this initiative, Mathieu. However, for the uninitiated among us, including myself, can you explain what "vfp" is? I'm a little shaky on my ARM nomenclature.

fgvanzee avatar Sep 13 '18 17:09 fgvanzee

vfp == Vector Floating Point

dnparikh avatar Sep 13 '18 18:09 dnparikh

Once again, I'm no expert. Just willing to learn.

@dnparikh is right. But the emphasis must be put on the Floating Point part. Armv5, 6 and 7 spec do not necessarily have a FPU. So blis generic arm32 configuration uses software floating point stubs from the compiler/libc, and gets terrible performance. Just compiling with fpu=vfp would actually make a big difference.

vfp has actually several variants, which ultimately leads to "neon", the current simd implementation included in arm designs.

vfp2 (also called vfp as vfp1 was never implemented) has 32 f32 registers. vfp3 adds 32 more, but as far as I can tell, there is no significant population of processor that has vfp3 without being neon-compatible, so i'm not too sure it makes sense to have a separate vfp3 config. The cortexa9 and a15 processor both actually have neon. The armv7 kernel set should probably be better named armv7neon (because armv7 can come with a VFPv2 instead of a neon FPU...)

With armv8 things are a bit simpler as neon is part of the requirement. Except an armv8 can run in aarch32 mode (it accepts armv7 and neon stuff) or aarch64 (64bit, doubling the number of neon registers).

It's a bit complicated. This page gives some details with a kernel and operating system perspective which is not necessarily right for blis, but it has a remarkably high density of information. https://wiki.debian.org/ArmHardFloatPort

kali avatar Sep 13 '18 19:09 kali

@kali Thanks for all of that information, Mathieu. I like the idea of renaming the armv7a kernel set. I also like that you called it a "kernel set," as it makes me think that some people are actually reading my documentation! :)

fgvanzee avatar Sep 13 '18 19:09 fgvanzee

@fgvanzee Hey, I think this patch is actually in a good shape to be merged now. I have done the support for the ß=0 case...

Would you prefer me to tackle renaming the armv7a kernel set right here or do you prefer a separate PR ?

kali avatar Oct 01 '18 12:10 kali

@kali I'm fine with you combining the renaming into this PR (if you like). Before doing so, however, would you mind laying out your proposed renaming scheme here so I and others can comment?

fgvanzee avatar Oct 01 '18 18:10 fgvanzee

What this already PR does:

  • creates armv6vfp kernel set (sgemm only, adapted from an orphan semi working kernel in armv7a)
  • creates arm32vfp configuration using that kernel

What I think we should do on top of that, possibly in this PR

  • rename the armv7a kernel set to armv7neon
  • create an arm32neon configuration using it

More likely in one ore more separate PR:

  • add armv6vfp and armv7neon to the arm32 family
  • fix cpu identification for arm32 family to select between generic / armv6vfp / armv7neon (should we try and integrate cpuinfo as mentioned in #251 ? as a git submodule ?)

Later, optionally, if it makes sense for somebody:

  • creates armv6vfpv3 (or armv7vfp3, not too sure if armv6 vfp3 even make sense) kernel set from the .S kernels in what is still called the armv7a set
  • add a arm32vfp3 configuration, plug it in arm32 with cpuid

kali avatar Oct 02 '18 07:10 kali

The convention in BLIS has been to have configuration names and kernel names that reflect the architecture not the instruction set e.g. the Intel family we have configuration names of haswell, skx, and the corresponding kernel sets haswell, skx. And then machines like Skylake desktop, or Skylake servers that have only 1 VPU, the cpuinfo identifies that and registers the haswell kernels because performance wise it makes sense to do so.

Similarly for ARM, I think, we should still keep the kernel directories as armv6m, armv7a, armv8a, but have separate kernel files with the appropriate suffix (e.g. bli_dgemm_armv6_asm_vfp, or bli_dgemm_armv7_asm_neon). The configuration name should reflect the actual core/architecture name (e.g. cortexa57, cortexm0), and then there should be a family configuration arm32, arm64, or armv6m, armv7a (compare to x86_64). Moreover, the auto configuration would hopefully detect these nuances and register the right configuration.

@devinamatthews, and @figual are the experts for Intel and ARM architectures, so they might have an opinion as well.

dnparikh avatar Oct 02 '18 17:10 dnparikh

I think one of the issues here is that in the x86 world, the floating point unit (or simd extensions) are part of the microarchitecture definition. When you say "haswell" you know what to expect in terms of features, up to the SSE variant and presence of AVX.

Unfortunately armv7a do not specify such a thing. The FPU unit can be absent (for configurations which arguably have little interest to blis), VFP (which itself can be one of at least 2 significative variations), or neon. So I am afraid armv7a is much closer to a configuration family than an actual kernel set.

On top of that, if we want to stick to armv7a as a kernel set name, I think we will have to use different compiler options for different files in the kernel set. I'm not sure I have seen a provision for that...

Basically, my idea was that the most important thing to write microkernels are the FPU/SIMD capabilities. On Intel they are directly mapped from the microarchi name, but see https://en.wikipedia.org/wiki/Comparison_of_ARMv7-A_cores . The list is very non-exhaustive. Not mentioning the fact that FPU can be optional, there are two variants of VFP or two variants of neon to consider...

kali avatar Oct 04 '18 07:10 kali

@kali Thanks for your comments. I meant to chime in on this thread earlier. Sorry for the delay.

Unfortunately armv7a do not specify such a thing. The FPU unit can be absent (for configurations which arguably have little interest to blis), VFP (which itself can be one of at least 2 significative variations), or neon. So I am afraid armv7a is much closer to a configuration family than an actual kernel set.

On top of that, if we want to stick to armv7a as a kernel set name, I think we will have to use different compiler options for different files in the kernel set. I'm not sure I have seen a provision for that...

You are right that the entire kernel set is compiled with one set of compiler options, and those compiler options are determined by (the make_defs.mk in) the configuration.

Let's momentarily entertain a different solution that fits a little more naturally with the way BLIS configurations and kernels are currently managed: instead of multiple kernel sets, consider what would happen if we had a single armv7a kernel set with multiple configurations available. Suppose this armv7a kernel set contained both neon and [some variant of] VFP kernels, with each set of files named with a different infix. (Note/recall that when a kernel set is needed by a configuration foobar, all kernels inside will be compiled when targeting foobar, even if only a subset are actually registered for use by the bli_cntx_init_foobar() function.) What would happen if we ran configure targeting a configuration named cortex-ax that used neon-specific compiler flags? Presumably the neon kernels would be compiled as expected, but what would happen when the VFP kernels were compiled (with neon flags)? a.) Compiler error? b.) Runtime error? or c.) just suboptimally-built kernels? If the answer is (a), then this hypothetical solution won't work. But if the answer is (b) or (c), then it's fine because those VFP kernels will never be used by cortex-ax.

(I'm not ruling out your solution yet, but I want to consider others before settling on a final decision.)

fgvanzee avatar Oct 04 '18 21:10 fgvanzee

I don't mind considering alternatives. I have only been a contributor for a short period of time, and there is so much I don't know... My proposal was mostly meant to get the ball rolling.

Beyond the "do we care about unreachable code compiled with wrong options" question you raise, how would this work when targetting the whole arm32 family ? We would have several compilation-options variants of the same kernel functions in several .o files and would need to select the one which make sense at linking time, i think...

kali avatar Oct 05 '18 07:10 kali

Beyond the "do we care about unreachable code compiled with wrong options" question you raise, how would this work when targetting the whole arm32 family ? We would have several compilation-options variants of the same kernel functions in several .o files and would need to select the one which make sense at linking time, i think...

Thanks for raising this question, Matthieu.

First, some background. Suppose we are configuring to target a configuration family arm32, and arm32 consists of multiple sub-configurations x, y, and z. If x, y, and z all depend on the same kernel set k, we have to decide which set of compiler flags to use when compiling the kernels in k. In order to help us do this, we build what the in configure we call the "kernel-to-config" map (kconfig_map in configure). While the configuration list (config_list) tells us which sub-configurations belong within a configuration family, the kernel-to-config map tells us which sub-configurations depend on any given kernel set. Once the kernel-to-config map is built, we use the following rules to settle on a single sub-configuration (for the purposes of determining which compiler flags to use when compiling the kernels):

  1. If the list of configurations contains only one entry, use that name.
  2. Else, if the list contains a sub-configuration name that matches the kernel name, use that name.
  3. Otherwise, use the first name in the list.

In the above example, the config_list would be arm32: x y z and the kconfig_map would be k: x y z, with the kernels ultimately being compiled based on the flags specified by sub-configuration x.

It sounds like you envision building a configuration family where two or more included sub-configurations use the same kernel set, but the question of which compiler flags to use for either sub-config is a matter of need and not optimality. Is that correct?

fgvanzee avatar Oct 05 '18 19:10 fgvanzee

Hey @fgvanzee , sorry for the delay, I was trying to figure out what I really need. Ideally, what I would like to have in one single build, with cpuid-like runtime picking the most advanced of:

  1. the generic C kernels and framing, compiled without vfp
  2. the generic C kernels and framing, compiled with -mfpu=vfp
  3. the vfpv2-compatible kernels compiled with -mfpu=vfp (only the sgemm one I just adapted right now)
  4. the vfpv3 kernels compiled with -mfpu=vfpv3 (the .S from the current armv7a)
  5. the neon kernels compiled with -mfpu=neon (the intrinsics sgemm and dgemm from current armv7)

As of today, only sgemm and only at-least-vfp-devices are useful to me, so 1 could be dispensed with, and 4 could be aggregated with 5.

So think we at least need to compile and assemble 3 and 5 with different options. Even if 3 may work now with neon enabled (I have not tried it), as the assembly code should be unaffected, I would not bet on no compiler never doing anything breaking the function code.

So I think our options are

  • putting 3 and 5 in different kernel sets
  • making it possible to compile different file in a kernel set with different options
  • compile/assemble a kernel set several times, each time with a different set of options, and linking the right one
  • compile/assemble a kernel set several times, each time with a different set of options, linking them all and picking the right one at runtime

kali avatar Oct 09 '18 08:10 kali

@kali Sorry for the delay. We've been swamped.

Thanks for clearly laying out our choices. I think the added functionality described in 2nd, 3rd, and 4th bullets would go vastly under-utilized outside of ARM if I were to implement it. And I really don't want to implement it. So the 1st bullet is my preferred choice.

Also, I think (2) will be facilitated by enabling architecture-specific flags for use with the reference kernels in the generic configuration (discussed in issue #259).

fgvanzee avatar Oct 11 '18 18:10 fgvanzee

@kali BTW, I asked @figual to review this thread. He agrees that we could use separate kernel sets for vpu and neon. Let me know when you're ready to proceed. (There's no rush on our end.)

fgvanzee avatar Oct 17 '18 19:10 fgvanzee

Good to know that it makes sense for arm specialists. I'm pretty busy with other stuff right now and prefer not to loose too much focus, and as it looks like nobody is in a rush, I'll come back to this later in a few days. On Wed, Oct 17, 2018, at 21:14, Field G. Van Zee wrote:

@kali[1] BTW, I asked @figual[2] to review this thread. He agrees that we could use separate kernel sets for vpu and neon. Let me know when you're ready to proceed. (There's no rush on our end.)> — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub[3], or mute the thread[4].

Links:

  1. https://github.com/kali
  2. https://github.com/figual
  3. https://github.com/flame/blis/pull/250#issuecomment-430753798
  4. https://github.com/notifications/unsubscribe-auth/AADRmTc6DRmiR3GjKWNAYgLTyzgrT_bLks5ul4GogaJpZM4WnYpr

kali avatar Oct 17 '18 20:10 kali

I also like the idea of separating NEON and VFP kernels. Especially for Arm v7, if I'm not wrong, NEON only works for single-precision but not for double-precision. So ideally we should have a NEON kernel for SGEMM but a VFP kernel for DGEMM.

By the way, I noticed for the intrinsic version of Arm v7 kernel, DGEMM is actually plain C. Then in the compiler flag for cortexa9 and cortexa15, we have -mfpu=neon. Does this allow generation of VFP instructions from the plain C kernel? Has anyone tested the performance of DGEMM in these configurations?

koallen avatar Oct 19 '18 09:10 koallen

@koallen Mmm I'm not aware of a "single precision" limitation for armv7 neon. But it would not be vector anymore as the registers are 64bits... which may just explain why the DGEMM kernel is in plain C.

kali avatar Nov 05 '18 12:11 kali

@fgvanzee, as discussed, I added the armv7 neon kernel set renaming on top of the PR.

I'm no longer 100% sure what adding a arm32neon family would accomplish, so I think I'm quite happy with that for now.

kali avatar Nov 05 '18 14:11 kali

@kali Are you ready for me and others to review what you've done? It's fine if you'd like more time.

fgvanzee avatar Nov 05 '18 18:11 fgvanzee

Well, I just fixed a remaining oopsie that should not have been committed, but I think it's now in good shape, and offers a net gain on armv6 platforms with vfp. So yeah, please proceed.

kali avatar Nov 06 '18 09:11 kali

Thanks @kali. I'm taking some time off, and we're also preparing a new journal article submission. It may be later this week or early next when I take a look at it.

fgvanzee avatar Nov 06 '18 19:11 fgvanzee

@kali Another update: I'm getting slammed with high-priority items, so it'll be another bit before I can review this PR. Thanks for your patience.

fgvanzee avatar Nov 13 '18 19:11 fgvanzee

@fgvanzee no problem.

kali avatar Nov 13 '18 19:11 kali

Any update on this? I think Debian's armhf needs this patchset since NEON violates the ISA baseline... https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=919710 I'll have to temporarily switch to generic kernels for the armhf port.

cdluminate avatar Jan 19 '19 01:01 cdluminate

Hey! Well, I've done as much as I can on this issue at this point. I am no arm expert, so having somebody review this would be great.

The late october changes in tests made some bug appear, but I fixed them during the holidays (@fgvanzee actually, this showed that only the trsm test truly exercise the gemm kernels with small dimensions with default configuration...). I feel like I'm getting better at this, but once again, an expert review would be nice.

kali avatar Jan 19 '19 08:01 kali

@cdluminate And yes this configuration and kernel should work for debian/armhf and raspbian. The cache sizes will not be optimum though, as they can vary greatly from one processor to the other... This is one more thing that we probably have to detect at runtime if we want a "build once run everywhere" blis on arm...

kali avatar Jan 19 '19 08:01 kali

(@fgvanzee actually, this showed that only the trsm test truly exercise the gemm kernels with small dimensions with default configuration...).

Are you referring to the fact that, because of the high-level transposition of the operation in bli_gemm_front.c to bring the storage format of C into agreement with the storage preference of the microkernel, the general-stride IO case of microkernels is not exercised (except in trsm)?

fgvanzee avatar Jan 19 '19 21:01 fgvanzee

@fgvanzee mmm... I realise now, looking back at the fixes I did three weeks ago, that I got confused on the way somewhere. The general idea being, it would be nice to have a part of the test suite dedicated to exercising micro kernels in as many way as practical.

Looking back at the latest fix i did, it is not completely clear how making the trsm tests stronger exhibhited bugs that where not visible running gemm tests. Part of it was the pre-existing of NaN on the fp registers, and the initial code was certainly not dealing with that correctly.

As for the missing "memory" clobber flag, I think it's more random luck that one test finally triggered it (once in a while).

So... yeah. It would be nice to have general strides and as many corner cases covered by specific unit tests right on top of the ukernel interface. But in all fairness, I'm not sure they would have helped me in finding these two bugs.

kali avatar Jan 20 '19 12:01 kali