Arm Kernels & Configurations

Open xrq-phys opened this issue 4 years ago • 5 comments

  • [x] Arm64 (Cortex-A53 or higher, ThunderX2, etc.)
  • [x] Arm32 (Cortex-A9, Cortex-A15, etc.)
  • [x] Arm64+SVE (A64fx, Neoverse, etc.)

Comments

  • The method from flame/blis#344 for determining the Arm64 implementation is (partly) adopted here.
  • ~I'm creating configs armv7a and armv8a because config cortexa53 is roughly the same as cortexa57 in BLIS config (similarly: cortexa9 is close to cortexa15). So is it good to now remove cortex-a9 and cortex-a15 lines in configure.ac? Or should I rename current armv8a to cortexa53 to reproduce the current BLIS layout?~ See a2a42ce comments.

xrq-phys avatar Apr 15 '21 14:04 xrq-phys

@xrq-phys Thanks for this. armv8a is fine as long as it is generic enough across those uarchs (e.g. do Cortex-A53 and ThunderX2 share blocking parameters?). The only confusion is with A64fx et al. since those are also technically armv8a.

BTW the way kernels work in TBLIS is slightly different than in BLIS: a separate config is needed for each uarch that requires distinct blocking parameters or other settings, but multiple configs can share kernels by simply including the proper prototypes and adding a conditional block in the Makefile.am. See the AMD configs for examples.

devinamatthews avatar Apr 15 '21 15:04 devinamatthews

Thanks for the comments. In fact Cortex-A53 and ThunderX2 shared the same block size, but I want to further add TBLIS_CONFIG_?_THREAD_RATIO and TBLIS_CONFIG_?R_MAX_THREAD lines for ThunderX2 so I made it separate.

BTW the way kernels work in TBLIS is slightly different than in BLIS: ...

I see! In fact I've already made both armv8a and armv7a work. Just unsure about threading correctness & Autoconf coding style...

xrq-phys avatar Apr 15 '21 16:04 xrq-phys

Noticed that TBLIS requires block sizes to be compile-time constants (i.e. constexprs). I am currently instantiating configs for two VLs (256 and 512 bits), since VL > 512 does not appear on any public roadmap at the moment. For 128 bits, the SVE GEMM kernels are no different from the NEON ones.
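
As a rough illustration of "one config per vector length", here is a minimal sketch. The names `sve_config_sketch` and `mr_for_runtime_vl` and the 2-vectors-by-10 register blocking are assumptions made for this example, not TBLIS's actual config layout. Block sizes stay `constexpr` within each instantiation, and a runtime VL only selects which instantiation to use.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical config type (not the TBLIS API): one instantiation per SVE vector length.
template <std::size_t VLBits>
struct sve_config_sketch
{
    static_assert(VLBits == 256 || VLBits == 512,
                  "this sketch only instantiates 256- and 512-bit configs");

    // Number of doubles held by one SVE vector of this length.
    static constexpr std::size_t vec_doubles = VLBits / 64;

    // Assumed register blocking: two vectors of rows by ten columns.
    static constexpr std::size_t mr = 2 * vec_doubles;
    static constexpr std::size_t nr = 10;
};

using sve256_sketch = sve_config_sketch<256>;
using sve512_sketch = sve_config_sketch<512>;

// The block sizes are usable in constant expressions.
static_assert(sve256_sketch::mr == 8, "256-bit config: 2 vectors x 4 doubles");
static_assert(sve512_sketch::mr == 16, "512-bit config: 2 vectors x 8 doubles");

// Runtime VL picks one of the pre-instantiated configs; returns 0 if none matches.
inline std::size_t mr_for_runtime_vl(std::size_t vl_bits)
{
    assert(vl_bits % 128 == 0);  // SVE vector lengths are multiples of 128 bits
    switch (vl_bits)
    {
        case 256: return sve256_sketch::mr;
        case 512: return sve512_sketch::mr;
        default:  return 0;  // e.g. 128 bits: fall back to a NEON config instead
    }
}
```

With this shape, supporting a further vector length means adding one more instantiation and one more `case`, not making the block sizes runtime values.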

xrq-phys avatar Apr 16 '21 18:04 xrq-phys

Noticed that TBLIS requires block sizes to be compile-time constants (i.e. constexprs).

Yes, although it would be possible to kludge runtime numbers in there. MR/NR do actually have to be compile-time constants as they are used as template parameters.
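
To show why MR/NR have to be known at compile time once they are template parameters, here is a minimal reference sketch. `gemm_ukr_sketch` is a hypothetical name and the packed-panel layout is an assumption for illustration; it is not the TBLIS micro-kernel interface.

```cpp
#include <array>
#include <cstddef>

// Hypothetical reference micro-kernel: C (MR x NR, column-major) += A_panel * B_panel.
// Assumed packing: a holds MR values per k-step, b holds NR values per k-step.
template <std::size_t MR, std::size_t NR>
void gemm_ukr_sketch(std::size_t k, const double* a, const double* b,
                     double* c, std::size_t ldc)
{
    // The accumulator size depends on MR and NR, so both must be constant expressions.
    std::array<double, MR * NR> acc{};

    for (std::size_t p = 0; p < k; ++p)
        for (std::size_t j = 0; j < NR; ++j)
            for (std::size_t i = 0; i < MR; ++i)
                acc[i + j * MR] += a[i + p * MR] * b[j + p * NR];

    // Add the accumulated tile into C.
    for (std::size_t j = 0; j < NR; ++j)
        for (std::size_t i = 0; i < MR; ++i)
            c[i + j * ldc] += acc[i + j * MR];
}
```

A call such as `gemm_ukr_sketch<8, 10>(k, a, b, c, ldc)` fixes the tile shape at compile time, which lets the compiler fully unroll the two inner loops.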

devinamatthews avatar Apr 16 '21 18:04 devinamatthews

After fixing the *beta == 0 case, all tests now pass for both Armv8a and ArmSVE :D

xrq-phys avatar Oct 02 '21 06:10 xrq-phys