oneDNN
Understanding Injectors and evaluating their performance.
Hello, can anyone explain why injectors are used in CPU operations? Specifically, I find injectors for binary and eltwise operations; why not for other operations like batch normalisation? How does one evaluate the performance of injectors, and how do they differ from JIT kernels like binary or eltwise? Also, it would be helpful if anyone could share documents to understand injectors better.
Hi @vishwascm, thank you for the question.
An injector is a technique for moving certain operation fusions from the memory level to the register level, which speeds things up a lot. The technical difficulty is using the proper "assembly objects" (such as GPRs and vector registers) without spoiling the whole operation, since there are also kernel parts that run after the injector. Another challenge is applying the fusion operation correctly (especially binary) given the space of broadcasts, data types, and memory formats of the main and fusion tensors. All of this requires special handling through dedicated classes, which is what the injectors are.
Binary and eltwise operations are simple enough to be fused because there is no dependency between individual tensor values. Batch normalization has two global reductions in the middle, where computing statistics is required, which puts certain restrictions on that simplicity.
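To illustrate the idea in scalar terms (a conceptual sketch, not oneDNN code; the multiply stands in for an arbitrary main operation):

    #include <algorithm>
    #include <cstddef>

    // Unfused: the main op stores dst to memory, then a separate eltwise
    // pass reads and rewrites it (an extra load + store per element).
    void unfused(const float *a, const float *b, float *dst, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) dst[i] = a[i] * b[i];            // main op
        for (std::size_t i = 0; i < n; ++i) dst[i] = std::max(dst[i], 0.f);  // eltwise pass
    }

    // Fused: relu is applied while the value still sits in a register,
    // before the single store. This is what an injector emits inside the
    // JIT kernel at the assembly level.
    void fused(const float *a, const float *b, float *dst, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            float v = a[i] * b[i];    // main op, value lives in a register
            v = std::max(v, 0.f);     // injected eltwise, no round trip to memory
            dst[i] = v;
        }
    }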
To properly measure what fusion provides, one must create a chain of the main primitive plus an eltwise primitive and compare it against the main primitive with an eltwise post-operation attribute.
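For example, the fused path could be set up like this with the oneDNN v3.x C++ API (a minimal sketch with made-up shapes, using matmul as the main primitive; the reference chain would instead run a standalone matmul and a standalone eltwise primitive back to back):

    #include "oneapi/dnnl/dnnl.hpp"
    using namespace dnnl;

    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    memory::desc src_md({16, 32}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc wei_md({32, 64}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc dst_md({16, 64}, memory::data_type::f32, memory::format_tag::ab);

    // Attach relu as a post-op; the injector applies it inside the matmul kernel.
    post_ops po;
    po.append_eltwise(algorithm::eltwise_relu, 0.f, 0.f);
    primitive_attr attr;
    attr.set_post_ops(po);

    auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
    auto mm = matmul(pd);

    memory src_m(src_md, eng), wei_m(wei_md, eng), dst_m(dst_md, eng);
    mm.execute(strm, {{DNNL_ARG_SRC, src_m}, {DNNL_ARG_WEIGHTS, wei_m},
                      {DNNL_ARG_DST, dst_m}});
    strm.wait();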
Injectors are an internal implementation detail, so they don't require any public documentation; the only documentation available is the comments throughout the source files.
Thanks @dzarukin for explaining injectors. Can you please elaborate on how to evaluate the performance of binary injectors for different broadcast strategies using benchdnn, with matmul as the main primitive?
The approach would be the same: compare two execution paths:
- Matmul primitive chained with a binary primitive, as the reference, versus
- Matmul primitive with a binary post-op primitive attribute.
Broadcast strategies are controlled through dimensions. For a pure binary primitive it's done through the dimensions alone, e.g., 2x3x4x5 : 1x1x1x1, where a single point of src1 is applied to each point of src0. More details on broadcast can be found here.
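For instance, the benchdnn binary driver takes both shapes directly (an assumed invocation; the default algorithm is add):

benchdnn --binary --sdt=f32:f32 --ddt=f32 2x3x4x5:1x1x1x1

runs the full-broadcast case above, while 2x3x4x5:2x3x4x5 would be the no-broadcast case.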
For a binary post-op it's controlled through the matmul destination tensor dimensions (as src0) and the binary post-op memory descriptor dimensions (as src1). E.g., matmul inputs of 16x32 : 32x64 make the destination 16x64, and a binary post-op of 1x64 means a broadcast where a single N point from the binary tensor is applied to each M point.
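So a possible benchdnn comparison for that 1x64 case could look like this (an assumed invocation; --mode=P requests performance measurements, and the integer mask 2 marks the N dimension of the 16x64 destination as non-broadcast, i.e., src1 = 1x64):

# fused path: matmul with a binary-add post-op
benchdnn --matmul --mode=P --attr-post-ops=add:f32:2 16x32:32x64
# reference path: matmul followed by a standalone binary primitive
benchdnn --matmul --mode=P 16x32:32x64
benchdnn --binary --mode=P --sdt=f32:f32 16x64:1x64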
Thanks @dzarukin for sharing details.
Hi @dzarukin, I was trying to expand support for binary injectors on AArch64. For this I wanted to know more about how to test the spatial, per_mb, and batch broadcasts, as updated by @tczeszun on x64. I was not able to find the shapes that use these broadcast strategies.
Also, why is prelu expressed through the binary mechanism, as in this commit https://github.com/oneapi-src/oneDNN/commit/d8f240f689b4568e4920624b7510b32152e8d41c ?
@vishwascm , here's a reproducer for those broadcast strategies (in order: per_mb, spatial, batch):
benchdnn --conv --attr-post-ops=add:f32:1,add:f32:3,add:f32:14 ic3oc64_ih224oh112kh7sh2dh0ph3_iw224ow112kw7sw2dw0pw3
They are expressed in terms of the attribute, not in terms of the shape itself; the strategy gets deduced based on the destination memory descriptor. You may find more information on that here, though the info might be slightly outdated regarding which values are supported. I used an integer mask, as the verbose output does.
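Concretely, if I read the mask encoding correctly, each set bit marks a destination dimension along which the post-op tensor keeps its full size, while clear bits are broadcast (size 1). For the convolution above, with destination N x 64 x 112 x 112:
- mask 1 (0b0001) -> src1 = N x 1 x 1 x 1, the per_mb strategy;
- mask 3 (0b0011) -> src1 = N x 64 x 1 x 1, broadcast over the spatial dims;
- mask 14 (0b1110) -> src1 = 1 x 64 x 112 x 112, broadcast over the batch dim.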
For inference, applying prelu as a post-op is the same as applying a binary post-op: find the proper offset into the second tensor and multiply the dst value by the found value (if the dst value was negative). Since the most challenging part is finding the proper element when there's a broadcast, it was expressed identically to the binary post-op.
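In scalar terms, the fused prelu application reduces to something like this (a conceptual sketch; the hypothetical off() stands for the broadcast-aware offset computation that the binary injector already implements):

    #include <cstddef>

    // dst was produced by the main kernel; wei holds the prelu weights.
    // off() maps a dst index to the matching wei element under broadcast;
    // computing this offset is the hard part the injector solves.
    void prelu_post_op(float *dst, const float *wei, std::size_t n,
                       std::size_t (*off)(std::size_t)) {
        for (std::size_t i = 0; i < n; ++i)
            if (dst[i] < 0.f) dst[i] *= wei[off(i)];  // scale negatives only
    }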