oneDNN
Understanding Injectors and evaluating their performance.
Hello, can anyone explain why injectors are used in CPU operations? Specifically, I find injectors for binary and eltwise operations; why not for other operations like batch normalisation? How does one evaluate the performance of injectors, and how do they differ from JIT kernels like binary or eltwise? Also, it would be helpful if anyone could share documents to understand injectors better.
Hi @vishwascm, thank you for the question.
An injector is a technique for moving certain operation fusions from the memory level to the register level, which speeds things up a lot. The technical difficulty is using the proper "assembly objects" (such as GPRs and vector registers) without spoiling the whole operation, since there are also kernel parts that run after the injector. Another challenge is applying the fusion operation correctly (especially binary) given the space of broadcasts, data types, and memory formats of the main and fusion tensors. All of this requires special handling through dedicated classes, which is what the injectors are.
Binary and eltwise operations are simple enough to be fused because there is no dependency between individual tensor values. Batch normalization has two global reductions in the middle, where computing statistics is required, which puts certain restrictions on that simplicity.
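To illustrate the idea in scalar terms (a conceptual sketch, not oneDNN code; the multiply stands in for an arbitrary main operation):

    #include <algorithm>
    #include <cstddef>

    // Unfused: the main op stores dst to memory, then a separate eltwise
    // pass reads and rewrites it (an extra load + store per element).
    void unfused(const float *a, const float *b, float *dst, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) dst[i] = a[i] * b[i];            // main op
        for (std::size_t i = 0; i < n; ++i) dst[i] = std::max(dst[i], 0.f);  // eltwise pass
    }

    // Fused: relu is applied while the value still sits in a register,
    // before the single store. This is what an injector emits inside the
    // JIT kernel at the assembly level.
    void fused(const float *a, const float *b, float *dst, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            float v = a[i] * b[i];    // main op, value lives in a register
            v = std::max(v, 0.f);     // injected eltwise, no round trip to memory
            dst[i] = v;
        }
    }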
To properly measure what fusion provides, one must create a chain of the main primitive plus an eltwise primitive and compare it against the main primitive with an eltwise post-operation attribute.
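For example, the fused path could be set up like this with the oneDNN v3.x C++ API (a minimal sketch with made-up shapes, using matmul as the main primitive; the reference chain would instead run a standalone matmul and a standalone eltwise primitive back to back):

    #include "oneapi/dnnl/dnnl.hpp"
    using namespace dnnl;

    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    memory::desc src_md({16, 32}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc wei_md({32, 64}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc dst_md({16, 64}, memory::data_type::f32, memory::format_tag::ab);

    // Attach relu as a post-op; the injector applies it inside the matmul kernel.
    post_ops po;
    po.append_eltwise(algorithm::eltwise_relu, 0.f, 0.f);
    primitive_attr attr;
    attr.set_post_ops(po);

    auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
    auto mm = matmul(pd);

    memory src_m(src_md, eng), wei_m(wei_md, eng), dst_m(dst_md, eng);
    mm.execute(strm, {{DNNL_ARG_SRC, src_m}, {DNNL_ARG_WEIGHTS, wei_m},
                      {DNNL_ARG_DST, dst_m}});
    strm.wait();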
Injectors are an internal implementation detail, so they don't require any public documentation; the only documentation available is the comments throughout the source files.
Thanks @dzarukin for explaining injectors. Can you please elaborate on how to evaluate the performance of binary injectors for different broadcast strategies using benchdnn, with matmul as the main primitive?
The approach would be the same: compare two execution paths:
- Matmul primitive chained with a binary primitive, as the reference, versus
- Matmul primitive with a binary post-op primitive attribute.
Broadcast strategies are controlled through dimensions. For a pure binary primitive it's done through the dimensions alone, e.g., 2x3x4x5 : 1x1x1x1, where a single point of src1 is applied to each point of src0. More details on broadcast can be found here.
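For instance, the benchdnn binary driver takes both shapes directly (an assumed invocation; the default algorithm is add):

benchdnn --binary --sdt=f32:f32 --ddt=f32 2x3x4x5:1x1x1x1

runs the full-broadcast case above, while 2x3x4x5:2x3x4x5 would be the no-broadcast case.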
For a binary post-op it's controlled through the matmul destination tensor dimensions (as src0) and the binary post-op memory descriptor dimensions (as src1). E.g., matmul inputs of 16x32 : 32x64 make the destination 16x64, and a binary post-op of 1x64 means a broadcast where a single N point from the binary tensor is applied to each M point.
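So a possible benchdnn comparison for that 1x64 case could look like this (an assumed invocation; --mode=P requests performance measurements, and the integer mask 2 marks the N dimension of the 16x64 destination as non-broadcast, i.e., src1 = 1x64):

# fused path: matmul with a binary-add post-op
benchdnn --matmul --mode=P --attr-post-ops=add:f32:2 16x32:32x64
# reference path: matmul followed by a standalone binary primitive
benchdnn --matmul --mode=P 16x32:32x64
benchdnn --binary --mode=P --sdt=f32:f32 16x64:1x64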
Thanks @dzarukin for sharing details.
Hi @dzarukin, I was trying to expand support for binary injectors on AArch64. For this I wanted to know more about how to test the spatial, per_mb, and batch broadcasts, as updated by @tczeszun on x64. I was not able to find the shapes that use these broadcast strategies.
Also, why is prelu expressed through the binary mechanism, as in this commit https://github.com/oneapi-src/oneDNN/commit/d8f240f689b4568e4920624b7510b32152e8d41c ?
@vishwascm , here's a reproducer for those broadcast strategies (in order: per_mb, spatial, batch):
benchdnn --conv --attr-post-ops=add:f32:1,add:f32:3,add:f32:14 ic3oc64_ih224oh112kh7sh2dh0ph3_iw224ow112kw7sw2dw0pw3
They are expressed in terms of the attribute, not in terms of the shape itself; the strategy gets deduced based on the destination memory descriptor. You may find more information on that here, though the info might be slightly outdated regarding which values are supported. I used an integer mask, as the verbose output does.
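Concretely, if I read the mask encoding correctly, each set bit marks a destination dimension along which the post-op tensor keeps its full size, while clear bits are broadcast (size 1). For the convolution above, with destination N x 64 x 112 x 112:
- mask 1 (0b0001) -> src1 = N x 1 x 1 x 1, the per_mb strategy;
- mask 3 (0b0011) -> src1 = N x 64 x 1 x 1, broadcast over the spatial dims;
- mask 14 (0b1110) -> src1 = 1 x 64 x 112 x 112, broadcast over the batch dim.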
For inference, applying prelu as a post-op is the same as applying a binary post-op: find the proper offset into the second tensor and multiply the dst value by the found value (if the dst value was negative). Since the most challenging part is finding the proper element when there's a broadcast, it was expressed identically to the binary post-op.
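In scalar terms, the fused prelu application reduces to something like this (a conceptual sketch; the hypothetical off() stands for the broadcast-aware offset computation that the binary injector already implements):

    #include <cstddef>

    // dst was produced by the main kernel; wei holds the prelu weights.
    // off() maps a dst index to the matching wei element under broadcast;
    // computing this offset is the hard part the injector solves.
    void prelu_post_op(float *dst, const float *wei, std::size_t n,
                       std::size_t (*off)(std::size_t)) {
        for (std::size_t i = 0; i < n; ++i)
            if (dst[i] < 0.f) dst[i] *= wei[off(i)];  // scale negatives only
    }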