XNNPACK
KleidiAI introduces new 16x4 kernels
KleidiAI is adding new optimized 16x4 kernels, specialized by group size, in this merge request:
https://gitlab.arm.com/kleidi/kleidiai/-/merge_requests/117/diffs#ca37d1189498b4e5ccb6f6dcb9e43501a28cb5c6
The new kernels perform well, so this change bumps the KleidiAI pin and adds them here.
Prefill on Llama models speeds up noticeably: in the run below, prompt evaluation goes from about 292 to about 358 tokens/second (roughly 22% faster).
Before:
I 00:00:05.587790 executorch:stats.h:84] Prompt Tokens: 64 Generated Tokens: 63
I 00:00:05.587793 executorch:stats.h:90] Model Load Time: 3.999000 (seconds)
I 00:00:05.587796 executorch:stats.h:100] Total inference time: 1.579000 (seconds) Rate: 39.898670 (tokens/second)
I 00:00:05.587806 executorch:stats.h:108] Prompt evaluation: 0.219000 (seconds) Rate: 292.237443 (tokens/second)
I 00:00:05.587809 executorch:stats.h:119] Generated 63 tokens: 1.360000 (seconds) Rate: 46.323529 (tokens/second)
I 00:00:05.587812 executorch:stats.h:127] Time to first generated token: 0.219000 (seconds)
I 00:00:05.587816 executorch:stats.h:134] Sampling time over 127 tokens: 0.014000 (seconds)
After:
I 00:00:05.917623 executorch:stats.h:97] Prompt Tokens: 64 Generated Tokens: 63
I 00:00:05.917626 executorch:stats.h:103] Model Load Time: 0.000000 (seconds)
I 00:00:05.917628 executorch:stats.h:113] Total inference time: 1.326000 (seconds) Rate: 47.511312 (tokens/second)
I 00:00:05.917632 executorch:stats.h:121] Prompt evaluation: 0.179000 (seconds) Rate: 357.541899 (tokens/second)
I 00:00:05.917635 executorch:stats.h:132] Generated 63 tokens: 1.147000 (seconds) Rate: 54.925894 (tokens/second)
I 00:00:05.917639 executorch:stats.h:140] Time to first generated token: 0.179000 (seconds)
I 00:00:05.917641 executorch:stats.h:147] Sampling time over 127 tokens: 0.009000 (seconds)
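To make the comparison easier to read, here is a minimal sketch that turns the logged timings into speedup ratios. The numbers are copied from the two logs above; the dictionary layout and the `speedup` helper are illustrative only and are not part of ExecuTorch or XNNPACK. Each log is a single run, so treat the ratios as indicative rather than as a rigorous benchmark.

```python
# Timings in seconds, copied from the stats.h log lines above.
# "prefill" = Prompt evaluation, "decode" = Generated 63 tokens,
# "total" = Total inference time.
before = {"prefill": 0.219, "decode": 1.360, "total": 1.579}
after = {"prefill": 0.179, "decode": 1.147, "total": 1.326}


def speedup(before_s, after_s):
    """Return how many times faster the 'after' run is for the same work."""
    return before_s / after_s


# Hypothetical helper structure: one ratio per phase.
speedups = {phase: speedup(before[phase], after[phase]) for phase in before}

for phase, ratio in speedups.items():
    print(f"{phase}: {ratio:.2f}x ({(ratio - 1) * 100:.1f}% faster)")
```

Because the token counts are identical in both runs (64 prompt tokens, 63 generated tokens), the ratio of times equals the ratio of the tokens/second rates reported in the logs.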
@alankelly @gonnet