XNNPACK
High-efficiency floating-point neural network inference operators for mobile, server, and Web
Replaced `gavgpool` and `gsumpool` with `static_reduce`.
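For context, global average pooling (the old `gavgpool`) is just a mean reduction over the spatial axes, which is why it folds into a generic static reduce operator. A minimal C sketch of that equivalence; the function name and NHWC layout here are illustrative, not XNNPACK's API:

```c
#include <stddef.h>

// Global average pooling expressed as a static mean reduction over H and W.
// Dropping the final divide yields the sum reduction (the old gsumpool).
static void global_average_pool(
    const float* input,   // [batch, height, width, channels], NHWC
    float* output,        // [batch, channels]
    size_t batch, size_t height, size_t width, size_t channels) {
  const size_t spatial = height * width;
  for (size_t b = 0; b < batch; b++) {
    for (size_t c = 0; c < channels; c++) {
      float sum = 0.0f;
      for (size_t i = 0; i < spatial; i++) {
        sum += input[(b * spatial + i) * channels + c];
      }
      output[b * channels + c] = sum / (float) spatial;
    }
  }
}
```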
Integration of new KleidiAI kernels.
Set the requantization scale upper bound to 1.0.
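For background: in quantized kernels the requantization scale is `input_scale * weight_scale / output_scale`, and it must be bounded so it can be encoded as a fixed-point multiplier plus a non-negative right shift. A minimal sketch of such a validity check; the helper name is hypothetical, and treating the 1.0 bound as inclusive is an assumption here:

```c
#include <assert.h>

// Hypothetical helper: validate a requantization scale before converting it
// to a fixed-point multiplier. Scales at or below 1.0 can be represented as
// multiplier * 2^-shift with shift >= 0.
static int requantization_scale_is_valid(float input_scale,
                                         float weight_scale,
                                         float output_scale) {
  const float requantization_scale = input_scale * weight_scale / output_scale;
  assert(requantization_scale > 0.0f);
  return requantization_scale <= 1.0f;  // upper bound from this change
}
```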
Simplify the `xnn_pack_f{16,32}_gemm_g{io,oi}_w` functions to use `memcpy` and `memset` where appropriate (`kr=1` and `sr=1`). This significantly speeds up the packing of non-static right-hand operands to the `f32` and `f16` `FullyConnected`...
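When `kr=1` and `sr=1`, the values destined for one packed panel are contiguous in the source matrix, so the scalar gather loop degenerates into whole-row copies plus zero-fill of the padded tail. A simplified sketch of that idea for the GIO (input-major) case; this is illustrative only, not the exact `xnn_pack_f32_gemm_gio_w` signature or layout, and it omits the bias space the real packing reserves:

```c
#include <stddef.h>
#include <string.h>

// Pack an f32 GIO weight matrix [kc, nc] into NR-wide panels when kr == 1
// and sr == 1. For each k, the NR values of a panel are contiguous in the
// source, so the inner gather loop becomes one memcpy; short final panels
// are zero-padded with one memset.
static void pack_gio_kr1_sr1(
    size_t nc, size_t kc, size_t nr,
    const float* weights,  // [kc, nc], row-major (input-major)
    float* packed)         // ceil(nc / nr) panels of [kc, nr]
{
  for (size_t n = 0; n < nc; n += nr) {
    const size_t panel_width = (nc - n) < nr ? (nc - n) : nr;
    for (size_t k = 0; k < kc; k++) {
      memcpy(&packed[k * nr], &weights[k * nc + n],
             panel_width * sizeof(float));
      memset(&packed[k * nr + panel_width], 0,
             (nr - panel_width) * sizeof(float));
    }
    packed += kc * nr;
  }
}
```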
Add 5x8c8 wasmdot kernels, which perform better than the default configuration; in particular, they should see a further runtime speedup after AVX-256 revectorization (see [chromium bug](https://issues.chromium.org/issues/42202660)).
Add qs8 c4 wasmsdot templates, which can perform better than the default configuration. The new templates help generate more efficient AVX-256 revectorized code with very few inserts at run time...
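For context on the tile names in the two items above: `5x8c8` denotes a microkernel that computes an MR=5 by NR=8 block of the output while consuming the K dimension in groups of 8 (`c8`; `c4` uses groups of 4). A scalar reference of that loop structure, assuming signed 8-bit inputs and K a multiple of the group size; this only illustrates the naming, it is not the wasmdot/wasmsdot kernels themselves:

```c
#include <stddef.h>
#include <stdint.h>

// Scalar reference of a 5x8 (MR x NR) GEMM microkernel tile that walks K in
// groups of 8, matching the "5x8c8" naming. Real wasmdot kernels keep the
// 5x8 accumulator block in SIMD registers and compute each c8 group with a
// dot-product instruction.
enum { MR = 5, NR = 8, KR = 8 };

static void gemm_5x8c8_reference(
    size_t kc,         // K, assumed to be a multiple of KR
    const int8_t* a,   // [MR, kc]
    const int8_t* b,   // [kc / KR, NR, KR], packed
    int32_t* c)        // [MR, NR] accumulators
{
  for (size_t m = 0; m < MR; m++) {
    for (size_t n = 0; n < NR; n++) {
      int32_t acc = 0;
      for (size_t k = 0; k < kc; k += KR) {
        // One "c8" group: an 8-wide dot product feeding the accumulator.
        for (size_t kk = 0; kk < KR; kk++) {
          acc += (int32_t) a[m * kc + k + kk] *
                 (int32_t) b[(k / KR) * NR * KR + n * KR + kk];
        }
      }
      c[m * NR + n] = acc;
    }
  }
}
```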
Add initial bits for RNDNU16 requantization.
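RNDNU is XNNPACK's round-to-nearest, ties-up requantization scheme: scale the accumulator by a fixed-point multiplier, then shift right with a rounding bias of 2^(shift-1), which sends ties toward positive infinity. A generic sketch of that arithmetic; the clamp to int8 is only for illustration, and the 16-bit target of the new RNDNU16 variant is an assumption from its name:

```c
#include <assert.h>
#include <stdint.h>

// Round-to-nearest, ties-up (RNDNU) requantization of a 32-bit accumulator.
static int8_t requantize_rndnu(int32_t acc, int32_t multiplier,
                               uint32_t shift, int32_t zero_point) {
  assert(shift >= 1 && shift < 56);
  const int64_t product = (int64_t) acc * (int64_t) multiplier;
  // Adding 2^(shift-1) before the arithmetic right shift rounds to nearest,
  // with ties going up (toward +infinity) for both signs.
  const int64_t rounding = INT64_C(1) << (shift - 1);
  int32_t out = (int32_t) ((product + rounding) >> shift) + zero_point;
  if (out < -128) out = -128;  // clamp to the int8 output range
  if (out > 127) out = 127;
  return (int8_t) out;
}
```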
Put back a missing packing optimization.