Edward Shogulin
### Details:
- *[ARM] [INT8] FullyConnected*

### Tickets:
- *ticket-id*
### Context

[JIT Emitters](https://github.com/openvinotoolkit/openvino/blob/42f1cb095143f19c0b9ee25836c29748bc8d9bf2/src/plugins/intel_cpu/src/emitters/README.md) are part of the code generation feature (a.k.a. tensor compiler) that automatically produces highly efficient, optimized binary code for fused subgraphs. Each emitter implements a specific operation from low level...
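To illustrate the pattern, here is a minimal sketch of what "one emitter per operation" means; all names below (`CodeGenerator`, `Emitter`, `AddEmitter`) are hypothetical placeholders, not the real OpenVINO classes:

```cpp
#include <cstddef>
#include <vector>

// Placeholder for the JIT assembler backend (e.g. Xbyak on x64, Xbyak_aarch64 on ARM).
struct CodeGenerator {};

// Base interface: one emitter owns the binary-code generation for one operation.
class Emitter {
public:
    virtual ~Emitter() = default;
    // `in`/`out` are vector-register indices assigned by the register allocator.
    virtual void emit(CodeGenerator& gen,
                      const std::vector<std::size_t>& in,
                      const std::vector<std::size_t>& out) const = 0;
};

// Example: an element-wise Add emitter would issue a single vector add over the
// registers it is given, so fused neighbour operations avoid memory round-trips.
class AddEmitter : public Emitter {
public:
    void emit(CodeGenerator& /*gen*/,
              const std::vector<std::size_t>& /*in*/,
              const std::vector<std::size_t>& /*out*/) const override {
        // e.g. gen.fadd(out[0], in[0], in[1]);  // hypothetical assembler call
    }
};
```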
Issues:
1. The low-precision kernel selected by default is not optimal for the platform described below.
2. The low-precision kernel gives only a ~30% performance gain vs. fp16 in multithreaded mode...
In accordance with the documentation, [NEGEMMLowpMatrixMultiplyCore](https://arm-software.github.io/ComputeLibrary/v24.06/classarm__compute_1_1_n_e_g_e_m_m_lowp_matrix_multiply_core.xhtml) supports only limited combinations of `QSYMM8` and `QASYMM8_SIGNED` precisions on inputs:

| src0 | src1 | src2 | dst |
| -- | -- | -- | -- |
| ... | ... | ... | ... |
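A minimal sketch (with assumed shapes and quantization parameters) of checking a candidate precision combination up front through the function's static `validate()` hook, before `configure()` is called:

```cpp
#include "arm_compute/core/QuantizationInfo.h"
#include "arm_compute/core/TensorInfo.h"
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/functions/NEGEMMLowpMatrixMultiplyCore.h"

#include <iostream>

int main() {
    using namespace arm_compute;
    const int M = 64, N = 64, K = 64;  // assumed GEMM shape

    // src0: M x K, src1: K x N, both QASYMM8_SIGNED (TensorShape is width-first).
    TensorInfo src0(TensorShape(K, M), 1, DataType::QASYMM8_SIGNED, QuantizationInfo(0.05f, 0));
    TensorInfo src1(TensorShape(N, K), 1, DataType::QASYMM8_SIGNED, QuantizationInfo(0.05f, 0));
    // Without a fused output stage the destination accumulates in S32.
    TensorInfo dst(TensorShape(N, M), 1, DataType::S32);

    Status s = NEGEMMLowpMatrixMultiplyCore::validate(&src0, &src1, /*c=*/nullptr, &dst);
    std::cout << (bool(s) ? "supported" : s.error_description()) << std::endl;
    return bool(s) ? 0 : 1;
}
```

Running this check first avoids configuring a combination the library would reject at runtime.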
Model:

```mermaid
graph TD;
    Input1["Input src1: fp32"]
    Quantise1["NEQuantizationLayer q_src1: QASYMM8_SIGNED"]
    Input2["Input src2: fp32"]
    Quantise2["NEQuantizationLayer q_src2: QASYMM8_SIGNED"]
    MatMul["NEGEMMLowpMatrixMultiplyCore q_res: S8"]
    Input1-->Quantise1;
    Input2-->Quantise2;
    Quantise1-->MatMul;
    Quantise2-->MatMul;
    MatMul-->Result;
```

Can you confirm that `NEGEMMLowpMatrixMultiplyCore`...
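For reference, a sketch of the graph above in ACL runtime terms, with assumed shapes and quantization parameters and error handling omitted. Note one assumption: producing the S8 result shown in the diagram directly requires describing a fused output stage via `GEMMInfo`; this sketch instead lets the destination accumulate in S32:

```cpp
#include "arm_compute/core/QuantizationInfo.h"
#include "arm_compute/core/TensorInfo.h"
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/functions/NEGEMMLowpMatrixMultiplyCore.h"
#include "arm_compute/runtime/NEON/functions/NEQuantizationLayer.h"
#include "arm_compute/runtime/Tensor.h"

#include <initializer_list>

int main() {
    using namespace arm_compute;
    const int M = 64, N = 64, K = 64;  // assumed GEMM shape

    // TensorShape is width-first: src1 is M x K, src2 is K x N.
    Tensor src1_f32, src2_f32, q_src1, q_src2, dst_s32;
    src1_f32.allocator()->init(TensorInfo(TensorShape(K, M), 1, DataType::F32));
    src2_f32.allocator()->init(TensorInfo(TensorShape(N, K), 1, DataType::F32));
    q_src1.allocator()->init(
        TensorInfo(TensorShape(K, M), 1, DataType::QASYMM8_SIGNED, QuantizationInfo(0.05f, 0)));
    q_src2.allocator()->init(
        TensorInfo(TensorShape(N, K), 1, DataType::QASYMM8_SIGNED, QuantizationInfo(0.05f, 0)));
    dst_s32.allocator()->init(TensorInfo(TensorShape(N, M), 1, DataType::S32));

    // NEQuantizationLayer nodes q_src1 / q_src2 from the diagram.
    NEQuantizationLayer quant1, quant2;
    quant1.configure(&src1_f32, &q_src1);
    quant2.configure(&src2_f32, &q_src2);

    // NEGEMMLowpMatrixMultiplyCore over the two quantized inputs (no bias).
    NEGEMMLowpMatrixMultiplyCore gemmlowp;
    gemmlowp.configure(&q_src1, &q_src2, /*c=*/nullptr, &dst_s32);

    for (Tensor* t : {&src1_f32, &src2_f32, &q_src1, &q_src2, &dst_s32})
        t->allocator()->allocate();

    // Fill src1_f32 / src2_f32 here, then execute the pipeline:
    quant1.run();
    quant2.run();
    gemmlowp.run();
    return 0;
}
```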
Hi guys, I'm extremely interested in speeding up int8 `MatMul` inference with an ARM Compute Library kernel. My model is:

```mermaid
graph TD;
    Input1["Input out: fp32"]
    Quantise1["NEQuantizationLayer out: signed int8"]
    Input2["Input...
```