neural-speed
Int4 dequantize kernel
Type of Change
feature or bug fix or documentation or others: feature
API changed or not: add a new kernel
Description
Adds an int4 dequantize kernel with very high memory bandwidth utilization.
MTL: kernel bandwidth of ~85 GB/s (reported by VTune) against a hardware maximum of ~85 GB/s (reported by clpeak), i.e. nearly 100% utilization.
ARC 770M: at least 90% VRAM bandwidth utilization.
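For readers unfamiliar with the operation, here is a minimal CPU reference sketch of what an int4 dequantize step computes, plus the utilization arithmetic behind the figures above. It is not the kernel added by this PR. The packing layout (two values per byte, low nibble first, stored with a +8 offset), the per-group float scale, and the names `dequantize_int4` and `bandwidth_utilization_pct` are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout (an assumption, not taken from this PR):
// two 4-bit values per byte, low nibble first, each stored with a +8 offset,
// and one float scale shared by every `group_size` consecutive weights.
std::vector<float> dequantize_int4(const std::vector<uint8_t>& packed,
                                   const std::vector<float>& scales,
                                   std::size_t group_size) {
  const std::size_t n = packed.size() * 2;  // two weights per packed byte
  assert(group_size > 0 && n % group_size == 0);
  assert(scales.size() == n / group_size);

  std::vector<float> out(n);
  for (std::size_t i = 0; i < packed.size(); ++i) {
    const uint8_t byte = packed[i];
    // Unpack both nibbles and remove the +8 offset to get values in [-8, 7].
    const int lo = static_cast<int>(byte & 0x0F) - 8;
    const int hi = static_cast<int>(byte >> 4) - 8;
    out[2 * i] = static_cast<float>(lo) * scales[(2 * i) / group_size];
    out[2 * i + 1] = static_cast<float>(hi) * scales[(2 * i + 1) / group_size];
  }
  return out;
}

// Achieved bandwidth as a percentage of the measured hardware peak.
double bandwidth_utilization_pct(double achieved_gbps, double peak_gbps) {
  return 100.0 * achieved_gbps / peak_gbps;
}
```

Because each output element needs only a mask or shift, a subtraction, and a multiply, a kernel like this tends to be limited by memory traffic rather than compute, which is why bandwidth utilization is the metric quoted here. With the MTL numbers above, `bandwidth_utilization_pct(85.0, 85.0)` evaluates to 100.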
Expected Behavior & Potential Risk
the expected behavior that is triggered by this PR
How has this PR been tested?
how to reproduce the test (including hardware information)
Dependency Change?
any library dependency introduced or removed