locating code for activation quantization with group size 64?
Hey there! I've got a simple question after a lot of code searching: which part of the code shows that the activation quantization uses a group size of 64?
From quantize_w4a4_from_fpsum_warp (https://github.com/mit-han-lab/nunchaku/blob/main/src/kernels/zgemm/gemm_w4a4.cuh#L460) I learned that input[2][8] (2 × 8 half2_t) holds the statistics for 2 scales. That would mean each input[x] (16 half elements) per row gets one single scale. The scale is then saved to output_scale and later loaded via q_act.scales (I already know that IN_FEATURE_PAD / 64 entries are allocated for it). 😊
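To make sure I'm reading this right, here is my mental model of per-group activation quantization with group size 64, as a minimal host-side C++ sketch. This is my own illustration, not nunchaku's actual kernel: the function name quantize_groups_w4a4 and the symmetric INT4 rounding scheme are assumptions on my part.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical sketch of symmetric per-group INT4 activation quantization
// with group size 64 (for illustration only; not nunchaku's kernel).
constexpr int GROUP_SIZE = 64;

// Quantize one row of activations. The row length is assumed to already be
// padded to a multiple of GROUP_SIZE (the IN_FEATURE_PAD idea). Produces one
// INT4 value per element (stored in an int8_t for clarity) and one scale per
// group of 64 elements, i.e. act.size() / 64 scales in total.
void quantize_groups_w4a4(const std::vector<float>& act,
                          std::vector<int8_t>& q_out,
                          std::vector<float>& scales_out) {
    const int num_groups = static_cast<int>(act.size()) / GROUP_SIZE;
    q_out.resize(act.size());
    scales_out.resize(num_groups);

    for (int g = 0; g < num_groups; ++g) {
        // Per-group statistic: the max absolute value over 64 elements.
        float amax = 0.0f;
        for (int i = 0; i < GROUP_SIZE; ++i)
            amax = std::max(amax, std::fabs(act[g * GROUP_SIZE + i]));

        // One scale per 64 elements; the symmetric signed INT4 range
        // used here is [-7, 7].
        const float scale = amax > 0.0f ? amax / 7.0f : 1.0f;
        scales_out[g] = scale;

        for (int i = 0; i < GROUP_SIZE; ++i) {
            const float q = std::round(act[g * GROUP_SIZE + i] / scale);
            q_out[g * GROUP_SIZE + i] =
                static_cast<int8_t>(std::clamp(q, -7.0f, 7.0f));
        }
    }
}
```

Under this scheme the number of scales is exactly in_features / 64, which matches the IN_FEATURE_PAD / 64 slots allocated for q_act.scales. What I still can't locate is where the group size of 64 itself is enforced for activations in the warp-level code, given that each input[x] only covers 16 half elements per thread.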
Thanks a lot in advance for any pointers!