
Fix a bug in the implementation of dequantization for inference


The implementation of dequantization for inference appears to be incorrect.

It sets the number of blocks to ceil(output_size / (hid_cnt * groups)) and the size of each block (aka merge_hidden in dequantize_kernel) to hid_cnt * hidden_dim. However, for some combinations of arguments to launch_dequantize (e.g., output_size=768, hidden_dim=768, and groups=144, which corresponds to a BERT-base model), this leads to the creation of overlapping blocks in dequantize_kernel and hence to erroneous dequantized values.
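
To make the overlap concrete, here is a minimal host-side sketch of that launch arithmetic with the numbers above (a paraphrase of the described logic, not the actual DeepSpeed source; the derivation of hid_cnt from the thread/block geometry is an assumption):

```cpp
#include <cstdio>

int main() {
    // BERT-base example from the description.
    unsigned output_size = 768, hidden_dim = 768, groups = 144;

    // Assumed derivation of hid_cnt: how many hidden rows fit into one
    // 1024-thread block when the kernel reads 4 elements per thread.
    unsigned threads = 1024;
    unsigned elems   = hidden_dim / 4;   // 192
    unsigned hid_cnt = threads / elems;  // 5
    if (hid_cnt == 0) hid_cnt = 1;

    // Baseline: blocks = ceil(output_size / (hid_cnt * groups)).
    unsigned blocks = (output_size + hid_cnt * groups - 1) / (hid_cnt * groups);

    // The grid has groups * blocks blocks, each covering hid_cnt rows,
    // so the covered rows can exceed output_size -> overlapping blocks.
    unsigned covered = groups * blocks * hid_cnt;
    printf("hid_cnt=%u blocks=%u covered=%u rows vs output_size=%u\n",
           hid_cnt, blocks, covered, output_size);  // covered=1440 > 768
    return 0;
}
```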

The proposed solution is to get rid of hid_cnt and simply set the number of blocks to output_size / groups, while the size of each block becomes hidden_dim.
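
A corresponding sketch of the proposed launch arithmetic (again a paraphrase; groups=48 is used here because the grid only tiles the output exactly when output_size is divisible by groups, a point that comes up in the discussion below):

```cpp
#include <cassert>
#include <cstdio>

int main() {
    unsigned output_size = 768, hidden_dim = 768, groups = 48;

    // Proposed scheme: no hid_cnt; each block covers exactly one
    // hidden row, and the grid tiles the output rows exactly.
    assert(output_size % groups == 0);
    unsigned blocks = output_size / groups;  // 16
    unsigned merge_hidden = hidden_dim;      // 768

    printf("blocks=%u merge_hidden=%u covered=%u rows vs output_size=%u\n",
           blocks, merge_hidden, groups * blocks, output_size);  // covered=768
    return 0;
}
```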

The PR includes a simple unit test that demonstrates the problem in the baseline implementation and validates the proposed solution on a number of use cases.

sakogan · May 3, 2023

@loadams any update/ETA on this PR?

FYI, I just rebased onto master. Please let me know if there is anything else I can do to facilitate reviewing/merging this PR.

sakogan · Jun 6, 2023

Getting the right folks to review; sorry for the delay in getting this reviewed/merged.

loadams · Jun 6, 2023

Hi @sakogan ,

Thanks for the PR. Sorry for my delay in getting back to this. The issue that you raised here is valid and definitely needs to be addressed. However, this kernel is aimed at token-wise dequantization, in which case the output dimension needs to be divisible by the number of groups. In the example where you demonstrated the problem, that is not the case, as 768 is not divisible by 144 (I also don't see this case in your unit test), and thus the kernel implementation per se will not give the right result.

So, may I suggest that we revert these changes for this kernel and create another one that takes into account both the hidden and output dimensions for getting the right group_index for reading the scale? For instance, if I understand correctly, each block of 4K elements in the 768×768 matrix will have a different scale (considering 144 scales).
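
For the numbers above, that works out to 768 * 768 / 144 = 4096 elements per scale. A minimal illustrative sketch of that index computation (this is one reading of the suggestion, not existing DeepSpeed code; all names are hypothetical):

```cpp
#include <cstdio>

int main() {
    unsigned output_size = 768, hidden_dim = 768, groups = 144;

    // Each scale covers output_size * hidden_dim / groups elements:
    // 768 * 768 / 144 = 4096 ("each 4K elements").
    unsigned elems_per_group = output_size * hidden_dim / groups;

    // Row-major flat index of an element -> index of its scale.
    unsigned row = 500, col = 300;  // arbitrary example element
    unsigned group_index = (row * hidden_dim + col) / elems_per_group;

    printf("elems_per_group=%u group_index=%u\n", elems_per_group, group_index);
    // Prints: elems_per_group=4096 group_index=93
    return 0;
}
```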

Best, Reza

RezaYazdaniAminabadi · Aug 16, 2023

@RezaYazdaniAminabadi, thanks for the review!

My code indeed assumes that the output dimension is divisible by the number of groups. It even includes this assert, which would trigger if that were not the case.
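
A minimal sketch of such a divisibility check (illustrative only; the exact assert lives in the PR's diff):

```cpp
#include <cassert>

// Hypothetical guard in the launcher: token-wise dequantization requires
// each group to own a whole number of output rows.
void check_launch_args(unsigned output_size, unsigned groups) {
    assert(output_size % groups == 0 &&
           "output_size must be divisible by groups");
}
```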

I think I made a typo in my description -- I meant to say groups=48 instead of groups=144. (The issue also exists in the attention module, where the QKV weights are concatenated and use 3 times the number of groups, i.e., 48*3 = 144, but there the output size is 3 times larger as well.) Anyway, sorry about the confusion! The case of groups=48 is present in my unit test.

sakogan · Aug 29, 2023

Thanks, this now makes sense 👍

RezaYazdaniAminabadi · Sep 14, 2023