transformers-bloom-inference
Add configs to run int4 inference
Add some minor config changes to support int4 inference through DeepSpeed-Inference.
Int4 support will be added to DeepSpeed through this PR.
cc: @stas00
Also, we should probably assert if int4 is attempted to be used without deepspeed>=xyz once the DS PR is merged... it could tentatively be set to the next deepspeed version? Perhaps with an XXX marker so the script can still be used against ds@master.
I can take care of that.
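For illustration, a rough sketch of the guard being discussed. The actual minimum DeepSpeed version isn't known until the DS PR lands, so the threshold is left permissive here; the function and constant names are made up for this example:

```python
from packaging import version

import deepspeed

MIN_DS_VERSION = "0.0.0"  # XXX: bump to the first release that ships int4 support


def check_int4_support(dtype: str) -> None:
    """Fail early if int4 is requested with a deepspeed that can't do it."""
    if dtype == "int4" and version.parse(deepspeed.__version__) < version.parse(MIN_DS_VERSION):
        raise ValueError(
            f"int4 inference requires deepspeed>={MIN_DS_VERSION}, "
            f"but deepspeed=={deepspeed.__version__} is installed"
        )
```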
Sounds good to me. Thanks @stas00
OK, I think I understand the limitations of PyTorch, and it'll only get worse when you try int3, etc., even if int4 is supported.
https://github.com/huggingface/transformers-bloom-inference/pull/37/files#r1026981222
I propose we break the currently proposed API and design a better one.
I propose to have only 2 user-configurable args related to how deepspeed-inference operates:

- `dtype` is the dtype of the original model - so only fp32, fp16 or bf16 - never `intX` (i.e. we drop `int8`)
- `quantization_bits`: [None, 12, 8, 4, 3]
Now the API is simple, unambiguous and future-proof (as in int12 or int3, Mixture of Precisions support).
For back-compat, `deepspeed.init_inference` can simply set `quantization_bits=8` if `dtype==torch.int8` is passed, so the API will remain unbroken.
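For concreteness, a minimal sketch of what a call could look like if this API were adopted. The model name is just an example, and `quantization_bits` is the argument proposed here, not an existing `deepspeed.init_inference` parameter:

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM

# the original model stays in one of the "real" dtypes: fp32 / fp16 / bf16
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m", torch_dtype=torch.float16
)

ds_model = deepspeed.init_inference(
    model,
    dtype=torch.float16,          # dtype of the original weights, never intX
    quantization_bits=4,          # proposed: None, 12, 8, 4 or 3
    replace_with_kernel_inject=True,
)

# back-compat: if a caller still passes dtype=torch.int8, init_inference
# could internally translate it to dtype=torch.float16, quantization_bits=8.
```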
What do you think, Reza?
Huh? Int4? I will test this branch surely and let you know. Thanks a lot for this :)
Hi @stas00, I agree with what you said, and we are going through the same route as you see from my last commit here. Thanks for the good suggestion :) Best, Reza
In this case, we can simply pass the bits to the DeepSpeed-inference config: kwargs['quant']['weight']['num_bits'] = quantization_bits
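For reference, a tiny sketch of the nested config shape that path implies (the surrounding structure is assumed from the quoted path only):

```python
quantization_bits = 4

# nested DeepSpeed-Inference quantization config implied by the path above
kwargs = {"quant": {"weight": {"num_bits": quantization_bits}}}
# equivalent to: kwargs['quant']['weight']['num_bits'] = quantization_bits
```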
May I suggest that the just-added `kwargs['quant']['weight']['num_bits']` isn't the most user-friendly API as far as `kwargs` go?
Why not have a flat structure of simple key=value pairs? Once you have the info on your side, you can re-arrange it into any nesting level you want.
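A hypothetical illustration of that flat key=value idea: the user-facing args stay flat, and the library re-nests them internally however it likes (all names here are illustrative only):

```python
def flat_args_to_nested_config(dtype=None, quantization_bits=None):
    """Re-arrange flat, user-facing args into the nested internal config."""
    config = {"dtype": dtype}
    if quantization_bits is not None:
        config["quant"] = {"weight": {"num_bits": quantization_bits}}
    return config


print(flat_args_to_nested_config(dtype="fp16", quantization_bits=4))
# {'dtype': 'fp16', 'quant': {'weight': {'num_bits': 4}}}
```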
I agree, let me work on that and fix it.
@RezaYazdaniAminabadi -- please see my comment above. https://github.com/huggingface/transformers-bloom-inference/pull/37#discussion_r1027006363
Thanks @awan-10. Please go ahead and push your changes.