transformers-bloom-inference
Add configs to run int4 inference
Add some minor config changes to support int4 inference through DeepSpeed-Inference.
Int4 support will be added to DeepSpeed through this PR.
cc: @stas00
Also, we should probably assert if int4 is attempted to be used without deepspeed>=xyz once the DS PR is merged... it could tentatively be set to the next deepspeed version? Perhaps with an XXX marker so the script can still be used against ds@master.
I can take care of that.
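For illustration, a rough sketch of the guard being discussed. The actual minimum DeepSpeed version isn't known until the DS PR lands, so the threshold is left permissive here; the function and constant names are made up for this example:

```python
from packaging import version

import deepspeed

MIN_DS_VERSION = "0.0.0"  # XXX: bump to the first release that ships int4 support


def check_int4_support(dtype: str) -> None:
    """Fail early if int4 is requested with a deepspeed that can't do it."""
    if dtype == "int4" and version.parse(deepspeed.__version__) < version.parse(MIN_DS_VERSION):
        raise ValueError(
            f"int4 inference requires deepspeed>={MIN_DS_VERSION}, "
            f"but deepspeed=={deepspeed.__version__} is installed"
        )
```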
Sounds good to me. Thanks @stas00
OK, I think I understand the limitations of PyTorch, and it'll only get worse when you try int3, etc., even if int4 is supported.
https://github.com/huggingface/transformers-bloom-inference/pull/37/files#r1026981222
I propose we break the currently proposed API and design a better one.
I propose to have only 2 user-configurable args related to how deepspeed-inference operates:

- `dtype` is the dtype of the original model - so only fp32, fp16 or bf16 - never `intX` (i.e. we drop `int8`)
- `quantization_bits`: [None, 12, 8, 4, 3]
Now the API is simple, unambiguous and future-proof (as in int12 or int3, Mixture of Precisions support).
For back-compat, `deepspeed.init_inference` can simply set `quantization_bits=8` if `dtype==torch.int8` is passed, so the API will remain unbroken.
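For concreteness, a minimal sketch of what a call could look like if this API were adopted. The model name is just an example, and `quantization_bits` is the argument proposed here, not an existing `deepspeed.init_inference` parameter:

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM

# the original model stays in one of the "real" dtypes: fp32 / fp16 / bf16
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m", torch_dtype=torch.float16
)

ds_model = deepspeed.init_inference(
    model,
    dtype=torch.float16,          # dtype of the original weights, never intX
    quantization_bits=4,          # proposed: None, 12, 8, 4 or 3
    replace_with_kernel_inject=True,
)

# back-compat: if a caller still passes dtype=torch.int8, init_inference
# could internally translate it to dtype=torch.float16, quantization_bits=8.
```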
What do you think, Reza?
Huh? Int4? I will test this branch surely and let you know. Thanks a lot for this :)
Hi @stas00, I agree with what you said, and we are going through the same route as you see from my last commit here. Thanks for the good suggestion :) Best, Reza
In this case, we can simply pass the bits to the DeepSpeed-inference config: kwargs['quant']['weight']['num_bits'] = quantization_bits
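For reference, a tiny sketch of the nested config shape that path implies (the surrounding structure is assumed from the quoted path only):

```python
quantization_bits = 4

# nested DeepSpeed-Inference quantization config implied by the path above
kwargs = {"quant": {"weight": {"num_bits": quantization_bits}}}
# equivalent to: kwargs['quant']['weight']['num_bits'] = quantization_bits
```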
May I suggest that the just-added `kwargs['quant']['weight']['num_bits']` isn't the most user-friendly API as far as `kwargs` go?
Why not have a flat structure of simple key=value pairs? Once you have the info on your side, you can re-arrange it into any nesting level you want.
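A hypothetical illustration of that flat key=value idea: the user-facing args stay flat, and the library re-nests them internally however it likes (all names here are illustrative only):

```python
def flat_args_to_nested_config(dtype=None, quantization_bits=None):
    """Re-arrange flat, user-facing args into the nested internal config."""
    config = {"dtype": dtype}
    if quantization_bits is not None:
        config["quant"] = {"weight": {"num_bits": quantization_bits}}
    return config


print(flat_args_to_nested_config(dtype="fp16", quantization_bits=4))
# {'dtype': 'fp16', 'quant': {'weight': {'num_bits': 4}}}
```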
I agree, let me work on that and fix it.
@RezaYazdaniAminabadi -- please see my comment above. https://github.com/huggingface/transformers-bloom-inference/pull/37#discussion_r1027006363
Thanks @awan-10. Please go ahead and push your changes.