Support `quantization_config` argument on HF backend

Open · casper-hansen opened this issue 1 year ago · 7 comments

With AutoAWQ, we can fuse layers for a 2-3x speedup simply by passing a quantization_config. If this argument were supported, it would be possible to evaluate quantized models much faster.

An example config:

```python
from transformers import AwqConfig

quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,
    do_fuse=True,
)
```
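
For context, transformers consumes such a config at model-load time, roughly like this (a minimal sketch; the model ID is a placeholder for any AWQ-quantized checkpoint):

```python
from transformers import AutoModelForCausalLM

# Placeholder model ID: substitute any AWQ-quantized checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-OpenOrca-AWQ",
    quantization_config=quantization_config,  # the AwqConfig defined above
)
```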

https://huggingface.co/docs/transformers/v4.36.1/en/quantization#fusing-modules-for-supported-architectures

casper-hansen · Dec 30 '23

We would be glad to support this feature!

It looks as though we should already support this out of the box when quantization_config is present in the model's HF config.json (modulo potential issues arising from us attempting to place the model onto a device manually?).

Regarding passing a quantization_config kwarg to from_pretrained(): we don't currently have a way to pass --model_args quantization_config=<nested dict of sub-values>, so some changes would be needed to allow supplying such a nested config via the CLI.

Another option would be to have a magic prefix such that any --model_args autogptq_* arg would be passed to init a GPTQConfig, and likewise awq_* args would go to an AwqConfig. Given that these configs can be doubly nested, though, this seems unwieldy.
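
A rough sketch of what that prefix routing could look like (a hypothetical helper; the awq_ prefix handling below is illustrative, not existing harness code):

```python
from transformers import AwqConfig

def route_awq_args(model_args: dict) -> dict:
    # Hypothetical: collect awq_*-prefixed CLI args into an AwqConfig
    # and hand it to from_pretrained() as quantization_config.
    awq_kwargs = {
        key[len("awq_"):]: value
        for key, value in model_args.items()
        if key.startswith("awq_")
    }
    remaining = {
        key: value
        for key, value in model_args.items()
        if not key.startswith("awq_")
    }
    if awq_kwargs:
        remaining["quantization_config"] = AwqConfig(**awq_kwargs)
    return remaining
```

As noted, this breaks down once the config values are themselves nested, which is what makes the JSON-string route discussed below more general.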

Would you be willing to test this functionality (as well as perhaps testing the full range of GPTQConfig values) and contribute a PR to the library? @casper-hansen

haileyschoelkopf · Dec 30 '23

> It looks as though we should already support this out of the box when quantization_config is present in the model's HF config.json (modulo potential issues arising from us attempting to place the model onto a device manually?).

Although it is possible for the user to put this into the config, it requires extra steps. It would be much easier if it could be passed in programmatically, because then I could add support for it in AutoAWQ.

> Regarding passing a quantization_config kwarg to from_pretrained(): we don't currently have a way to pass --model_args quantization_config=<nested dict of sub-values>, so some changes would be needed to allow supplying such a nested config via the CLI.

> Another option would be to have a magic prefix such that any --model_args autogptq_* arg would be passed to init a GPTQConfig, and likewise awq_* args would go to an AwqConfig. Given that these configs can be doubly nested, though, this seems unwieldy.

> Would you be willing to test this functionality (as well as perhaps testing the full range of GPTQConfig values) and contribute a PR to the library? @casper-hansen

I am at capacity in terms of open-source work, so unfortunately I do not have time to implement this functionality myself, but I would be happy to test it once support is added.

casper-hansen · Dec 30 '23

> I am at capacity in terms of open-source work, so unfortunately I do not have time to implement this functionality myself, but I would be happy to test it once support is added.

Thanks nevertheless for raising this issue; it's much appreciated!

I will look into adding support for this via the route of allowing --model_args arg1=<string dict of values we'll call json.loads on>,arg2=..., though I may not prioritize it.

If any other contributors would like to help out, please don't hesitate to comment or assign yourself!

haileyschoelkopf · Dec 30 '23

Hi @haileyschoelkopf

Are you actively working on this issue?

mahimairaja · Dec 30 '23

Please let me know if I can contribute to this issue.

mahimairaja · Dec 30 '23

I’m not, @mahimairaja, go for it!

The places to adapt are in lm_eval.models.huggingface.HFLM and lm_eval.utils.simple_parse_args_string() (to trigger json.loads on {} characters in an arg’s value string), respectively.
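
For concreteness, a minimal sketch of that parser change (the splitting logic here is simplified, not the harness's exact current code):

```python
import json

def simple_parse_args_string(args_string: str) -> dict:
    """Parse 'key1=val1,key2={...}' by splitting on top-level commas only,
    then json.loads any value containing '{' so nested configs survive."""
    parts, current, depth = [], [], 0
    for char in args_string:
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
        if char == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(char)
    if current:
        parts.append("".join(current))
    args_dict = {}
    for part in parts:
        key, value = part.split("=", maxsplit=1)
        # A '{' marks a nested dict supplied as a JSON string.
        args_dict[key] = json.loads(value) if "{" in value else value
    return args_dict
```

With something like this in place, the CLI could accept e.g. --model_args 'pretrained=...,quantization_config={"bits": 4, "do_fuse": true}' (with shell quoting as needed).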

haileyschoelkopf · Dec 30 '23

Thanks Hailey, looking forward to it!

mahimairaja · Dec 30 '23

Hi @mahimairaja, how is this going? Do you need any help with it?

haileyschoelkopf · Jan 12 '24

HF has a new mechanism for adding quantization methods, which any extended quantization integration should take into account and support: https://huggingface.co/docs/transformers/main/en/hf_quantizer

haileyschoelkopf · Jan 31 '24

Actually, because we support passing arbitrary keyword arguments through to AutoModelForCausalLM.from_pretrained(), this is already supported.

You can therefore use the library programmatically with any quantization_config initialized or defined as a dict and then passed into HFLM's init, similar to the example in https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage !
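
For reference, that programmatic route might look like the following (a sketch, assuming an AWQ-quantized checkpoint; the model ID and task are placeholders):

```python
import lm_eval
from lm_eval.models.huggingface import HFLM
from transformers import AwqConfig

quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,
    do_fuse=True,
)

# Extra keyword arguments to HFLM are forwarded to from_pretrained(),
# so the quantization config reaches transformers' model-loading path.
lm = HFLM(
    pretrained="TheBloke/Mistral-7B-OpenOrca-AWQ",  # placeholder AWQ model
    quantization_config=quantization_config,
)

results = lm_eval.simple_evaluate(model=lm, tasks=["hellaswag"])
```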

Separate from this, I will consider whether and how we want to allow users to pass nested configs through the CLI; tracking this in #1366.

At a certain level of complexity, simply using a Python script may make more sense.

haileyschoelkopf · Feb 01 '24