Casper
It looks like the authors do not plan to support MPT models.
> @TheBloke or anyone else
>
> Do you know what `with act-order` means? From what I get, it means to sort the "activations" before quantizing them. But the trouble...
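For anyone else wondering: in GPTQ, act-order (`desc_act`) does not sort the activations themselves; it quantizes the weight *columns* in order of decreasing activation magnitude, so the most salient channels are handled first. A minimal NumPy sketch of the idea only; the sorting statistic and the round-to-nearest step here are simplifications, not GPTQ's actual error-compensated update:

```python
import numpy as np

def quantize_act_order(W, X, n_bits=4):
    """Quantize weight columns in order of decreasing activation norm.

    W: (out_features, in_features) weight matrix
    X: (n_samples, in_features) calibration activations
    Simplified stand-in for GPTQ's desc_act; real GPTQ also applies an
    error-compensation update to the remaining columns after each step.
    """
    # Per-input-channel activation statistic (diag of X^T X in GPTQ).
    importance = (X ** 2).sum(axis=0)
    order = np.argsort(-importance)  # most active channels first

    Wq = W.copy()
    for j in order:
        col = Wq[:, j]
        scale = np.abs(col).max() / (2 ** (n_bits - 1) - 1) + 1e-12
        Wq[:, j] = np.round(col / scale) * scale  # naive symmetric RTN
    return Wq, order
```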
> I am able to get reasonable result from FasterTransformer + MPT-7B-Storywriter with 2 changes to FasterTransformer

Thanks for documenting this. Not sure if FasterTransformer would accept a PR for...
> It looks as though we should already support it out of the box, when `quantization_config` is in the model's HF config.json. (modulo potential issues arising due to us attempting...
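For reference, the detection works by reading `quantization_config` from the model's `config.json`. A quick way to inspect what a checkpoint advertises (a sketch; the model ID is a placeholder, not one from this thread):

```python
from transformers import AutoConfig

# Placeholder model ID; substitute any AWQ/GPTQ-quantized checkpoint.
cfg = AutoConfig.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.2-AWQ")

# Present on quantized checkpoints, absent otherwise.
qcfg = getattr(cfg, "quantization_config", None)
print(qcfg)  # e.g. {'quant_method': 'awq', 'bits': 4, ...} or None
```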
I have an example script that works with Mixtral: https://github.com/casper-hansen/AutoAWQ/blob/main/examples/basic_vllm.py
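The gist of that script is roughly the following (a minimal sketch; the model ID is illustrative, see the linked file for the exact checkpoint and settings):

```python
from vllm import LLM, SamplingParams

# Illustrative AWQ-quantized Mixtral checkpoint; the linked example
# pins the exact model it was tested with.
llm = LLM(model="casperhansen/mixtral-instruct-awq", quantization="awq")

params = SamplingParams(temperature=0.8, max_tokens=256)
outputs = llm.generate(["What is the capital of France?"], params)
print(outputs[0].outputs[0].text)
```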
I just used the Docker image `runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04` and ran `pip install vllm` inside it.
Could you try the Docker image I referenced to see if it's an environment issue?
Not sure if this relates to #2203. Does it work in FP16 with TP > 1?
Tagging @WoosukKwon @zhuohan123 for visibility. Seems Mixtral has issues with TP > 1 when using AWQ.
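For anyone trying to reproduce, the failing configuration is along these lines (a sketch; the model ID and TP degree are illustrative):

```python
from vllm import LLM

# AWQ + tensor parallelism: the combination that reportedly fails,
# while the same model in FP16 with tensor_parallel_size > 1 works.
llm = LLM(
    model="casperhansen/mixtral-instruct-awq",  # illustrative checkpoint
    quantization="awq",
    tensor_parallel_size=2,
)
```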
Woosuk said it should be fixed in the new 0.2.7 release by PR #2208. Could someone verify with the AWQ version? Reference: https://github.com/vllm-project/vllm/issues/2332#issuecomment-18761736055