Casper

293 comments by Casper

> https://github.com/qwopqwop200/llm-awq This is the AWQ code that has been changed to save in a format similar to GPTQ. The differences are:
>
> 1. Some variables have been renamed....
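For context, here is a minimal sketch of what a GPTQ-style packed layer typically stores, which is the kind of format the linked fork appears to target. The tensor names (qweight, qzeros, scales, g_idx) follow the common GPTQ packing convention, and the shapes and layer prefix below are synthetic placeholders, not taken from that repository.

```python
# Hypothetical illustration of a GPTQ-style packed linear layer.
import torch

def describe_gptq_layer(state_dict: dict, prefix: str) -> None:
    """Print the tensors a GPTQ-packed linear layer usually carries."""
    for name in ("qweight", "qzeros", "scales", "g_idx"):
        key = f"{prefix}.{name}"
        if key in state_dict:
            t = state_dict[key]
            print(f"{key}: shape={tuple(t.shape)}, dtype={t.dtype}")
        else:
            print(f"{key}: missing (optional for some packers)")

# Tiny synthetic example standing in for a real quantized checkpoint:
sd = {
    "layer.qweight": torch.zeros(128, 64, dtype=torch.int32),  # packed int weights
    "layer.qzeros": torch.zeros(4, 64, dtype=torch.int32),     # packed zero-points
    "layer.scales": torch.ones(4, 512, dtype=torch.float16),   # per-group scales
}
describe_gptq_layer(sd, "layer")
```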

> FYI: A new quantization technique, SqueezeLLM, which seems promising, has been released 3 days ago, [github](https://github.com/SqueezeAILab/SqueezeLLM), [paper](https://arxiv.org/pdf/2306.07629.pdf)

This looks good after reviewing. The main conclusion is that SqueezeLLM is...

Here is the specific line of code in this repository that prevents packages like AutoGPTQ from quantizing your MPT models: https://github.com/mosaicml/llm-foundry/blob/main/llmfoundry/models/mpt/modeling_mpt.py#L285
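As a hedged illustration (not the exact content of that line), a generic way to see where external tooling trips over a model is to inspect which keyword arguments its `forward()` actually accepts; the checkpoint name and the kwargs checked below are assumptions.

```python
# Illustrative only: check which kwargs a model's forward() accepts.
import inspect
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b",  # assumed example checkpoint
    trust_remote_code=True,
)
sig = inspect.signature(model.forward)
for kwarg in ("output_attentions", "output_hidden_states", "inputs_embeds"):
    print(kwarg, "supported" if kwarg in sig.parameters else "not supported")
```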

> output the attention matrices

Yes, the algorithm will need attention matrices for quantization.

> with flash attention

I cannot give you a definitive answer here - the question is...
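As a minimal sketch of the standard Hugging Face path for getting attention matrices, assuming a model whose eager attention implementation materializes them (gpt2 is used here as a stand-in); fused flash-attention kernels never build the full attention matrix, which is exactly why the flash path is uncertain.

```python
# Sketch: ask a Hugging Face model to return per-layer attention matrices.
# gpt2's eager attention path materializes the full [batch, heads, seq, seq]
# tensors; fused flash-attention kernels do not.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Calibration sample for quantization.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

print(len(out.attentions), out.attentions[0].shape)
```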

> Working on the device_map issue as part of #225

Thank you, this should enable better compatibility with Hugging Face and allow more applications to be built on top of MPT!
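For reference, a minimal sketch of the loading path that device_map support should unlock, where Accelerate shards the checkpoint across available devices at load time; the checkpoint name and settings here are assumed examples.

```python
# Sketch: load a model sharded across available GPUs/CPU via Accelerate.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b",      # assumed example checkpoint
    trust_remote_code=True,
    device_map="auto",      # requires the model to support Accelerate's hooks
    torch_dtype="auto",
)
tok = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")
print(model.hf_device_map)  # shows which device each module landed on
```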

Hi @abhi-mosaic - development is still ongoing. The recent improvements in foundry have made it much more manageable. A PR is open in [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ/pull/73#issuecomment-1578643013) where we are trying to quantize....
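For reference, a rough sketch of the AutoGPTQ-style quantization flow the PR is aiming at; the checkpoint name, calibration text, and bit settings below are illustrative assumptions, not the PR's actual configuration.

```python
# Hedged sketch of an AutoGPTQ-style 4-bit quantization run.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

name = "mosaicml/mpt-7b"  # assumed example checkpoint
tok = AutoTokenizer.from_pretrained(name)
quant_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(
    name, quant_config, trust_remote_code=True
)

# A handful of tokenized calibration samples drive the layer-wise solve.
examples = [tok("Calibration text for GPTQ.", return_tensors="pt")]
model.quantize(examples)
model.save_quantized("mpt-7b-gptq-4bit")
```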

Closing this since it seems foundry has added the support required for GPTQ to run, although GPTQ has not yet implemented quantization for MPT models.

> I'll just add my usual 2c on this subject: I would love if llama.cpp supported all major model types, bringing its hundreds of wonderful features to as many models...

> Too much work. Maybe once I get around to writing a binary that runs an exported ggml graph using CUDA (realistically in a few months at the earliest). A...

MosaicML (the creators of MPT) was just acquired by Databricks for $1.3B, so I expect more LLM initiatives from them. That is even more of an argument to start supporting their Foundry models. @slaren since...