
[Feature Request] quantization mode "q8f16_1"

Open taeyeonlee opened this issue 11 months ago • 0 comments

🚀 Feature

Could you please add the quantization mode "q8f16_1"?

/mlc-llm$ mlc_chat convert_weight ./dist/models/phi-2/ --quantization q8f16_1 -o dist/phi-2-q8f16_1-MLC
[2024-03-10 23:12:46] INFO auto_config.py:115: Found model configuration: dist/models/phi-2/config.json
------------------------- Usage -------------------------
usage: MLC AutoLLM Quantization Framework [-h]
       --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft}
       [--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion}]
       [--device DEVICE] [--source SOURCE]
       [--source-format {auto,huggingface-torch,huggingface-safetensor,awq}]
       --output OUTPUT
       config

positional arguments:
  config                1) Path to a HuggingFace model directory that contains a config.json, or
                        2) Path to config.json in HuggingFace format, or
                        3) The name of a pre-defined model architecture.
                        A config.json file in HuggingFace format defines the model architecture, including the
                        vocabulary size, the number of layers, the hidden size, number of attention heads, etc.
                        Example: https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json.
                        A HuggingFace directory often contains a config.json which defines the model architecture,
                        the non-quantized model weights in PyTorch or SafeTensor format, tokenizer configurations,
                        as well as an optional generation_config.json that provides additional default configuration
                        for text generation. Example: https://huggingface.co/codellama/CodeLlama-7b-hf/tree/main.
                        (required)

options:
  -h, --help            show this help message and exit
  --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft}
                        The quantization mode we use to compile. If unprovided, will infer from model.
                        (required, choices: q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2,
                        q4f16_autoawq, q4f16_ft)
  --model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion}
                        Model architecture such as "llama". If not set, it is inferred from mlc-chat-config.json.
                        (default: "auto")
  --device DEVICE       The device used to do quantization such as "cuda" or "cuda:0". Will detect from local
                        available GPUs if not specified. (default: "auto")
  --source SOURCE       The path to original model weight, infer from config if missing. (default: "auto")
  --source-format {auto,huggingface-torch,huggingface-safetensor,awq}
                        The format of source model weight, infer from config if missing.
                        (default: "auto", choices: auto, huggingface-torch, huggingface-safetensor, awq)
  --output OUTPUT, -o OUTPUT
                        The output directory to save the quantized model weight. Will create params_shard_*.bin
                        and ndarray-cache.json in this directory. (required)
------------------------- Error -------------------------
argument --quantization: invalid choice: 'q8f16_1' (choose from 'q0f16', 'q0f32', 'q3f16_0', 'q3f16_1', 'q4f16_0', 'q4f16_1', 'q4f32_1', 'q4f16_2', 'q4f16_autoawq', 'q4f16_ft')
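For context on what the requested mode would mean numerically: by analogy with the existing q4f16_1 group-quantization mode, a q8f16_1 mode would store weights as 8-bit integers with a float16 scale per group. The snippet below is a minimal NumPy sketch of symmetric per-group 8-bit quantization; the function names and the group size of 32 are illustrative assumptions, not MLC-LLM's actual implementation.

```python
import numpy as np

def quantize_q8_group(w: np.ndarray, group_size: int = 32):
    """Symmetric per-group 8-bit quantization with float16 scales.

    Sketch only: group size and layout are assumptions, not MLC-LLM's
    actual q8f16_1 scheme (which does not exist yet).
    """
    orig_shape = w.shape
    groups = w.reshape(-1, group_size).astype(np.float32)
    # One scale per group, chosen so the largest magnitude maps to 127.
    scale = np.abs(groups).max(axis=1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8)  # avoid division by zero for all-zero groups
    q = np.clip(np.round(groups / scale), -127, 127).astype(np.int8)
    return q.reshape(orig_shape), scale.astype(np.float16)

def dequantize_q8_group(q: np.ndarray, scale: np.ndarray, group_size: int = 32):
    groups = q.reshape(-1, group_size).astype(np.float32)
    return (groups * scale.astype(np.float32)).reshape(q.shape).astype(np.float16)

# Round-trip example: 8-bit codes keep the reconstruction error small.
w = np.random.randn(64, 64).astype(np.float16)
q, s = quantize_q8_group(w)
w_hat = dequantize_q8_group(q, s)
print("max abs error:", np.abs(w.astype(np.float32) - w_hat.astype(np.float32)).max())
```

Back-of-envelope, storing phi-2's roughly 2.7B parameters at 8 bits plus one float16 scale per 32 weights would take about 2.9 GB, roughly half of the unquantized q0f16 footprint and about twice that of q4f16_1.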

Motivation

The model (phi-2-q4f16_1) works on an Android device, while the model (phi-2-q0f16) needs more than 5 GB of RAM. I expect that the model (phi-2-q8f16_1) would also run on an Android device, and its accuracy would be better than that of the model (phi-2-q4f16_1).

taeyeonlee · Mar 10 '24 14:03