[Feature Request] quantization mode "q8f16_1"
🚀 Feature
Could you please add the quantization mode "q8f16_1"?
/mlc-llm$ mlc_chat convert_weight ./dist/models/phi-2/ --quantization q8f16_1 -o dist/phi-2-q8f16_1-MLC
[2024-03-10 23:12:46] INFO auto_config.py:115: Found model configuration: dist/models/phi-2/config.json

------------------------- Usage -------------------------
usage: MLC AutoLLM Quantization Framework [-h] --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft}
                                          [--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion}]
                                          [--device DEVICE] [--source SOURCE]
                                          [--source-format {auto,huggingface-torch,huggingface-safetensor,awq}]
                                          --output OUTPUT
                                          config

positional arguments:
  config                1) Path to a HuggingFace model directory that contains a config.json, or
                        2) Path to config.json in HuggingFace format, or
                        3) The name of a pre-defined model architecture.
                        A config.json file in HuggingFace format defines the model architecture, including the
                        vocabulary size, the number of layers, the hidden size, number of attention heads, etc.
                        Example: https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json.
                        A HuggingFace directory often contains a config.json which defines the model architecture,
                        the non-quantized model weights in PyTorch or SafeTensor format, tokenizer configurations,
                        as well as an optional generation_config.json that provides additional default configuration
                        for text generation. Example: https://huggingface.co/codellama/CodeLlama-7b-hf/tree/main.
                        (required)

options:
  -h, --help            show this help message and exit
  --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft}
                        The quantization mode we use to compile. If unprovided, will infer from model.
                        (required, choices: q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq, q4f16_ft)
  --model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion}
                        Model architecture such as "llama". If not set, it is inferred from mlc-chat-config.json.
                        (default: "auto")
  --device DEVICE       The device used to do quantization such as "cuda" or "cuda:0". Will detect from local
                        available GPUs if not specified. (default: "auto")
  --source SOURCE       The path to original model weight, inferred from config if missing. (default: "auto")
  --source-format {auto,huggingface-torch,huggingface-safetensor,awq}
                        The format of source model weight, inferred from config if missing.
                        (default: "auto", choices: auto, huggingface-torch, huggingface-safetensor, awq)
  --output OUTPUT, -o OUTPUT
                        The output directory to save the quantized model weight. Will create params_shard_*.bin
                        and ndarray-cache.json in this directory. (required)

------------------------- Error -------------------------
argument --quantization: invalid choice: 'q8f16_1' (choose from 'q0f16', 'q0f32', 'q3f16_0', 'q3f16_1', 'q4f16_0', 'q4f16_1', 'q4f32_1', 'q4f16_2', 'q4f16_autoawq', 'q4f16_ft')
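For clarity on what is being requested: in mlc-llm's naming scheme, "q8f16_1" would denote 8-bit quantized weights with float16 activations, as the grouped-quantization counterpart of the existing q4f16_1 mode. The sketch below is an illustration only (plain numpy, not the mlc-llm implementation) of per-group symmetric int8 quantization with float16 scales; the group size of 32 is an assumption borrowed from q4f16_1 and may differ in practice.

import numpy as np

def group_quantize_q8(weight, group_size=32):
    """Symmetric per-group int8 quantization with one float16 scale per group."""
    rows, cols = weight.shape
    assert cols % group_size == 0, "columns must be divisible by the group size"
    groups = weight.reshape(rows, cols // group_size, group_size).astype(np.float32)
    scale = np.abs(groups).max(axis=-1, keepdims=True) / 127.0  # map |max| to the int8 range
    scale = np.where(scale == 0.0, 1.0, scale)                  # avoid division by zero
    q = np.clip(np.round(groups / scale), -127, 127).astype(np.int8)
    return q, scale.squeeze(-1).astype(np.float16)

def group_dequantize_q8(q, scale):
    """Reconstruct float16 weights from int8 values and per-group scales."""
    full = q.astype(np.float32) * scale.astype(np.float32)[..., None]
    return full.reshape(q.shape[0], -1).astype(np.float16)

# Round-trip check: the 8-bit error is much smaller than with 4-bit grouping,
# which is the basis for expecting better accuracy than q4f16_1.
w = np.random.randn(8, 64).astype(np.float16)
q, s = group_quantize_q8(w)
print(np.abs(group_dequantize_q8(q, s).astype(np.float32) - w.astype(np.float32)).max())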
Motivation
The model phi-2-q4f16_1 works on an Android device, while phi-2-q0f16 needs more than 5 GB of RAM. I expect that phi-2-q8f16_1 would also work on an Android device, and its accuracy should be better than that of phi-2-q4f16_1.
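As a rough sanity check on the memory argument, here is a back-of-envelope estimate of the weight footprint at each precision, assuming phi-2's roughly 2.7B parameters and ignoring scale/metadata overhead and runtime memory:

# Weight-only memory estimate for phi-2 (~2.7B parameters); actual usage is
# higher once the KV cache, activations and runtime overhead are added.
PARAMS = 2.7e9
for mode, bits in [("q4f16_1", 4), ("q8f16_1", 8), ("q0f16", 16)]:
    print(f"{mode}: ~{PARAMS * bits / 8 / 2**30:.1f} GiB of weights")
# q4f16_1: ~1.3 GiB, q8f16_1: ~2.5 GiB, q0f16: ~5.0 GiB -- consistent with
# q0f16 needing more than 5 GB of RAM, and suggesting an 8-bit variant could
# fit on devices where the 16-bit model does not.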