How to launch server with quantized model?
Sorry for a newb question; I couldn't find an answer anywhere. I succeeded in launching the server with the unquantised Mistral 7B:
python3 -m sglang.launch_server --model-path mistralai/Mistral-7B-Instruct-v0.2 --port 42069 --host 0.0.0.0
I'm trying to launch a quantised model like this:
python3 -m sglang.launch_server --model-path TheBloke/Mistral-7B-v0.1-GPTQ:gptq-4bit-32g-actorder_True --port 42069 --host 0.0.0.0
I get this error:
$ python3 -m sglang.launch_server --model-path TheBloke/Mistral-7B-v0.1-GPTQ:gptq-4bit-32g-actorder_True --port 42069 --host 0.0.0.0
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 398, in cached_file
resolved_file = hf_hub_download(
File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
validate_repo_id(arg_value)
File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 164, in validate_repo_id
raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: 'TheBloke/Mistral-7B-v0.1-GPTQ:gptq-4bit-32g-actorder_True'.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/sglang/launch_server.py", line 11, in <module>
launch_server(server_args, None)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/server.py", line 430, in launch_server
tokenizer_manager = TokenizerManager(server_args, port_args)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/tokenizer_manager.py", line 93, in __init__
self.hf_config = get_config(
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/hf_transformers_utils.py", line 33, in get_config
config = AutoConfig.from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1111, in from_pretrained
config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py", line 633, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py", line 688, in _get_config_dict
resolved_config_file = cached_file(
File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 462, in cached_file
raise EnvironmentError(
OSError: Incorrect path_or_model_id: 'TheBloke/Mistral-7B-v0.1-GPTQ:gptq-4bit-32g-actorder_True'. Please provide either the path to a local folder or the repo_id of a model on the Hub.
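As far as I can tell, the colon suffix is just the branch name from TheBloke's readme, not part of the repo id, which is why transformers rejects it. I assume the branch could instead be downloaded with huggingface_hub and the resulting directory passed as the model path; an untested sketch:

# Untested sketch: fetch the GPTQ branch with huggingface_hub, then pass the
# resulting directory to --model-path. "gptq-4bit-32g-actorder_True" is the
# branch (revision) here, not part of the repo id.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="TheBloke/Mistral-7B-v0.1-GPTQ",
    revision="gptq-4bit-32g-actorder_True",
)
print(local_dir)  # use this path with --model-path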
Instead, I tried to download the repository locally with git and then specify the directory as the path:
git clone --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Mistral-7B-v0.1-GPTQ
python3 -m sglang.launch_server --model-path Mistral-7B-v0.1-GPTQ --port 42069 --host 0.0.0.0
But then I get:
Rank 0: load weight begin.
quant_config: GPTQConfig(weight_bits=4, group_size=32, desc_act=True)
Process Process-1:
Traceback (most recent call last):
router init state: Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/manager.py", line 68, in start_router_process
model_client = ModelRpcClient(server_args, port_args)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 606, in __init__
self.model_server.exposed_init_model(0, server_args, port_args)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 62, in exposed_init_model
self.model_runner = ModelRunner(
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 275, in __init__
self.load_model()
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 308, in load_model
model.load_weights(
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 290, in load_weights
for name, loaded_weight in hf_model_weights_iterator(
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/weight_utils.py", line 251, in hf_model_weights_iterator
with safe_open(st_file, framework="pt") as f:
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
detoken init state: init ok
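Could the problem be that git clone fetched git-lfs pointer stubs instead of the actual weights? I've read that safetensors' HeaderTooLarge usually means the file isn't a real safetensors file, and a pointer stub (from cloning without git-lfs installed) would fit. A quick check, assuming the clone directory from above:

# Rough check: a git-lfs pointer stub is ~130 bytes of text starting with
# "version https://git-lfs.github.com/spec/v1"; real GPTQ shards are GB-sized.
import glob, os

for path in glob.glob("Mistral-7B-v0.1-GPTQ/*.safetensors"):
    with open(path, "rb") as f:
        head = f.read(40)
    print(path, os.path.getsize(path), head)
# If these are stubs, "git lfs install" followed by "git lfs pull" inside the
# clone should fetch the real files.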