
How to launch the server with a quantized model?

Open · Gintasz opened this issue 11 months ago · 8 comments

Sorry for a newbie question; I couldn't find an answer anywhere. I succeeded in launching the server with the unquantised Mistral-7B:

python3 -m sglang.launch_server --model-path mistralai/Mistral-7B-Instruct-v0.2 --port 42069 --host 0.0.0.0
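That one works fine; for reference, a quick request against the native /generate endpoint (as described in the README) returns a completion:

curl http://localhost:42069/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "Once upon a time,", "sampling_params": {"max_new_tokens": 16, "temperature": 0}}'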

I'm trying to launch a quantised model like this:

python3 -m sglang.launch_server --model-path TheBloke/Mistral-7B-v0.1-GPTQ:gptq-4bit-32g-actorder_True --port 42069 --host 0.0.0.0

I get this error:

[email protected]:~$ python3 -m sglang.launch_server --model-path TheBloke/Mistral-7B-v0.1-GPTQ:gptq-4bit-32g-actorder_True --port 42069 --host 0.0.0.0
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 398, in cached_file
    resolved_file = hf_hub_download(
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
    validate_repo_id(arg_value)
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 164, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: 'TheBloke/Mistral-7B-v0.1-GPTQ:gptq-4bit-32g-actorder_True'.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/sglang/launch_server.py", line 11, in <module>
    launch_server(server_args, None)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/server.py", line 430, in launch_server
    tokenizer_manager = TokenizerManager(server_args, port_args)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/tokenizer_manager.py", line 93, in __init__
    self.hf_config = get_config(
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/hf_transformers_utils.py", line 33, in get_config
    config = AutoConfig.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1111, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py", line 633, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py", line 688, in _get_config_dict
    resolved_config_file = cached_file(
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 462, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: 'TheBloke/Mistral-7B-v0.1-GPTQ:gptq-4bit-32g-actorder_True'. Please provide either the path to a local folder or the repo_id of a model on the Hub.
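From the validation error, it seems a Hub repo id can't carry a `:branch` suffix, so presumably the revision has to be fetched separately. One workaround might be to download the branch first and point --model-path at the result (a sketch, assuming the huggingface_hub CLI is available; the local directory name is my own choice):

huggingface-cli download TheBloke/Mistral-7B-v0.1-GPTQ \
    --revision gptq-4bit-32g-actorder_True \
    --local-dir Mistral-7B-v0.1-GPTQ
python3 -m sglang.launch_server --model-path Mistral-7B-v0.1-GPTQ --port 42069 --host 0.0.0.0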

In the meantime, I tried cloning the repository locally with git and then passing the directory as the path:

git clone --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Mistral-7B-v0.1-GPTQ
python3 -m sglang.launch_server --model-path Mistral-7B-v0.1-GPTQ --port 42069 --host 0.0.0.0

But then I get:

Rank 0: load weight begin.
quant_config: GPTQConfig(weight_bits=4, group_size=32, desc_act=True)
Process Process-1:
Traceback (most recent call last):
router init state: Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/manager.py", line 68, in start_router_process
    model_client = ModelRpcClient(server_args, port_args)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 606, in __init__
    self.model_server.exposed_init_model(0, server_args, port_args)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 62, in exposed_init_model
    self.model_runner = ModelRunner(
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 275, in __init__
    self.load_model()
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 308, in load_model
    model.load_weights(
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 290, in load_weights
    for name, loaded_weight in hf_model_weights_iterator(
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/weight_utils.py", line 251, in hf_model_weights_iterator
    with safe_open(st_file, framework="pt") as f:
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge

detoken init state: init ok
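Could the HeaderTooLarge error mean the clone only fetched git-lfs pointer stubs instead of the real safetensors files? If so, something like this might fix it (a sketch, assuming git-lfs is installed on the machine):

git lfs install
git clone --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Mistral-7B-v0.1-GPTQ
cd Mistral-7B-v0.1-GPTQ && git lfs pull   # fetch the actual weight files behind any pointer stubs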

Gintasz · Mar 14 '24 15:03