Not able to run AWQ Mixtral on 4xA10
Hi,
I'm trying to run the AWQ version of Mixtral on 4xA10s, but I'm getting the error below. I've also tried with --mem-frac 0.7 and still got the same error.
Model I'm using: https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ

Command:

```shell
python -m sglang.launch_server --model-path /local_disk0/TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ/ --port 30000 --tp 4
```
Code:

```python
from sglang import function, system, user, assistant, gen
import sglang as sgl


@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))


state = multi_turn_question.run(
    question_1="What is the capital of the United Kingdom?",
    question_2="List two local attractions.",
    temperature=0.7,
    stream=True,
)

for out in state.text_iter():
    print(out, end="", flush=True)
print()
```
Error:

```
new fill batch. #seq: 1. #cached_token: 0. #new_token: 34. #remaining_req: 0. #running_req: 0
Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-baadb11a-8dd2-4b96-a2e2-1e5e32b9d151/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 140, in exposed_step
    self.forward_step()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-baadb11a-8dd2-4b96-a2e2-1e5e32b9d151/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-baadb11a-8dd2-4b96-a2e2-1e5e32b9d151/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 155, in forward_step
    self.forward_fill_batch(new_batch)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-baadb11a-8dd2-4b96-a2e2-1e5e32b9d151/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 349, in forward_fill_batch
    next_token_ids, next_token_probs = batch.sample(logits)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-baadb11a-8dd2-4b96-a2e2-1e5e32b9d151/lib/python3.10/site-packages/sglang/srt/managers/router/infer_batch.py", line 375, in sample
    sampled_index = torch.multinomial(probs_sort, num_samples=1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```
(the same traceback is printed three more times)
```
/local_disk0/.ephemeral_nfs/envs/pythonEnv-baadb11a-8dd2-4b96-a2e2-1e5e32b9d151/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py:179: UserWarning: Warning: available_size=391285, max_total_num_token=391319
KV cache pool leak detected!
  warnings.warn(
```
(the same warning is printed three more times)
This issue is probably related to a bug in vLLM; see also https://github.com/vllm-project/vllm/issues/2359
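For context, the `RuntimeError` in the logs above is just `torch.multinomial` validating its input: any NaN, Inf, or negative entry in the probability tensor triggers it. A minimal reproduction, independent of sglang and vLLM (here the NaN stands in for bad logits presumably produced upstream by the quantized kernels):

```python
import torch

# A probability row containing NaN, as would result from softmax over
# logits that already contain NaN/Inf from a broken forward pass.
probs = torch.tensor([[float("nan"), 0.5, 0.5]])

try:
    # This is the exact call that fails in infer_batch.py's sample().
    torch.multinomial(probs, num_samples=1)
except RuntimeError as e:
    print(e)
```

So the sampling step is only where the problem surfaces; the NaNs are generated earlier in the model forward pass.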
I have the same issue, and I also get a KV cache leak warning:
```
INFO: 127.0.0.1:56092 - "GET /get_model_info HTTP/1.1" 200 OK
new fill batch. #seq: 1. #cached_token: 0. #new_token: 21. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 0.00%.
Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/home/conic/.local/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 168, in exposed_step
    self.forward_step()
  File "/home/conic/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/conic/.local/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 183, in forward_step
    self.forward_fill_batch(new_batch)
  File "/home/conic/.local/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 399, in forward_fill_batch
    next_token_ids, next_token_probs = batch.sample(logits)
  File "/home/conic/.local/lib/python3.10/site-packages/sglang/srt/managers/router/infer_batch.py", line 461, in sample
    sampled_index = torch.multinomial(probs_sort, num_samples=1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
/home/conic/.local/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py:210: UserWarning: Warning: available_size=98277, max_total_num_token=98319
KV cache pool leak detected!
  warnings.warn(
```
This version of Mixtral worked for me: https://huggingface.co/casperhansen/mixtral-instruct-awq
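For anyone hitting the same error, only the model path in the original launch command should need to change (a sketch, assuming the same 4-GPU setup and that sglang resolves Hugging Face Hub IDs the usual way; a pre-downloaded local copy of the repo works too):

```shell
# Same server launch as in the original report, but pointing at the
# AWQ build that is reported to work instead of the TheBloke one.
python -m sglang.launch_server \
    --model-path casperhansen/mixtral-instruct-awq \
    --port 30000 --tp 4
```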
Sorry for the delay; I can confirm that @tom-doerr's suggested model works!