Not able to run AWQ Mixtral on 4xA10
Hi,
I'm trying to run the AWQ version of Mixtral on 4xA10s, but I'm getting the error below. I've also tried with --mem-frac 0.7 and still got the same error.
Model I'm using: https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ

Command:

```shell
python -m sglang.launch_server --model-path /local_disk0/TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ/ --port 30000 --tp 4
```
Code:

```python
from sglang import function, system, user, assistant, gen
import sglang as sgl


@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))


state = multi_turn_question.run(
    question_1="What is the capital of the United Kingdom?",
    question_2="List two local attractions.",
    temperature=0.7,
    stream=True,
)

for out in state.text_iter():
    print(out, end="", flush=True)
print()
```
Error:

```
new fill batch. #seq: 1. #cached_token: 0. #new_token: 34. #remaining_req: 0. #running_req: 0
Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-baadb11a-8dd2-4b96-a2e2-1e5e32b9d151/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 140, in exposed_step
    self.forward_step()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-baadb11a-8dd2-4b96-a2e2-1e5e32b9d151/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-baadb11a-8dd2-4b96-a2e2-1e5e32b9d151/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 155, in forward_step
    self.forward_fill_batch(new_batch)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-baadb11a-8dd2-4b96-a2e2-1e5e32b9d151/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 349, in forward_fill_batch
    next_token_ids, next_token_probs = batch.sample(logits)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-baadb11a-8dd2-4b96-a2e2-1e5e32b9d151/lib/python3.10/site-packages/sglang/srt/managers/router/infer_batch.py", line 375, in sample
    sampled_index = torch.multinomial(probs_sort, num_samples=1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```
(the same traceback is printed three more times)
```
/local_disk0/.ephemeral_nfs/envs/pythonEnv-baadb11a-8dd2-4b96-a2e2-1e5e32b9d151/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py:179: UserWarning: Warning: available_size=391285, max_total_num_token=391319
KV cache pool leak detected!
  warnings.warn(
```
(the same warning is printed three more times)
This issue is probably related to a bug in vLLM; see also https://github.com/vllm-project/vllm/issues/2359
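For context, the `RuntimeError` in the logs above is just `torch.multinomial` validating its input: any NaN, Inf, or negative entry in the probability tensor triggers it. A minimal reproduction, independent of sglang and vLLM (here the NaN stands in for bad logits presumably produced upstream by the quantized kernels):

```python
import torch

# A probability row containing NaN, as would result from softmax over
# logits that already contain NaN/Inf from a broken forward pass.
probs = torch.tensor([[float("nan"), 0.5, 0.5]])

try:
    # This is the exact call that fails in infer_batch.py's sample().
    torch.multinomial(probs, num_samples=1)
except RuntimeError as e:
    print(e)
```

So the sampling step is only where the problem surfaces; the NaNs are generated earlier in the model forward pass.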
I have the same issue, and I also get a KV cache leak warning:
```
INFO: 127.0.0.1:56092 - "GET /get_model_info HTTP/1.1" 200 OK
new fill batch. #seq: 1. #cached_token: 0. #new_token: 21. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 0.00%.
Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/home/conic/.local/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 168, in exposed_step
    self.forward_step()
  File "/home/conic/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/conic/.local/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 183, in forward_step
    self.forward_fill_batch(new_batch)
  File "/home/conic/.local/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 399, in forward_fill_batch
    next_token_ids, next_token_probs = batch.sample(logits)
  File "/home/conic/.local/lib/python3.10/site-packages/sglang/srt/managers/router/infer_batch.py", line 461, in sample
    sampled_index = torch.multinomial(probs_sort, num_samples=1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
/home/conic/.local/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py:210: UserWarning: Warning: available_size=98277, max_total_num_token=98319
KV cache pool leak detected!
  warnings.warn(
```
This version of Mixtral worked for me: https://huggingface.co/casperhansen/mixtral-instruct-awq
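For anyone hitting the same error, only the model path in the original launch command should need to change (a sketch, assuming the same 4-GPU setup and that sglang resolves Hugging Face Hub IDs the usual way; a pre-downloaded local copy of the repo works too):

```shell
# Same server launch as in the original report, but pointing at the
# AWQ build that is reported to work instead of the TheBloke one.
python -m sglang.launch_server \
    --model-path casperhansen/mixtral-instruct-awq \
    --port 30000 --tp 4
```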
Sorry for the delay; I can confirm that @tom-doerr's suggested model works!