[Bug] Server crashes after loading (Mixtral 8x7B) on L4
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [X] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- [X] 5. Please use English, otherwise it will be closed.
Describe the bug
The model loads fully and the server starts; it answers /get_model_info, but the process crashes immediately after the first prefill batch is scheduled (see the end of the log below).
server_args=ServerArgs(model_path='/local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer_path='/local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', trust_remote_code=False, context_length=8192, quantization=None, served_model_name='mixtral-8x7b-v0.1', chat_template=None, host='0.0.0.0', port=1234, additional_ports=[1235, 1236, 1237, 1238], mem_fraction_static=0.83, max_running_requests=32, max_num_reqs=32, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=8, stream_interval=1, random_seed=759329088, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=True, disable_disk_cache=False, enable_torch_compile=False, enable_p2p_check=True, enable_mla=False, attention_reduce_in_fp32=False, efficient_weight_load=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[gpu=0] Init nccl begin.
[gpu=5] Init nccl begin.
[gpu=7] Init nccl begin.
[gpu=1] Init nccl begin.
[gpu=3] Init nccl begin.
[gpu=6] Init nccl begin.
[gpu=2] Init nccl begin.
[gpu=4] Init nccl begin.
WARNING 08-23 11:04:07 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(warning repeated 8 times, once per tensor-parallel rank)
[gpu=6] Load weight begin. avail mem=21.65 GB
[gpu=5] Load weight begin. avail mem=21.65 GB
[gpu=7] Load weight begin. avail mem=21.65 GB
[gpu=4] Load weight begin. avail mem=21.65 GB
[gpu=3] Load weight begin. avail mem=21.65 GB
[gpu=1] Load weight begin. avail mem=21.65 GB
[gpu=0] Load weight begin. avail mem=21.65 GB
[gpu=2] Load weight begin. avail mem=21.65 GB
Loading safetensors checkpoint shards: 0% Completed | 0/19 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 5% Completed | 1/19 [00:00<00:13, 1.37it/s]
Loading safetensors checkpoint shards: 11% Completed | 2/19 [00:01<00:14, 1.19it/s]
Loading safetensors checkpoint shards: 16% Completed | 3/19 [00:02<00:14, 1.14it/s]
Loading safetensors checkpoint shards: 21% Completed | 4/19 [00:03<00:13, 1.07it/s]
Loading safetensors checkpoint shards: 26% Completed | 5/19 [00:04<00:13, 1.02it/s]
Loading safetensors checkpoint shards: 32% Completed | 6/19 [00:05<00:13, 1.01s/it]
Loading safetensors checkpoint shards: 37% Completed | 7/19 [00:06<00:12, 1.01s/it]
[gpu=7] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
Loading safetensors checkpoint shards: 42% Completed | 8/19 [00:07<00:11, 1.02s/it]
Loading safetensors checkpoint shards: 47% Completed | 9/19 [00:08<00:09, 1.01it/s]
Loading safetensors checkpoint shards: 53% Completed | 10/19 [00:09<00:08, 1.08it/s]
Loading safetensors checkpoint shards: 58% Completed | 11/19 [00:10<00:07, 1.08it/s]
Loading safetensors checkpoint shards: 63% Completed | 12/19 [00:11<00:06, 1.07it/s]
Loading safetensors checkpoint shards: 68% Completed | 13/19 [00:12<00:05, 1.07it/s]
Loading safetensors checkpoint shards: 74% Completed | 14/19 [00:13<00:04, 1.04it/s]
Loading safetensors checkpoint shards: 79% Completed | 15/19 [00:14<00:03, 1.04it/s]
Loading safetensors checkpoint shards: 84% Completed | 16/19 [00:15<00:02, 1.03it/s]
Loading safetensors checkpoint shards: 89% Completed | 17/19 [00:16<00:02, 1.00s/it]
Loading safetensors checkpoint shards: 95% Completed | 18/19 [00:17<00:00, 1.01it/s]
Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:18<00:00, 1.05it/s]
[gpu=3] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=5] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=4] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=0] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=6] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=1] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=2] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=3] Memory pool end. avail mem=3.63 GB
[gpu=2] Memory pool end. avail mem=3.63 GB
[gpu=5] Memory pool end. avail mem=3.63 GB
[gpu=1] Memory pool end. avail mem=3.63 GB
[gpu=6] Memory pool end. avail mem=3.63 GB
[gpu=7] Memory pool end. avail mem=3.63 GB
[gpu=4] Memory pool end. avail mem=3.63 GB
[gpu=0] Memory pool end. avail mem=3.63 GB
[gpu=1] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=7] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=3] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=6] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=4] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=0] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=5] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=2] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
INFO: Started server process [28350]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:1234/ (Press CTRL+C to quit)
INFO: 127.0.0.1:55458 - "GET /get_model_info HTTP/1.1" 200 OK
[gpu=0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
/usr/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
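No Python traceback is printed; the only output before the process exits is the leaked shared-memory warning above, which looks like one of the tensor-parallel workers dying abruptly. A quick way to check for a kernel-level cause such as an OOM kill or segfault (a sketch, assuming a Linux host where dmesg is readable; nothing here is sglang-specific):

import subprocess

# Scan the kernel log for signs that a worker process was killed
# by the kernel rather than raising a Python exception.
# Assumption: Linux host with permission to read dmesg.
out = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout
for line in out.splitlines():
    low = line.lower()
    if any(k in low for k in ("killed process", "out of memory", "segfault")):
        print(line)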
Reproduction
python -m sglang.launch_server --model-path /local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1 --served-model-name mixtral-8x7b-v0.1 --host 0.0.0.0 --port 1234 --tp 8 --context-length 8192 --max-running-requests 32 --max-num-reqs 32 --disable-cuda-graph --enable-p2p-check
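Any first generation request is enough to trigger the crash. A minimal client call (a sketch assuming sglang's native /generate endpoint on the port configured above; adjust host and prompt as needed):

import requests

# First request after startup; the server crashes as soon as the
# resulting prefill batch is scheduled.
resp = requests.post(
    "http://127.0.0.1:1234/generate",
    json={
        "text": "Hello",
        "sampling_params": {"temperature": 0, "max_new_tokens": 16},
    },
    timeout=60,
)
print(resp.status_code, resp.text)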
Environment
Python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA L4
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.9
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 535.161.07
PyTorch: 2.4.0+cu121
sglang: 0.2.13
flashinfer: 0.1.5+cu124torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.31.0
tqdm: 4.65.0
numpy: 1.23.5
aiohttp: 3.8.5
fastapi: 0.112.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 23.2
PIL: 9.4.0
psutil: 5.9.0
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 23.2.0
vllm: 0.5.4
multipart: 0.0.9
openai: 1.42.0
anthropic: 0.34.1
NVIDIA Topology:
        GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   GPU6   GPU7   CPU Affinity   NUMA Affinity   GPU NUMA ID
GPU0    X      NODE   NODE   NODE   SYS    SYS    SYS    SYS    0-47,96-143    0               N/A
GPU1    NODE   X      NODE   NODE   SYS    SYS    SYS    SYS    0-47,96-143    0               N/A
GPU2    NODE   NODE   X      NODE   SYS    SYS    SYS    SYS    0-47,96-143    0               N/A
GPU3    NODE   NODE   NODE   X      SYS    SYS    SYS    SYS    0-47,96-143    0               N/A
GPU4    SYS    SYS    SYS    SYS    X      NODE   NODE   NODE   48-95,144-191  1               N/A
GPU5    SYS    SYS    SYS    SYS    NODE   X      NODE   NODE   48-95,144-191  1               N/A
GPU6    SYS    SYS    SYS    SYS    NODE   NODE   X      NODE   48-95,144-191  1               N/A
GPU7    SYS    SYS    SYS    SYS    NODE   NODE   NODE   X      48-95,144-191  1               N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
ulimit soft: 1000000