
[Bug] Server crashes after loading (Mixtral 8x7b) on L4

Open nivibilla opened this issue 6 months ago • 9 comments

Checklist

  • [X] 1. I have searched related issues but cannot get the expected help.
  • [X] 2. The bug has not been fixed in the latest version.
  • [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [X] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • [X] 5. Please use English, otherwise it will be closed.

Describe the bug

The model fully loads and the server starts, but it then crashes instantly on the first request.

server_args=ServerArgs(model_path='/local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer_path='/local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', trust_remote_code=False, context_length=8192, quantization=None, served_model_name='mixtral-8x7b-v0.1', chat_template=None, host='0.0.0.0', port=1234, additional_ports=[1235, 1236, 1237, 1238], mem_fraction_static=0.83, max_running_requests=32, max_num_reqs=32, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=8, stream_interval=1, random_seed=759329088, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=True, disable_disk_cache=False, enable_torch_compile=False, enable_p2p_check=True, enable_mla=False, attention_reduce_in_fp32=False, efficient_weight_load=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[gpu=0] Init nccl begin.
[gpu=5] Init nccl begin.
[gpu=7] Init nccl begin.
[gpu=1] Init nccl begin.
[gpu=3] Init nccl begin.
[gpu=6] Init nccl begin.
[gpu=2] Init nccl begin.
[gpu=4] Init nccl begin.
WARNING 08-23 11:04:07 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 08-23 11:04:07 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 08-23 11:04:07 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 08-23 11:04:07 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 08-23 11:04:07 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 08-23 11:04:07 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 08-23 11:04:07 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 08-23 11:04:07 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[gpu=6] Load weight begin. avail mem=21.65 GB
[gpu=5] Load weight begin. avail mem=21.65 GB
[gpu=7] Load weight begin. avail mem=21.65 GB
[gpu=4] Load weight begin. avail mem=21.65 GB
[gpu=3] Load weight begin. avail mem=21.65 GB
[gpu=1] Load weight begin. avail mem=21.65 GB
[gpu=0] Load weight begin. avail mem=21.65 GB
[gpu=2] Load weight begin. avail mem=21.65 GB
Loading safetensors checkpoint shards:   0% Completed | 0/19 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   5% Completed | 1/19 [00:00<00:13,  1.37it/s]
Loading safetensors checkpoint shards:  11% Completed | 2/19 [00:01<00:14,  1.19it/s]
Loading safetensors checkpoint shards:  16% Completed | 3/19 [00:02<00:14,  1.14it/s]
Loading safetensors checkpoint shards:  21% Completed | 4/19 [00:03<00:13,  1.07it/s]
Loading safetensors checkpoint shards:  26% Completed | 5/19 [00:04<00:13,  1.02it/s]
Loading safetensors checkpoint shards:  32% Completed | 6/19 [00:05<00:13,  1.01s/it]
Loading safetensors checkpoint shards:  37% Completed | 7/19 [00:06<00:12,  1.01s/it]
[gpu=7] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
Loading safetensors checkpoint shards:  42% Completed | 8/19 [00:07<00:11,  1.02s/it]
Loading safetensors checkpoint shards:  47% Completed | 9/19 [00:08<00:09,  1.01it/s]
Loading safetensors checkpoint shards:  53% Completed | 10/19 [00:09<00:08,  1.08it/s]
Loading safetensors checkpoint shards:  58% Completed | 11/19 [00:10<00:07,  1.08it/s]
Loading safetensors checkpoint shards:  63% Completed | 12/19 [00:11<00:06,  1.07it/s]
Loading safetensors checkpoint shards:  68% Completed | 13/19 [00:12<00:05,  1.07it/s]
Loading safetensors checkpoint shards:  74% Completed | 14/19 [00:13<00:04,  1.04it/s]
Loading safetensors checkpoint shards:  79% Completed | 15/19 [00:14<00:03,  1.04it/s]
Loading safetensors checkpoint shards:  84% Completed | 16/19 [00:15<00:02,  1.03it/s]
Loading safetensors checkpoint shards:  89% Completed | 17/19 [00:16<00:02,  1.00s/it]
Loading safetensors checkpoint shards:  95% Completed | 18/19 [00:17<00:00,  1.01it/s]
Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:18<00:00,  1.05it/s]
Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:18<00:00,  1.05it/s]

[gpu=3] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=5] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=4] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=0] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=6] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=1] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=2] Load weight end. type=MixtralForCausalLM, dtype=torch.bfloat16, avail mem=10.75 GB
[gpu=3] Memory pool end. avail mem=3.63 GB
[gpu=2] Memory pool end. avail mem=3.63 GB
[gpu=5] Memory pool end. avail mem=3.63 GB
[gpu=1] Memory pool end. avail mem=3.63 GB
[gpu=6] Memory pool end. avail mem=3.63 GB
[gpu=7] Memory pool end. avail mem=3.63 GB
[gpu=4] Memory pool end. avail mem=3.63 GB
[gpu=0] Memory pool end. avail mem=3.63 GB
[gpu=1] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=7] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=3] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=6] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=4] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=0] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=5] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
[gpu=2] max_total_num_tokens=463405, max_prefill_tokens=16384, max_running_requests=31, context_len=8192
INFO:     Started server process [28350]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:1234/ (Press CTRL+C to quit)
INFO:     127.0.0.1:55458 - "GET /get_model_info HTTP/1.1" 200 OK
[gpu=0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
/usr/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
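As a sanity check on the log numbers, the memory-pool size is consistent with the KV cache for max_total_num_tokens=463405. This sketch assumes the public Mixtral-8x7B config values (32 layers, 8 KV heads, head_dim 128, bf16), which are not stated in the log itself:

```python
# Hedged sanity check: does max_total_num_tokens explain the memory drop
# from 10.75 GB to 3.63 GB per GPU after "Memory pool end"?
# Assumed from the Mixtral-8x7B config (not from this log):
layers, kv_heads, head_dim, dtype_bytes, tp = 32, 8, 128, 2, 8

# Per token, per GPU: K and V for each layer, with KV heads sharded over tp=8.
per_token_bytes = 2 * layers * (kv_heads // tp) * head_dim * dtype_bytes

# max_total_num_tokens is taken from the log above.
pool_gib = 463405 * per_token_bytes / 2**30
print(per_token_bytes, round(pool_gib, 2))  # 16384 bytes/token, ~7.07 GiB
```

That ~7.07 GiB matches the observed 10.75 GB → 3.63 GB drop (~7.1 GB), so the pool sizing itself looks correct; the crash presumably happens elsewhere.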

Reproduction

!python -m sglang.launch_server --model-path /local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1 --served-model-name mixtral-8x7b-v0.1 --host 0.0.0.0 --port 1234 --tp 8 --context-length 8192 --max-running-requests 32 --max-num-reqs 32 --disable-cuda-graph --enable-p2p-check

Environment

Python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA L4
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.9
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 535.161.07
PyTorch: 2.4.0+cu121
sglang: 0.2.13
flashinfer: 0.1.5+cu124torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.31.0
tqdm: 4.65.0
numpy: 1.23.5
aiohttp: 3.8.5
fastapi: 0.112.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 23.2
PIL: 9.4.0
psutil: 5.9.0
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 23.2.0
vllm: 0.5.4
multipart: 0.0.9
openai: 1.42.0
anthropic: 0.34.1
NVIDIA Topology: 
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NODE	NODE	NODE	SYS	SYS	SYS	SYS	0-47,96-143	0		N/A
GPU1	NODE	 X 	NODE	NODE	SYS	SYS	SYS	SYS	0-47,96-143	0		N/A
GPU2	NODE	NODE	 X 	NODE	SYS	SYS	SYS	SYS	0-47,96-143	0		N/A
GPU3	NODE	NODE	NODE	 X 	SYS	SYS	SYS	SYS	0-47,96-143	0		N/A
GPU4	SYS	SYS	SYS	SYS	 X 	NODE	NODE	NODE	48-95,144-191	1		N/A
GPU5	SYS	SYS	SYS	SYS	NODE	 X 	NODE	NODE	48-95,144-191	1		N/A
GPU6	SYS	SYS	SYS	SYS	NODE	NODE	 X 	NODE	48-95,144-191	1		N/A
GPU7	SYS	SYS	SYS	SYS	NODE	NODE	NODE	 X 	48-95,144-191	1		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1000000

nivibilla · Aug 23 '24 11:08