
Slow weight loading

Open aflah02 opened this issue 1 year ago • 9 comments

Whenever I try to load the Mixtral models, loading takes very long, and at the end, instead of actually starting the server, I get an error similar to the one here: https://github.com/sgl-project/sglang/issues/99. The same saved model works in vLLM. My setup: 2x A100-80GB.

aflah02 avatar Jan 30 '24 14:01 aflah02

It works on my setup (8x A10G). You can add a print statement here https://github.com/sgl-project/sglang/commit/c7fd000debcfce2c3e3f55932c51951420a5f92d to see whether the weight loading is working correctly.

You may also need to use --tp 2 and make sure the process can see two GPUs.
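
For reference, a typical two-GPU launch would look something like the line below (the model path and port are placeholders; CUDA_VISIBLE_DEVICES just makes sure both GPUs are visible to the process):

    CUDA_VISIBLE_DEVICES=0,1 python -m sglang.launch_server --model-path mistralai/Mixtral-8x7B-Instruct-v0.1 --tp 2 --port 30000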

merrymercy avatar Jan 30 '24 15:01 merrymercy

Thanks! I'll try this out. How long did the model loading take for you btw?

aflah02 avatar Jan 30 '24 16:01 aflah02

@merrymercy So I ran this command and it seems the loading does complete but it's stuck here -

....
Loading model.layers.25.block_sparse_moe.experts.3.w3.weight
Loading model.layers.25.block_sparse_moe.experts.4.w1.weight
Loading model.layers.25.block_sparse_moe.experts.4.w2.weight
Loading model.layers.25.block_sparse_moe.experts.4.w3.weight
Loading model.layers.25.block_sparse_moe.experts.5.w1.weight
Loading model.layers.25.block_sparse_moe.experts.5.w2.weight
Loading model.layers.25.block_sparse_moe.experts.5.w3.weight
Loading model.layers.25.block_sparse_moe.experts.6.w1.weight
Loading model.layers.25.block_sparse_moe.experts.6.w2.weight
Loading model.layers.25.block_sparse_moe.experts.6.w3.weight
Loading model.layers.25.block_sparse_moe.experts.7.w1.weight
Loading model.layers.25.block_sparse_moe.experts.7.w2.weight
Rank 0: load weight end.
Rank 0: max_total_num_token=381017, max_prefill_num_token=63502, context_len=32768, model_mode=[]
Rank 1: max_total_num_token=381017, max_prefill_num_token=63502, context_len=32768, model_mode=[]
goodbye ('127.0.0.1', 42394)
goodbye ('127.0.0.1', 50028)

The GPU is occupied btw (screenshot attached).

aflah02 avatar Jan 30 '24 16:01 aflah02

Something similar happens for llama-65b, while llama-2-chat-70b loads just fine.

aflah02 avatar Jan 30 '24 17:01 aflah02

It might be a GPU issue: I tried running these models on different nodes (same config), and when I run llama-65b on the node where llama-2-chat-70b worked, it also seems to work fine. I'll test again with Mixtral on the same node.

aflah02 avatar Jan 30 '24 18:01 aflah02

Just tried: Mixtral works on the other node, but it's very flaky. Sometimes the same model loads up quickly, while other times it doesn't. Do you know what might be going on here @merrymercy? I also tried two Mixtral finetunes but haven't been able to load them so far.

aflah02 avatar Jan 30 '24 21:01 aflah02

I cannot reproduce this, so it would be great if you could help us do some debugging.

Could you please add more print statements to help us identify the line where the program hangs? Is it because of disk problems, or CUDA problems, or RPC problems?

Could you try a smaller --mem-fraction-static (help)?

Here are some pointers:

  • Weight loading https://github.com/sgl-project/sglang/blob/71b54eea7d21a2bb1d8ef340e7002983a29b1d5f/python/sglang/srt/models/mixtral.py#L338-L348
  • RPC calls https://github.com/sgl-project/sglang/blob/71b54eea7d21a2bb1d8ef340e7002983a29b1d5f/python/sglang/srt/managers/router/model_rpc.py#L556-L566
  • RPC init https://github.com/sgl-project/sglang/blob/71b54eea7d21a2bb1d8ef340e7002983a29b1d5f/python/sglang/srt/managers/router/model_rpc.py#L587, https://github.com/sgl-project/sglang/blob/71b54eea7d21a2bb1d8ef340e7002983a29b1d5f/python/sglang/srt/managers/router/model_rpc.py#L601. I am not sure whether sync_request_timeout or other configs are related.
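
As a rough illustration of the kind of print instrumentation meant above (a sketch only; the parameter-lookup and iterator names are placeholders, not sglang's actual internals), something like this separates time spent reading from disk from time spent copying to the GPU:

    import time

    def load_weights_with_timing(named_params, weight_iterator):
        # named_params: {name: parameter}; weight_iterator yields (name, tensor) read from disk.
        t_prev = time.time()
        for name, loaded_weight in weight_iterator:
            t_read = time.time()  # everything up to here is the disk/NFS read
            named_params[name].data.copy_(loaded_weight)  # host -> GPU copy
            t_copy = time.time()
            print(f"{name}: read {t_read - t_prev:.2f}s, copy {t_copy - t_read:.2f}s", flush=True)
            t_prev = t_copy

If the "read" times dominate and fluctuate, the stall is on the storage side rather than CUDA or RPC.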

merrymercy avatar Jan 31 '24 07:01 merrymercy

Thanks for the pointers @merrymercy

I think I have a theory as to what might be going wrong. I'll probe the different parts of the codebase after the ICML deadline but here's what I have so far -

The first time I load a large model it takes very long (which is understandable, because I'm I/O-bottlenecked; I had the same issue with a plain HF/vLLM implementation).

When the model finally loads, the API server does not start. However, if I quickly kill the process once it's loaded and rerun the command, the loading is a lot faster (presumably due to caching), and the API server also starts. So I think the API server waits for some fixed time, and if the model is not loaded within that window, the server does not work properly.

aflah02 avatar Jan 31 '24 07:01 aflah02

@merrymercy I just wanted to add that I have run into the same or a very similar problem: large models just would not load and eventually resulted in a timeout. I've run this through a debugger, and the timeout comes precisely from the RPC "sync_request_timeout" you mention above (it's already set to 1800 now, but for me it sometimes still timed out). The model runner thread is still doing its thing when that happens (i.e., iterating through weights in model.load_weights()).
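
For anyone else hitting this, the knob in question is rpyc's client-side sync_request_timeout, which in plain rpyc is passed through the connection config, roughly like this (host, port, and the value are placeholders; this is a generic rpyc illustration, not sglang's exact code path):

    import rpyc

    # Raise the synchronous-call timeout so a slow remote load_weights() call
    # is not aborted after the default window.
    conn = rpyc.connect("localhost", 18861, config={"sync_request_timeout": 3600})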

For me, this happens when I try to load the model from a distributed network filesystem. It's not very reproducible: the time it takes to load a model fluctuates wildly, and sometimes even a 70b model loads in under the 1800 s timeout, even from that network drive. Copying the model weights to a local SSD solves the problem pretty consistently.

So it seems there isn't a bug in sglang, for me at least it's just a network filesystem issue.

I did notice that loading weights from the network filesystem takes much longer than copying them from there to the local SSD. And if I understand the code correctly, each model worker loads the weights from disk separately. From a quick test, it also seems that the workers don't all load them in the same order. So there might be some potential for optimization there (e.g., load each weight into RAM once, and then copy it to each GPU worker from there), but maybe it's not a high priority.
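
A rough sketch of that idea, purely for illustration (the sharding helper and per-rank parameter dicts are assumptions, not sglang code):

    def load_once_fan_out(weight_iterator, rank_params, shard_for_rank):
        # weight_iterator yields (name, tensor) pairs read from disk once, on the CPU.
        # rank_params is a list of {name: parameter} dicts, one per GPU rank.
        for name, full_weight in weight_iterator:
            for rank, params in enumerate(rank_params):
                if name not in params:
                    continue
                # shard_for_rank slices out this rank's tensor-parallel shard (placeholder).
                shard = shard_for_rank(name, full_weight, rank)
                params[name].data.copy_(shard.to(params[name].device, non_blocking=True))

This way each checkpoint tensor is read from the slow filesystem once instead of once per worker.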

mgerstgrasser avatar May 07 '24 19:05 mgerstgrasser

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

github-actions[bot] avatar Jul 25 '24 06:07 github-actions[bot]