/generate request possibly hanging when `CUDA out of memory` is thrown
I ran `run_batch` with 1000 items and `num_threads=200`. The batch processing gets stuck at 98%, after which the server produces no more console logs. I checked the full log and found some `CUDA out of memory` errors.

Therefore, I suspect that when this error is thrown, the `/generate` request may be left hanging. I've added a retry to `http_request` (see my pull request), and the batch still gets stuck. This is why I suspect such requests hang instead of failing: if they actually failed, the retry mechanism would have kicked in.
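For reference, here is a minimal sketch of the kind of batch call that triggers this. The prompt, item contents, and `max_tokens` value are placeholders rather than my actual workload; only the ~1000 items and `num_threads=200` match the setup above.

```python
import sglang as sgl

# Placeholder program: one generation per item (the real prompts are longer).
@sgl.function
def summarize(s, text):
    s += "Summarize the following abstract:\n" + text + "\n"
    s += sgl.gen("summary", max_tokens=256)

# Point the frontend at the server launched below.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:42069"))

# ~1000 items, 200 concurrent threads; the progress bar stalls around 98%
# once the server hits CUDA OOM.
items = [{"text": f"abstract {i} ..."} for i in range(1000)]
states = summarize.run_batch(items, num_threads=200, progress_bar=True)
```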
Server launch command:

```bash
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 42069 --host 0.0.0.0 --tp-size 1 --mem-fraction-static 0.8
```
The relevant error from the server log:

```
Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 184, in exposed_step
    self.forward_step()
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 199, in forward_step
    self.forward_fill_batch(new_batch)
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 412, in forward_fill_batch
    ) = self.model_runner.forward(
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 510, in forward
    return self.forward_extend(**kwargs)
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 415, in forward_extend
    return self.model.forward(input_ids, input_metadata.positions, input_metadata)
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 270, in forward
    return self.logits_processor(
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/sglang/srt/layers/logits_processor.py", line 50, in forward
    all_logprobs = torch.log(torch.softmax(logits.float(), dim=-1) + 1e-6)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.07 GiB. GPU 0 has a total capacty of 23.68 GiB of which 1.25 GiB is free. Process 82733 has 22.42 GiB memory in use. Of the allocated memory 21.89 GiB is allocated by PyTorch, and 230.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
Full server log here: log.txt
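As a client-side stopgap, the only thing I can think of is putting a hard timeout on the `/generate` call so a hung request eventually raises and can be retried. The sketch below is hypothetical and is not the code from my pull request: `generate_with_timeout` is a made-up helper, and the timeout and retry values are arbitrary; the URL and payload shape follow the server command above and sglang's `/generate` JSON API.

```python
import requests

def generate_with_timeout(prompt, url="http://localhost:42069/generate",
                          timeout_s=120, retries=3):
    """Call /generate with a hard timeout so a hung request fails instead of
    blocking forever, then retry a few times. All values here are arbitrary."""
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": 256, "temperature": 0},
    }
    last_err = None
    for _ in range(retries):
        try:
            resp = requests.post(url, json=payload, timeout=timeout_s)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as err:
            last_err = err  # timed out (likely hung) or failed; try again
    raise RuntimeError(f"/generate failed after {retries} attempts") from last_err
```

Of course this only works around the symptom; the underlying issue is that the server apparently never returns a response for the request that hit the OOM.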