OOM CUDA error on 8 * L4 machine when launching sglang server
Hey!
I'm trying to launch an sglang server with OpenBioLLM-70B using the command `python -m sglang.launch_server --model-path ~/Llama3-OpenBioLLM-70B-Instruct --port 30000`, but I run into two issues:
- It errors out with a CUDA OOM. I tried playing around with all the memory-related arguments but still hit the issue. For example, running
`python -m sglang.launch_server --model-path ~/Llama3-OpenBioLLM-70B-Instruct --port 30000 --mem-fraction-static 0.9 --tp 8 --disable-disk-cache` errors out. I tried decreasing `--mem-fraction-static` and different values of `--tp`, but it still fails. Here is the error:
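For context, a rough back-of-the-envelope check (my own estimate, not something sglang reports) suggests the weights alone use up most of each L4 at tp 8:

```python
# Rough estimate of per-GPU weight memory for a 70B-parameter model in
# fp16/bf16, sharded across 8 GPUs with tensor parallelism.
# All figures are approximate.
params = 70e9          # parameter count
bytes_per_param = 2    # fp16/bf16
tp = 8                 # tensor-parallel size
gpu_mem_gib = 24       # NVIDIA L4 memory

weights_per_gpu_gib = params * bytes_per_param / tp / 2**30
headroom_gib = gpu_mem_gib - weights_per_gpu_gib

print(f"weights per GPU: {weights_per_gpu_gib:.1f} GiB")   # ~16.3 GiB
print(f"headroom for everything else: {headroom_gib:.1f} GiB")  # ~7.7 GiB
```

So the weights fit in principle, but the CUDA context, activations, NCCL buffers, and the static KV-cache pool reserved by `--mem-fraction-static` all have to share the remaining ~8 GiB per GPU, which is why even a small 112 MiB allocation during weight loading can still fail.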
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
server started on [0.0.0.0]:10014
server started on [0.0.0.0]:10017
server started on [0.0.0.0]:10018
server started on [0.0.0.0]:10016
server started on [0.0.0.0]:10019
server started on [0.0.0.0]:10015
server started on [0.0.0.0]:10020
server started on [0.0.0.0]:10021
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
accepted ('127.0.0.1', 55860) with fd 52
welcome ('127.0.0.1', 55860)
accepted ('127.0.0.1', 54770) with fd 32
welcome ('127.0.0.1', 54770)
accepted ('127.0.0.1', 37120) with fd 33
welcome ('127.0.0.1', 37120)
accepted ('127.0.0.1', 38382) with fd 28
welcome ('127.0.0.1', 38382)
accepted ('127.0.0.1', 57702) with fd 29
welcome ('127.0.0.1', 57702)
accepted ('127.0.0.1', 55900) with fd 24
welcome ('127.0.0.1', 55900)
accepted ('127.0.0.1', 37206) with fd 24
welcome ('127.0.0.1', 37206)
accepted ('127.0.0.1', 47836) with fd 24
welcome ('127.0.0.1', 47836)
Rank 4: load weight begin.
Rank 5: load weight begin.
Rank 7: load weight begin.
Rank 0: load weight begin.
Rank 6: load weight begin.
Rank 1: load weight begin.
Rank 2: load weight begin.
Rank 3: load weight begin.
Initialization failed. router_init_state: Traceback (most recent call last):
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/manager.py", line 71, in start_router_process
model_client = ModelRpcClient(server_args, port_args, model_overide_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 724, in __init__
self.step = async_wrap("step")
^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 715, in async_wrap
fs = [rpyc.async_(getattr(m, func_name)) for m in self.model_servers]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 715, in <listcomp>
fs = [rpyc.async_(getattr(m, func_name)) for m in self.model_servers]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 707, in init_model
return self.remote_services[i].ModelRpcServer(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/netref.py", line 239, in __call__
return syncreq(_self, consts.HANDLE_CALL, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/netref.py", line 63, in syncreq
return conn.sync_request(handler, proxy, *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/protocol.py", line 744, in sync_request
return _async_res.value
^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/async_.py", line 111, in value
raise self._obj
rpyc.core.vinegar/torch.cuda._get_exception_class.<locals>.Derived: CUDA out of memory. Tried to allocate 112.00 MiB. GPU
========= Remote Traceback (1) =========
Traceback (most recent call last):
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/protocol.py", line 369, in _dispatch_request
res = self._HANDLERS[handler](self, *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/protocol.py", line 863, in _handle_call
return obj(*args, **dict(kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 76, in __init__
self.model_runner = ModelRunner(
^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 285, in __init__
self.load_model()
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 323, in load_model
model = model_class(
^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 257, in __init__
self.model = LlamaModel(config, quant_config=quant_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 217, in __init__
[
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 218, in <listcomp>
LlamaDecoderLayer(config, i, quant_config=quant_config)
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 166, in __init__
self.mlp = LlamaMLP(
^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 39, in __init__
self.gate_up_proj = MergedColumnParallelLinear(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 333, in __init__
super().__init__(input_size, sum(output_sizes), bias, gather_output,
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 236, in __init__
self.quant_method.create_weights(self,
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 81, in create_weights
weight = Parameter(torch.empty(output_size_per_partition,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/torch/utils/_device.py", line 78, in __torch_function__
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU
Initialization failed. detoken_init_state: init ok
goodbye ('127.0.0.1', 57702)
goodbye ('127.0.0.1', 37206)
goodbye ('127.0.0.1', 55900)
goodbye ('127.0.0.1', 47836)
goodbye ('127.0.0.1', 54770)
goodbye ('127.0.0.1', 37120)
goodbye ('127.0.0.1', 38382)
goodbye ('127.0.0.1', 55860)`
- It hangs with the following command:
python -m sglang.launch_server --model-path ~/Llama3-OpenBioLLM-70B-Instruct --port 30000 --mem-fraction-static 0.9 --tp 8 --disable-disk-cache
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
server started on [0.0.0.0]:10007
server started on [0.0.0.0]:10004
server started on [0.0.0.0]:10005
server started on [0.0.0.0]:10008
server started on [0.0.0.0]:10006
server started on [0.0.0.0]:10009
server started on [0.0.0.0]:10010
server started on [0.0.0.0]:10011
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
accepted ('127.0.0.1', 44596) with fd 46
welcome ('127.0.0.1', 44596)
accepted ('127.0.0.1', 44648) with fd 33
welcome ('127.0.0.1', 44648)
accepted ('127.0.0.1', 53648) with fd 24
welcome ('127.0.0.1', 53648)
accepted ('127.0.0.1', 33128) with fd 25
welcome ('127.0.0.1', 33128)
accepted ('127.0.0.1', 41686) with fd 25
welcome ('127.0.0.1', 41686)
accepted ('127.0.0.1', 56570) with fd 25
welcome ('127.0.0.1', 56570)
accepted ('127.0.0.1', 48382) with fd 34
welcome ('127.0.0.1', 48382)
accepted ('127.0.0.1', 36272) with fd 29
welcome ('127.0.0.1', 36272)
Rank 4: load weight begin.
Rank 6: load weight begin.
Rank 2: load weight begin.
Rank 5: load weight begin.
Rank 3: load weight begin.
Rank 7: load weight begin.
Rank 1: load weight begin.
Rank 0: load weight begin.
^C
and when I then run `set_default_backend(RuntimeEndpoint("http://localhost:30000"))`, it fails with connection refused:
ConnectionRefusedError                    Traceback (most recent call last)
File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/urllib/request.py:1348, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args)
   1347 try:
-> 1348     h.request(req.get_method(), req.selector, req.data, headers,
   1349         encode_chunked=req.has_header('Transfer-encoding'))
   1350 except OSError as err: # timeout error
File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1286, in HTTPConnection.request(self, method, url, body, headers, encode_chunked)
   1285 """Send a complete request to the server."""
-> 1286 self._send_request(method, url, body, headers, encode_chunked)
File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1332, in HTTPConnection._send_request(self, method, url, body, headers, encode_chunked)
   1331 body = _encode(body, 'body')
-> 1332 self.endheaders(body, encode_chunked=encode_chunked)
File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1281, in HTTPConnection.endheaders(self, message_body, encode_chunked)
   1280 raise CannotSendHeader()
-> 1281 self._send_output(message_body, encode_chunked=encode_chunked)
File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1041, in HTTPConnection._send_output(self, message_body, encode_chunked)
   1040 del self._buffer[:]
-> 1041 self.send(msg)
   1043 if message_body is not None:
...
-> 1351 raise URLError(err)
   1352 r = h.getresponse()
   1353 except:
URLError: <urlopen error [Errno 111] Connection refused>
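The connection-refused error just means nothing is listening on port 30000 because the server never finished initializing. As a side note, it can help to poll the server before calling `set_default_backend`. Here is a small generic sketch (the probe URL and timings are my own assumptions, not an sglang API):

```python
import time

def wait_until_ready(probe, timeout_s=600.0, interval_s=5.0):
    """Poll `probe` (a zero-argument callable returning True when the
    server is up) until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False

def http_probe(url="http://localhost:30000/get_model_info"):
    """Example HTTP probe; the endpoint path is an assumption and may
    need adjusting for your sglang version."""
    import urllib.request
    try:
        with urllib.request.urlopen(url, timeout=2):
            return True
    except OSError:
        return False
```

For example, `wait_until_ready(http_probe)` before `set_default_backend(...)` distinguishes "server still loading weights" from "server crashed and will never come up".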
Setup:
- Machine type: g2-standard-96
- GPUs: 8 x NVIDIA L4
- Architecture: x86/64
- sglang version: v0.1.16
It should not be a raw capacity problem: the machine has 192 GB of total GPU memory (24 GB per GPU), and I was able to run inference on this model without sglang. Also, I haven't tried flashinfer, since it only accelerates inference, which isn't my issue for now.
@mounamokaddem Try decreasing `--mem-fraction-static`; sglang needs more free space for its allocations when the tensor-parallel size is large.
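Concretely, that suggestion might look like the following (the value 0.7 is illustrative only and needs tuning per setup):

```shell
# Illustrative: lower the static memory fraction so less of each GPU is
# pre-reserved for the KV-cache pool, leaving more room for weights and
# activations during loading.
python -m sglang.launch_server \
  --model-path ~/Llama3-OpenBioLLM-70B-Instruct \
  --port 30000 \
  --tp 8 \
  --mem-fraction-static 0.7
```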
@hnyls2002 I tried everything. As mentioned above, I played around with all combinations of values; for `--mem-fraction-static` I tried from 0.1 to 0.9, with and without tensor parallelism, but it didn't work.
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
I have exactly the same problem on versions 0.3.3-post1 and 0.3.2 - this project is total garbage.
> @hnyls2002 I tried everything, as mentioned above I played around with all combinations of values; for `--mem-fraction-static` I tried from 0.1 to 0.9 with/without tensor parallelism but it didn't work.

I have encountered the same problem, and trying this method had no effect. Is there any solution now? @hnyls2002 @mounamokaddem