OOM CUDA error on 8 * L4 machine when launching sglang server
Hey!
I'm trying to launch an sglang server with OpenBioLLM-70B using the command `python -m sglang.launch_server --model-path ~/Llama3-OpenBioLLM-70B-Instruct --port 30000`, but I run into two issues:
- It errors out with a CUDA OOM. I tried playing around with all the memory-related arguments but still hit the issue. For example, running
`python -m sglang.launch_server --model-path ~/Llama3-OpenBioLLM-70B-Instruct --port 30000 --mem-fraction-static 0.9 --tp 8 --disable-disk-cache` errors out. I tried decreasing `--mem-fraction-static` and different values of `--tp`, but it still fails. Here is the error:
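For context, a rough back-of-the-envelope check (my own estimate, not something sglang reports) suggests the weights alone use up most of each L4 at tp 8:

```python
# Rough estimate of per-GPU weight memory for a 70B-parameter model in
# fp16/bf16, sharded across 8 GPUs with tensor parallelism.
# All figures are approximate.
params = 70e9          # parameter count
bytes_per_param = 2    # fp16/bf16
tp = 8                 # tensor-parallel size
gpu_mem_gib = 24       # NVIDIA L4 memory

weights_per_gpu_gib = params * bytes_per_param / tp / 2**30
headroom_gib = gpu_mem_gib - weights_per_gpu_gib

print(f"weights per GPU: {weights_per_gpu_gib:.1f} GiB")   # ~16.3 GiB
print(f"headroom for everything else: {headroom_gib:.1f} GiB")  # ~7.7 GiB
```

So the weights fit in principle, but the CUDA context, activations, NCCL buffers, and the static KV-cache pool reserved by `--mem-fraction-static` all have to share the remaining ~8 GiB per GPU, which is why even a small 112 MiB allocation during weight loading can still fail.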
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
server started on [0.0.0.0]:10014
server started on [0.0.0.0]:10017
server started on [0.0.0.0]:10018
server started on [0.0.0.0]:10016
server started on [0.0.0.0]:10019
server started on [0.0.0.0]:10015
server started on [0.0.0.0]:10020
server started on [0.0.0.0]:10021
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
accepted ('127.0.0.1', 55860) with fd 52
welcome ('127.0.0.1', 55860)
accepted ('127.0.0.1', 54770) with fd 32
welcome ('127.0.0.1', 54770)
accepted ('127.0.0.1', 37120) with fd 33
welcome ('127.0.0.1', 37120)
accepted ('127.0.0.1', 38382) with fd 28
welcome ('127.0.0.1', 38382)
accepted ('127.0.0.1', 57702) with fd 29
welcome ('127.0.0.1', 57702)
accepted ('127.0.0.1', 55900) with fd 24
welcome ('127.0.0.1', 55900)
accepted ('127.0.0.1', 37206) with fd 24
welcome ('127.0.0.1', 37206)
accepted ('127.0.0.1', 47836) with fd 24
welcome ('127.0.0.1', 47836)
Rank 4: load weight begin.
Rank 5: load weight begin.
Rank 7: load weight begin.
Rank 0: load weight begin.
Rank 6: load weight begin.
Rank 1: load weight begin.
Rank 2: load weight begin.
Rank 3: load weight begin.
Initialization failed. router_init_state: Traceback (most recent call last):
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/manager.py", line 71, in start_router_process
model_client = ModelRpcClient(server_args, port_args, model_overide_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 724, in __init__
self.step = async_wrap("step")
^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 715, in async_wrap
fs = [rpyc.async_(getattr(m, func_name)) for m in self.model_servers]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 715, in <listcomp>
fs = [rpyc.async_(getattr(m, func_name)) for m in self.model_servers]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/home/mmokaddem_benchsci_com/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 707, in init_model
return self.remote_services[i].ModelRpcServer(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/netref.py", line 239, in __call__
return syncreq(_self, consts.HANDLE_CALL, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/netref.py", line 63, in syncreq
return conn.sync_request(handler, proxy, *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/protocol.py", line 744, in sync_request
return _async_res.value
^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/async_.py", line 111, in value
raise self._obj
rpyc.core.vinegar/torch.cuda._get_exception_class.<locals>.Derived: CUDA out of memory. Tried to allocate 112.00 MiB. GPU
========= Remote Traceback (1) =========
Traceback (most recent call last):
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/protocol.py", line 369, in _dispatch_request
res = self._HANDLERS[handler](self, *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/rpyc/core/protocol.py", line 863, in _handle_call
return obj(*args, **dict(kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 76, in __init__
self.model_runner = ModelRunner(
^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 285, in __init__
self.load_model()
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 323, in load_model
model = model_class(
^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 257, in __init__
self.model = LlamaModel(config, quant_config=quant_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 217, in __init__
[
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 218, in <listcomp>
LlamaDecoderLayer(config, i, quant_config=quant_config)
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 166, in __init__
self.mlp = LlamaMLP(
^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 39, in __init__
self.gate_up_proj = MergedColumnParallelLinear(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 333, in __init__
super().__init__(input_size, sum(output_sizes), bias, gather_output,
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 236, in __init__
self.quant_method.create_weights(self,
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 81, in create_weights
weight = Parameter(torch.empty(output_size_per_partition,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mmokaddem_benchsci_com/.pyenv/versions/venv_sglang/lib/python3.11/site-packages/torch/utils/_device.py", line 78, in __torch_function__
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU
Initialization failed. detoken_init_state: init ok
goodbye ('127.0.0.1', 57702)
goodbye ('127.0.0.1', 37206)
goodbye ('127.0.0.1', 55900)
goodbye ('127.0.0.1', 47836)
goodbye ('127.0.0.1', 54770)
goodbye ('127.0.0.1', 37120)
goodbye ('127.0.0.1', 38382)
goodbye ('127.0.0.1', 55860)`
- It hangs with the following command:
python -m sglang.launch_server --model-path ~/Llama3-OpenBioLLM-70B-Instruct --port 30000 --mem-fraction-static 0.9 --tp 8 --disable-disk-cache
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
server started on [0.0.0.0]:10007
server started on [0.0.0.0]:10004
server started on [0.0.0.0]:10005
server started on [0.0.0.0]:10008
server started on [0.0.0.0]:10006
server started on [0.0.0.0]:10009
server started on [0.0.0.0]:10010
server started on [0.0.0.0]:10011
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
accepted ('127.0.0.1', 44596) with fd 46
welcome ('127.0.0.1', 44596)
accepted ('127.0.0.1', 44648) with fd 33
welcome ('127.0.0.1', 44648)
accepted ('127.0.0.1', 53648) with fd 24
welcome ('127.0.0.1', 53648)
accepted ('127.0.0.1', 33128) with fd 25
welcome ('127.0.0.1', 33128)
accepted ('127.0.0.1', 41686) with fd 25
welcome ('127.0.0.1', 41686)
accepted ('127.0.0.1', 56570) with fd 25
welcome ('127.0.0.1', 56570)
accepted ('127.0.0.1', 48382) with fd 34
welcome ('127.0.0.1', 48382)
accepted ('127.0.0.1', 36272) with fd 29
welcome ('127.0.0.1', 36272)
Rank 4: load weight begin.
Rank 6: load weight begin.
Rank 2: load weight begin.
Rank 5: load weight begin.
Rank 3: load weight begin.
Rank 7: load weight begin.
Rank 1: load weight begin.
Rank 0: load weight begin.
^C
and when I then run `set_default_backend(RuntimeEndpoint("http://localhost:30000"))`, it fails with connection refused:
ConnectionRefusedError                    Traceback (most recent call last)
File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/urllib/request.py:1348, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args)
   1347 try:
-> 1348     h.request(req.get_method(), req.selector, req.data, headers,
   1349         encode_chunked=req.has_header('Transfer-encoding'))
   1350 except OSError as err: # timeout error
File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1286, in HTTPConnection.request(self, method, url, body, headers, encode_chunked)
   1285 """Send a complete request to the server."""
-> 1286 self._send_request(method, url, body, headers, encode_chunked)
File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1332, in HTTPConnection._send_request(self, method, url, body, headers, encode_chunked)
   1331 body = _encode(body, 'body')
-> 1332 self.endheaders(body, encode_chunked=encode_chunked)
File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1281, in HTTPConnection.endheaders(self, message_body, encode_chunked)
   1280 raise CannotSendHeader()
-> 1281 self._send_output(message_body, encode_chunked=encode_chunked)
File ~/github/benchsci/bsci/bazel-bin/tools/virtualenv.runfiles/rules_python~0.28.0~python~python_3_11_x86_64-unknown-linux-gnu/lib/python3.11/http/client.py:1041, in HTTPConnection._send_output(self, message_body, encode_chunked)
   1040 del self._buffer[:]
-> 1041 self.send(msg)
   1043 if message_body is not None:
...
-> 1351 raise URLError(err)
   1352 r = h.getresponse()
   1353 except:
URLError: <urlopen error [Errno 111] Connection refused>
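The connection-refused error just means nothing is listening on port 30000 because the server never finished initializing. As a side note, it can help to poll the server before calling `set_default_backend`. Here is a small generic sketch (the probe URL and timings are my own assumptions, not an sglang API):

```python
import time

def wait_until_ready(probe, timeout_s=600.0, interval_s=5.0):
    """Poll `probe` (a zero-argument callable returning True when the
    server is up) until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False

def http_probe(url="http://localhost:30000/get_model_info"):
    """Example HTTP probe; the endpoint path is an assumption and may
    need adjusting for your sglang version."""
    import urllib.request
    try:
        with urllib.request.urlopen(url, timeout=2):
            return True
    except OSError:
        return False
```

For example, `wait_until_ready(http_probe)` before `set_default_backend(...)` distinguishes "server still loading weights" from "server crashed and will never come up".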
Setup:
- Machine type: g2-standard-96
- GPUs: 8 x NVIDIA L4
- Architecture: x86/64
- sglang version: v0.1.16
It should not be a raw capacity problem: the machine has 192 GB of total GPU memory (24 GB per GPU), and I was able to run inference on this model without sglang. Also, I haven't tried flashinfer, since it only accelerates inference, which isn't my issue for now.
@mounamokaddem Try decreasing `--mem-fraction-static`; sglang needs more free space for its allocations when the tensor-parallel size is large.
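Concretely, that suggestion might look like the following (the value 0.7 is illustrative only and needs tuning per setup):

```shell
# Illustrative: lower the static memory fraction so less of each GPU is
# pre-reserved for the KV-cache pool, leaving more room for weights and
# activations during loading.
python -m sglang.launch_server \
  --model-path ~/Llama3-OpenBioLLM-70B-Instruct \
  --port 30000 \
  --tp 8 \
  --mem-fraction-static 0.7
```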
@hnyls2002 I tried everything. As mentioned above, I played around with all combinations of values; for `--mem-fraction-static` I tried from 0.1 to 0.9, with and without tensor parallelism, but it didn't work.
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
I have exactly the same problem on versions 0.3.3-post1 and 0.3.2 - this project is total garbage.
> @hnyls2002 I tried everything, as mentioned above I played around with all combinations of values; for `--mem-fraction-static` I tried from 0.1 to 0.9 with/without tensor parallelism but it didn't work.

I have encountered the same problem, and trying this method had no effect. Is there any solution now? @hnyls2002 @mounamokaddem