
[RFC] Add an LLM engine

Open · JianyuZhan opened this issue 1 year ago

Motivation

This is not complete work; it is just a PoC and a request for comments.

This PR adds an LLM engine, addressing the roadmap item "Add APIs for using the inference engine in a single script without launching a separate server."

The demo usage is in examples/usage/llm_engine.py:

from sglang import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The capital of China is",
    "What is the meaning of life?",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="deepseek-ai/deepseek-llm-7b-chat")

outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for prompt, output in zip(prompts, outputs):
    print('===============================')
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Modification

It adds:

  1. a core Engine, which wraps the core logic of what server.launch_server() currently does, plus shutdown logic to gracefully bring down the ZMQ sockets in the TokenizationManager when the job finishes.
  2. The class SamplingParams is now exposed as an API. (GenerateReqInput carries a dict version of SamplingParams while the internal logic uses the class version, so exposing the class as an API requires a circuitous class-to-dict-to-class transform; this needs to be fixed later.)
  3. Along the way, I also added EngineArgs and made a new ServerArgs a thin wrapper around it (see sglang/srt/serving/engine_args.py and sglang/srt/serving/server_args.py in the commit), plus some config objects built from EngineArgs, like ModelConfig, ScheduleConfig, ParallelConfig, etc., mimicking vLLM. This opens up an opportunity to stop passing ServerArgs around so many internal functions and to draw cleaner APIs for the different sub-components. I have not made that change yet (these files are added but currently have no effect on the server code logic), since it is quite intrusive to the current code base; hence this RFC. A rough sketch of the intended wrapper relationship follows this list.
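
As a rough sketch of the intended wrapper relationship (field and method names here are illustrative, not the exact ones in the commit):

from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class EngineArgs:
    # Illustrative subset of engine-level arguments.
    model: str = "deepseek-ai/deepseek-llm-7b-chat"
    tensor_parallel_size: int = 1
    log_level: str = "info"


@dataclass
class ServerArgs:
    # Only server-specific arguments live here; everything else is forwarded.
    host: str = "127.0.0.1"
    port: int = 30000
    api_key: Optional[str] = None
    engine_args: EngineArgs = field(default_factory=EngineArgs)

    @classmethod
    def from_cli_args(cls, **kwargs):
        # Split a flat argument namespace into engine vs. server arguments,
        # so callers of ServerArgs never need to know about EngineArgs.
        engine_fields = {f.name for f in fields(EngineArgs)}
        engine_kwargs = {k: v for k, v in kwargs.items() if k in engine_fields}
        server_kwargs = {k: v for k, v in kwargs.items() if k not in engine_fields}
        return cls(engine_args=EngineArgs(**engine_kwargs), **server_kwargs)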

Checklist

  • [ ] Before submitting a PR for review, make sure it has passed verification in your local development environment at least.
  • [x] Ensure pre-commit (pre-commit run --all-files) or other linting tools are used to fix potential lint issues.
  • [ ] Confirm that modifications are covered by complete unit tests. If not, please add more unit tests for correctness.
  • [x] Modify documentation as needed, such as docstrings or example tutorials.

JianyuZhan avatar Aug 16 '24 17:08 JianyuZhan

@JianyuZhan Can you fully verify locally before committing now? I'm currently troubleshooting CI issues, which will be affected.

zhyncs avatar Aug 17 '24 12:08 zhyncs

Hi, @Ying1123 , @zhyncs , this PR is now complete and has passed all CI tests (the previous e2e-test failure was due to a missing PYTHONPATH setting, so I added one in e2e-test.yaml and the test passed). It is kept rebased onto the upstream/main branch and is ready for review.

This PR makes modifications as below:

  1. Added an Engine and EngineArgs, in parallel with Server and ServerArgs (all in sglang/srt/serving/). ServerArgs is now a thin wrapper over EngineArgs, holding server-specific args like host, port, api key, and OpenAI-API-related settings; it transparently passes through all args belonging to EngineArgs, so users of ServerArgs stay EngineArgs-agnostic. Server is built on Engine, which results in a succinct launch_server API and implementation.
  2. Based on 1, we now have two serving methods: the old server API, and the Engine API that runs without a server (see examples/usage/llm_engine.py). I therefore put them in a standalone folder, sglang/srt/serving/.
  3. Along the way, we can now get rid of ServerArgs in the internal APIs (managers/, model_executor/, etc.). Instead we use ModelConfig, ScheduleConfig, ParallelConfig, OptimizationConfig, and ObservabilityConfig (all created from EngineArgs at Engine creation time), and the internal APIs communicate with each other through these *Config objects. The result is a set of quite coherent interfaces, in terms of component abstraction and dependencies; see the sketch after this list.
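
A minimal sketch of how the *Config objects might be built from EngineArgs at Engine creation time (class and field names are illustrative; the real definitions live in sglang/srt/serving/):

from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Illustrative fields only.
    model_path: str
    context_length: int = 4096


@dataclass
class ParallelConfig:
    tensor_parallel_size: int = 1


class Engine:
    def __init__(self, model_config: ModelConfig, parallel_config: ParallelConfig):
        # Internal components receive the narrow *Config objects they need
        # instead of the full ServerArgs.
        self.model_config = model_config
        self.parallel_config = parallel_config

    @classmethod
    def from_engine_args(cls, engine_args):
        # Build the per-component configs once, when the Engine is created.
        model_config = ModelConfig(model_path=engine_args.model)
        parallel_config = ParallelConfig(
            tensor_parallel_size=engine_args.tensor_parallel_size
        )
        return cls(model_config, parallel_config)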

JianyuZhan avatar Aug 18 '24 05:08 JianyuZhan

I have rebased onto the latest upstream/main branch, and it now passes all CI tests.

JianyuZhan avatar Aug 19 '24 04:08 JianyuZhan

Your intention is to add an LLM Engine for offline use; can we just modify the necessary parts?

zhyncs avatar Aug 21 '24 07:08 zhyncs

Do we need to redo the abstractions? Can't we just wrap a layer around the original interface and implement the LLM Engine instead?

zhyncs avatar Aug 21 '24 07:08 zhyncs

One scenario where this API would be particularly useful is integrating it with the Triton Server Python Backend, similar to what's done here: https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/serve/turbomind/triton_python_backend/model.py. In other situations, using OpenAI Server could be a better option. In the above example, the encapsulation of the pipeline and the calling of the generate interface are very appropriate. Perhaps we can take it as a reference?
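
For illustration only, a wrapper along those lines might look roughly like the sketch below, assuming Triton's python_backend API (pb_utils) and the offline LLM API from this PR; the "prompt"/"text" tensor names are made up for the sketch:

import numpy as np
import triton_python_backend_utils as pb_utils

from sglang import LLM, SamplingParams


class TritonPythonModel:
    def initialize(self, args):
        # Load the offline engine once per model instance.
        self.llm = LLM(model="deepseek-ai/deepseek-llm-7b-chat")
        self.sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    def execute(self, requests):
        responses = []
        for request in requests:
            prompt_tensor = pb_utils.get_input_tensor_by_name(request, "prompt")
            prompts = [p.decode("utf-8") for p in prompt_tensor.as_numpy().flatten()]
            outputs = self.llm.generate(prompts, self.sampling_params)
            texts = np.array([o["text"] for o in outputs], dtype=object)
            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("text", texts)]
                )
            )
        return responses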

zhyncs avatar Aug 21 '24 07:08 zhyncs

I guess @JianyuZhan was trying not to affect the existing code. But we should just try to review carefully instead of opening another code path. The design principle is to not duplicate code. @JianyuZhan Could you try to reuse existing modules as much as possible?

Ying1123 avatar Aug 21 '24 08:08 Ying1123

@zhyncs @Ying1123 Yes, the main goal is to provide a server-free (or offline) Engine, and the API is quite similar to the triton_python_backend you mentioned: https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/serve/turbomind/triton_python_backend/model.py#L113. You can refer to srt/serving/engine.py.

The srt/server*.py files were moved, not copied, to srt/serving/server*.py; I rebased onto the upstream branch and forgot to delete the old ones.

This Engine abstraction naturally leads to separating the Engine (and its arguments) from the original Server and ServerArgs. Specifically, the Server is now built on top of the Engine. See the new launch_server():

def launch_server(
    server_args: ServerArgs,
    pipe_finish_writer: Optional[mp.connection.Connection] = None,
):
    """Launch an HTTP server."""

    logging.basicConfig(
        level=getattr(logging, server_args.log_level.upper()),
        format="%(message)s",
    )

    global engine
    engine = Engine.from_engine_args(server_args.engine_args)

    if server_args.chat_template:
        load_chat_template_for_openai_api(
            engine.tokenizer_manager, server_args.chat_template
        )

    if server_args.file_storage_pth:
        global file_storage_pth
        file_storage_pth = server_args.file_storage_pth

    # Add api key authorization
    if server_args.api_key:
        add_api_key_middleware(app, server_args.api_key)

    # Send a warmup request
    t = threading.Thread(
        target=_wait_and_warmup, args=(server_args, pipe_finish_writer, os.getpid())
    )
    t.start()

    try:
        # Listen for requests
        uvicorn.run(
            app,
            host=server_args.host,
            port=server_args.port,
            log_level=server_args.log_level_http or server_args.log_level,
            timeout_keep_alive=5,
            loop="uvloop",
        )
    finally:
        t.join()

Now the server only cares about server-related stuff, all other details are now inside Engine.

Along this route, I found that making the internal components Engine-centric is more intuitive. Thus I pushed the second commit, "get rid of server args," which removes the Server-centric API from the internal components and turns out to give a much cleaner API. This second commit is the main reason I labeled this PR as an RFC, and I bring it up here for discussion. Of course, we could still retain the server-centric abstraction.

JianyuZhan avatar Aug 21 '24 08:08 JianyuZhan

@Ying1123 , I did try to reuse most of the current code. The addition/deletion line comparison looks daunting mostly because I forgot to remove the old server*.py files; I have now re-pushed the new code. Most of the change is just s/server_args/*config/g to get rid of the old server_args. I think most of the code logic is intact and only the interfaces change. Hopefully I have clarified my intention in this PR; if you prefer to retain server_args, I can rework another version.

JianyuZhan avatar Aug 21 '24 09:08 JianyuZhan

@JianyuZhan Hi, I tried to use your repo for testing. I cloned the code on the main branch and ran

pip install -e "python[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu118/torch2.4/

but when I run the following test

from sglang import LLM, SamplingParams

it says

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[1], line 1
----> 1 from sglang import LLM, SamplingParams

ImportError: cannot import name 'LLM' from 'sglang' (unknown location)

How can I fix it? Looking forward to your help!

DragonFive avatar Aug 29 '24 05:08 DragonFive

@DragonFive , add "LLM" and "SamplingParams" to __all__ in sglang/__init__.py; they are currently imported in that __init__.py but not exposed. Alternatively, you can try from sglang.api import LLM, SamplingParams. The code lags behind upstream, and I will rebase and re-push later.
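
The suggested change would look roughly like this (a sketch; the actual contents of sglang/__init__.py differ):

# sglang/__init__.py (sketch of the suggested change)
from sglang.api import LLM, SamplingParams  # already imported here in the PR

__all__ = [
    # ... existing exported names ...
    "LLM",
    "SamplingParams",
]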

JianyuZhan avatar Aug 29 '24 05:08 JianyuZhan

@DragonFive , add "LLM" and "SamplingParams" to __all__ in sglang/__init__.py; they are currently imported in that __init__.py but not exposed. Alternatively, you can try from sglang.api import LLM, SamplingParams. The code lags behind upstream, and I will rebase and re-push later.

It works fine for me, thanks for your contribution!

DragonFive avatar Aug 29 '24 07:08 DragonFive

Running into the following issues (surfaced via tp_worker.py) when trying to query Llama 3.1 405B FP8 on an 8xH100 while setting tensor_parallel_size=8.
Note: requests to Llama 3.1 8B Instruct are successful (i.e. with tp=1, everything runs as intended)

llm = LLM(model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8)
prompts = ["Hello, my name is"]
res = llm.generate(prompts)
[18:19:55 TP0] Decode batch. #running-req: 1, #token: 85, token usage: 0.00, gen throughput (token/s): 32.43, #queue-req: 0
[18:19:57 TP0] Decode batch. #running-req: 1, #token: 125, token usage: 0.00, gen throughput (token/s): 32.32, #queue-req: 0
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[18:19:57 TP2] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:9425: Connection reset by peer

Process Process-1:2:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:9425: Connection reset by peer
[18:19:57 TP1] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:59799: Connection reset by peer

[18:19:57 TP4] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:43139

Process Process-1:1:
Process Process-1:4:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
Traceback (most recent call last):
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:59799: Connection reset by peer
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:43139
[18:19:57 TP6] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:37813

Process Process-1:6:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:37813
[18:19:57 TP3] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:52914

Process Process-1:3:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:52914
[18:19:57 TP5] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:63369

Process Process-1:5:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:63369
>>> [18:19:57 TP7] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:56545

Process Process-1:7:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:56545

jischein avatar Aug 29 '24 18:08 jischein

@jischein , thanks for testing. I don't have a multi-GPU environment to test on. From my analysis, your error looks like the TP processes are not terminated properly in Engine::shutdown(). I tried to fix it; would you mind testing the new code I just pushed?

JianyuZhan avatar Aug 30 '24 03:08 JianyuZhan

This PR raises "AttributeError: 'Engine' object has no attribute 'tp_procs'" when doing inference with one GPU; self.tp_procs = None needs to be added in Engine.startup, roughly as sketched below.
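
A rough sketch of that fix, assuming Engine.startup/shutdown have roughly this shape (the tp_size handling here is illustrative):

import multiprocessing as mp
from typing import List, Optional


class Engine:
    tp_procs: Optional[List[mp.Process]] = None

    def startup(self, tp_size: int = 1) -> None:
        # With tp_size == 1 no extra tensor-parallel worker processes are
        # spawned, so set tp_procs to None explicitly instead of leaving it
        # unassigned; this avoids the AttributeError above.
        self.tp_procs = None
        if tp_size > 1:
            # Would be populated with the spawned TP worker processes.
            self.tp_procs = []

    def shutdown(self) -> None:
        # Guard against the single-GPU case where tp_procs is None.
        for proc in self.tp_procs or []:
            proc.terminate()
            proc.join()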

feifei-111 avatar Aug 30 '24 06:08 feifei-111

@JianyuZhan unfortunately still running into errors after cleaning up the typos

llm = LLM(model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8)
>>> prompts = ["Hi my name is"]
>>> res=llm.generate(prompts)
[17:53:56 TP0] Prefill batch. #new-seq: 1, #new-token: 5, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
[17:53:57 TP0] Decode batch. #running-req: 1, #token: 45, token usage: 0.00, gen throughput (token/s): 0.49, #queue-req: 0
[17:53:58 TP0] Decode batch. #running-req: 1, #token: 85, token usage: 0.00, gen throughput (token/s): 32.50, #queue-req: 0
[17:54:00 TP0] Decode batch. #running-req: 1, #token: 125, token usage: 0.00, gen throughput (token/s): 32.47, #queue-req: 0
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[17:54:00 TP1] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:51919: Connection reset by peer

Process Process-1:1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:51919: Connection reset by peer
[17:54:00 TP2] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:2179: Connection reset by peer

jischein avatar Sep 01 '24 17:09 jischein

https://github.com/JianyuZhan/sglang/pull/1 — @JianyuZhan this compiles / addresses the typo

jischein avatar Sep 01 '24 18:09 jischein

JianyuZhan#1 — @JianyuZhan this compiles / addresses the typo

Changing all occurrences of 'model_overide_args' to 'model_override_args' in the repo makes it work well.

DragonFive avatar Sep 02 '24 01:09 DragonFive

@JianyuZhan It ran well before I upgraded sglang to v0.3.0 with llama3.1-8b; after that I encounter a confusing error:

10:46:19.553 [10:46:19 TP0] Exception in ControllerSingle:
10:46:19.553 Traceback (most recent call last):
10:46:19.553   File "/github_sglang/python/sglang/srt/managers/controller_single.py", line 157, in start_controller_process
10:46:19.553     controller.loop_for_forward()
10:46:19.553   File "/github_sglang/python/sglang/srt/managers/controller_single.py", line 98, in loop_for_forward
10:46:19.553     out_pyobjs = self.tp_server.exposed_step(recv_reqs)
10:46:19.553   File "/github_sglang/python/sglang/srt/managers/tp_worker.py", line 243, in exposed_step
10:46:19.553     self.forward_step()
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
10:46:19.553     return func(*args, **kwargs)
10:46:19.553   File "/github_sglang/python/sglang/srt/managers/tp_worker.py", line 259, in forward_step
10:46:19.553     self.forward_prefill_batch(new_batch)
10:46:19.553   File "/github_sglang/python/sglang/srt/managers/tp_worker.py", line 506, in forward_prefill_batch
10:46:19.553     sample_output, logits_output = self.model_runner.forward(
10:46:19.553   File "/github_sglang/python/sglang/srt/model_executor/model_runner.py", line 591, in forward
10:46:19.553     return self.forward_extend(batch)
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
10:46:19.553     return func(*args, **kwargs)
10:46:19.553   File "/github_sglang/python/sglang/srt/model_executor/model_runner.py", line 555, in forward_extend
10:46:19.553     return self.model.forward(
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
10:46:19.553     return func(*args, **kwargs)
10:46:19.553   File "/github_sglang/python/sglang/srt/models/llama.py", line 317, in forward
10:46:19.553     hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.553     return self._call_impl(*args, **kwargs)
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.553     return forward_call(*args, **kwargs)
10:46:19.553   File "/github_sglang/python/sglang/srt/models/llama.py", line 282, in forward
10:46:19.553     hidden_states, residual = layer(
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.553     return self._call_impl(*args, **kwargs)
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.553     return forward_call(*args, **kwargs)
10:46:19.553   File "/github_sglang/python/sglang/srt/models/llama.py", line 232, in forward
10:46:19.553     hidden_states = self.self_attn(
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.553     return self._call_impl(*args, **kwargs)
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.553     return forward_call(*args, **kwargs)
10:46:19.554   File "/github_sglang/python/sglang/srt/models/llama.py", line 168, in forward
10:46:19.554     q, k = self.rotary_emb(positions, q, k)
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.554     return self._call_impl(*args, **kwargs)
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.554     return forward_call(*args, **kwargs)
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/custom_op.py", line 14, in forward
10:46:19.554     return self._forward_method(*args, **kwargs)
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/layers/rotary_embedding.py", line 216, in forward_cuda
10:46:19.554     ops.rotary_embedding(positions, query, key, self.head_size,
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/vllm/_custom_ops.py", line 37, in wrapper
10:46:19.554     raise e
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/vllm/_custom_ops.py", line 28, in wrapper
10:46:19.554     return fn(*args, **kwargs)
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/vllm/_custom_ops.py", line 138, in rotary_embedding
10:46:19.554     torch.ops._C.rotary_embedding(positions, query, key, head_size,
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/torch/_ops.py", line 1170, in __getattr__
10:46:19.554     raise AttributeError(
10:46:19.554 AttributeError: '_OpNamespace' '_C' object has no attribute 'rotary_embedding'

DragonFive avatar Sep 06 '24 02:09 DragonFive

@DragonFive I think it is because the vllm dependency was upgraded; you should update your local installation dependencies as well: pip install --upgrade pip, then pip install -e "python[all]". Check the install section in the README.

JianyuZhan avatar Sep 06 '24 03:09 JianyuZhan

Hi, thank you for this PR. I'm looking forward to trying it out. I'm wondering if there is a plan to support asynchronous operations, similar to vLLM's AsyncLLMEngine.

yangky11 avatar Sep 07 '24 14:09 yangky11

Hi, thank you for this PR. I'm looking forward to trying it out. I'm wondering if there is a plan to support asynchronous operations, similar to vLLM's AsyncLLMEngine.

Hi @yangky11 Maybe you can try this https://github.com/sgl-project/sglang/blob/05bea6883c4b3f2fb7f01287cd8dccefeacd545f/python/sglang/srt/server.py#L562

zhyncs avatar Sep 08 '24 20:09 zhyncs

@JianyuZhan @zhyncs Is this close to being merged? Would love to start using it.

jischein avatar Sep 13 '24 20:09 jischein

moved to #1567

merrymercy avatar Oct 06 '24 01:10 merrymercy

Although this PR was closed, we still appreciate @JianyuZhan's contribution. Thanks!

zhyncs avatar Oct 06 '24 01:10 zhyncs