[RFC] Add an LLM engine
Motivation
This is not complete work, just a PoC and a request for comments.
This PR adds an LLM engine, addressing the roadmap item "Add APIs for using the inference engine in a single script without launching a separate server."
The demo usage is in examples/usage/llm_engine.py:
from sglang import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The capital of China is",
    "What is the meaning of life?",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="deepseek-ai/deepseek-llm-7b-chat")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for prompt, output in zip(prompts, outputs):
    print('===============================')
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
Modification
It adds:
- A core `Engine`, which wraps the core logic of what the current `server.launch_server()` does, with additional shutdown logic to gracefully bring down the ZMQ sockets in the `TokenizationManager` when finishing the job.
- The class `SamplingParams` is now exposed as an API. (`GenerateReqInput` carries a dict version of `SamplingParams` while the internal logic uses the class version, so exposing the class as an API requires a circuitous class-to-dict-to-class transform that needs to be fixed later; see the sketch after this list.)
- Along the way, I also add `EngineArgs` and make the new `ServerArgs` a thin wrapper around it (see `sglang/srt/serving/engine_args.py` and `sglang/srt/serving/server_args.py` in the commit), plus some config objects built from `EngineArgs`, like `ModelConfig`, `ScheduleConfig`, `ParallelConfig`, etc., mimicking vLLM. This opens up an opportunity to clean up the internal passing of `ServerArgs` around many functions and to draw cleaner APIs for the different sub-components. I have not made this modification yet (these files are added but have no effect on the current server code logic); it is quite intrusive to the current code base, which is why I opened this PR as an RFC.
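For illustration only, here is a minimal sketch of that class-to-dict-to-class round-trip, using simplified stand-in dataclasses (and a hypothetical `submit` function) rather than the PR's actual definitions:

from dataclasses import asdict, dataclass

# Simplified stand-ins for the real SamplingParams / GenerateReqInput.
@dataclass
class SamplingParams:
    temperature: float = 1.0
    top_p: float = 1.0

@dataclass
class GenerateReqInput:
    text: str
    sampling_params: dict  # the request object carries a dict, not the class

def submit(prompt: str, params: SamplingParams) -> GenerateReqInput:
    # Class -> dict at the API boundary...
    req = GenerateReqInput(text=prompt, sampling_params=asdict(params))
    # ...and the internal logic later rebuilds the class from the dict,
    # which is the circuitous transform mentioned above.
    rebuilt = SamplingParams(**req.sampling_params)
    assert rebuilt == params
    return req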
Checklist
- [ ] Before submitting a PR for review, make sure it has passed verification in your local development environment at least.
- [x] Ensure pre-commit `pre-commit run --all-files` or other linting tools are used to fix potential lint issues.
- [ ] Confirm that modifications are covered by complete unit tests. If not, please add more unit tests for correctness.
- [x] Modify documentation as needed, such as docstrings or example tutorials.
@JianyuZhan Can you fully verify locally before committing now? I'm currently troubleshooting CI issues, which will be affected.
Hi @Ying1123, @zhyncs, this PR is now complete and has passed all CI tests (the previous e2e-test failure was due to a missing PYTHONPATH setting, so I added one in e2e-test.yaml and the test passed). It is kept rebased onto the upstream/main branch and is ready for review.
This PR makes modifications as below:
1. Added an `Engine` and `EngineArgs`, in parallel with `Server` and `ServerArgs` (all in `sglang/srt/serving/`). `ServerArgs` is now a thin wrapper on `EngineArgs`, adding server-specific args such as host, port, API key, and the OpenAI-API-related settings. `ServerArgs` transparently passes all args belonging to `EngineArgs` through to it, so users of `ServerArgs` remain `EngineArgs`-agnostic (see the sketch after this list). And `Server` is built on `Engine`, resulting in a succinct `launch_server` API and implementation.
2. Based on 1, we now have two serving methods: the old server API, and the Engine API without running a server (see `examples/usage/llm_engine.py`). Thus I put them in a standalone folder `sglang/srt/serving/`.
3. Along the way, we can now get rid of `ServerArgs` in the internal APIs (`managers/`, `model_executor/`, etc.). Instead we incorporate `ModelConfig`, `ScheduleConfig`, `ParallelConfig`, `OptimizationConfig`, and `ObservabilityConfig` (all created from `EngineArgs` and built at `Engine` creation time). Those internal APIs now communicate with each other using these `*Config` objects, which results in quite coherent interfaces in terms of component abstraction and dependencies.
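To make the thin-wrapper relationship concrete, here is a rough sketch of how `ServerArgs` could forward `EngineArgs` fields. The field names beyond host, port, and api_key, and the forwarding mechanism itself, are assumptions for illustration, not the PR's actual code:

from dataclasses import dataclass
from typing import Optional

@dataclass
class EngineArgs:
    model: str
    tensor_parallel_size: int = 1
    log_level: str = "info"

@dataclass
class ServerArgs:
    engine_args: EngineArgs
    # Server-specific args stay here.
    host: str = "127.0.0.1"
    port: int = 30000
    api_key: Optional[str] = None

    def __getattr__(self, name):
        # Forward any engine-level arg, so callers of ServerArgs stay
        # EngineArgs-agnostic.
        if name == "engine_args":
            raise AttributeError(name)
        return getattr(self.engine_args, name)

# Usage: server-level code reads engine args without knowing about the split.
server_args = ServerArgs(EngineArgs(model="deepseek-ai/deepseek-llm-7b-chat"))
print(server_args.tensor_parallel_size)  # 1, forwarded from EngineArgs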
I have rebased upon the latest upstream/main branch, and it passed all CI tests now.
Your intention is to add an LLM Engine for offline use; could we modify only the necessary parts?
Do we need to redo the abstractions? Can't we just wrap a layer around the original interface and implement the LLM Engine instead?
One scenario where this API would be particularly useful is integrating it with the Triton Server Python Backend, similar to what's done here: https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/serve/turbomind/triton_python_backend/model.py. In other situations, using the OpenAI server could be a better option. In the example above, the way the pipeline is encapsulated and the generate interface is called is quite appropriate; perhaps we can use it as a reference?
I guess @JianyuZhan was trying to not affect the existing code. But we should just try to review carefully instead of open another code path. The design principle is to not duplicate the code. @JianyuZhan Could you try to reuse existing modules as much as possible?
@zhyncs @Ying1123 Yes, the main goal is to provide a server-free (or offline) Engine, and the API is quite similar to the triton_python_backend you mentioned: https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/serve/turbomind/triton_python_backend/model.py#L113. You can refer to srt/serving/engine.py.
The srt/server*.py files were moved, not copied, to srt/serving/server*.py; I rebased onto the upstream branch and forgot to delete the old ones.
This Engine abstraction naturally leads to separating the Engine (and its arguments) from the original Server and ServerArgs. Specifically, the Server is now built on top of the Engine. See the new launch_server():
def launch_server(
    server_args: ServerArgs,
    pipe_finish_writer: Optional[mp.connection.Connection] = None,
):
    """Launch an HTTP server."""
    logging.basicConfig(
        level=getattr(logging, server_args.log_level.upper()),
        format="%(message)s",
    )

    global engine
    engine = Engine.from_engine_args(server_args.engine_args)

    if server_args.chat_template:
        load_chat_template_for_openai_api(
            engine.tokenizer_manager, server_args.chat_template
        )

    if server_args.file_storage_pth:
        global file_storage_pth
        file_storage_pth = server_args.file_storage_pth

    # Add api key authorization
    if server_args.api_key:
        add_api_key_middleware(app, server_args.api_key)

    # Send a warmup request
    t = threading.Thread(
        target=_wait_and_warmup, args=(server_args, pipe_finish_writer, os.getpid())
    )
    t.start()

    try:
        # Listen for requests
        uvicorn.run(
            app,
            host=server_args.host,
            port=server_args.port,
            log_level=server_args.log_level_http or server_args.log_level,
            timeout_keep_alive=5,
            loop="uvloop",
        )
    finally:
        t.join()
Now the server only cares about server-related concerns; all other details live inside the Engine.
Along this route, I found that making the internal components Engine-centric is more intuitive. Thus I pushed the second commit, "get rid of server args," which removes the Server-centric API from the internal components, and it turns out to be a much cleaner API. This second commit is the main reason I label this PR an RFC, and I bring it up for discussion. Of course, we could still retain the server-centric abstraction.
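For discussion purposes, here is a rough sketch of the Engine lifecycle implied by this thread. Only from_engine_args, startup, shutdown, tokenizer_manager, and tp_procs are named in the discussion; the method bodies and everything else are assumptions, not the actual implementation:

class Engine:
    def __init__(self, engine_args):
        self.engine_args = engine_args
        self.tokenizer_manager = None
        self.tp_procs = None  # TP worker processes, only used when tp_size > 1

    @classmethod
    def from_engine_args(cls, engine_args):
        engine = cls(engine_args)
        engine.startup()
        return engine

    def startup(self):
        # Build the *Config objects (ModelConfig, ScheduleConfig, ...) from
        # engine_args, start the tokenizer manager and, for tp_size > 1,
        # launch the TP worker processes.
        ...

    def shutdown(self):
        # Gracefully close the ZMQ sockets in the tokenizer manager and
        # terminate any TP worker processes so nothing leaks at exit.
        if self.tp_procs:
            for proc in self.tp_procs:
                proc.terminate()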
@Ying1123, I did try to reuse most of the current code. The addition/deletion line comparison looks daunting mostly because I forgot to remove the old server*.py files; I have now re-pushed the new code. Most of the change is just s/server_args/*config/g to get rid of the old server_args. I think most of the code logic is intact and only the interfaces change. Hopefully I have clarified my intention in this PR; if you prefer to retain the server_args, I can rework another version.
@JianyuZhan Hi, I tried to use your repo for testing. I cloned the code on the main branch and ran
pip install -e "python[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu118/torch2.4/
but when I run the following test
from sglang import LLM, SamplingParams
it says
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
Cell In[1], line 1
----> 1 from sglang import LLM, SamplingParams
ImportError: cannot import name 'LLM' from 'sglang' (unknown location)
How can I fix it? Looking forward to your help!
@DragonFive, add "LLM" and "SamplingParams" to sglang/__init__.py::__all__; they are currently imported in that __init__.py but not exposed. Alternatively, you can try from sglang.api import LLM, SamplingParams. The code lags behind upstream; I will rebase and re-push later.
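In other words, the workaround is roughly the following sketch (the rest of sglang/__init__.py is omitted here):

# sglang/__init__.py
from sglang.api import LLM, SamplingParams  # already imported in this file

__all__ = [
    # ...existing exports...
    "LLM",
    "SamplingParams",
]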
It works fine for me, thanks for your contribution!
Running into the following issues (surfaced via tp_worker.py) when trying to query Llama 3.1 405B FP8 on an 8xH100 while setting tensor_parallel_size=8.
Note: requests to Llama 3.1 8B Instruct are successful (i.e. with tp=1, everything runs as intended)
llm = LLM(model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8)
prompts = ["Hello, my name is"]
res = llm.generate(prompts)
[18:19:55 TP0] Decode batch. #running-req: 1, #token: 85, token usage: 0.00, gen throughput (token/s): 32.43, #queue-req: 0
[18:19:57 TP0] Decode batch. #running-req: 1, #token: 125, token usage: 0.00, gen throughput (token/s): 32.32, #queue-req: 0
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
[18:19:57 TP2] Exception in run_tp_server:
Traceback (most recent call last):
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
dist.broadcast(tensor_size, src=0, group=dist_group)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:9425: Connection reset by peer
Process Process-1:2:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
dist.broadcast(tensor_size, src=0, group=dist_group)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:9425: Connection reset by peer
[18:19:57 TP1] Exception in run_tp_server:
Traceback (most recent call last):
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
dist.broadcast(tensor_size, src=0, group=dist_group)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:59799: Connection reset by peer
[18:19:57 TP4] Exception in run_tp_server:
Traceback (most recent call last):
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
dist.broadcast(tensor_size, src=0, group=dist_group)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:43139
Process Process-1:1:
Process Process-1:4:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
dist.broadcast(tensor_size, src=0, group=dist_group)
Traceback (most recent call last):
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:59799: Connection reset by peer
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
dist.broadcast(tensor_size, src=0, group=dist_group)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:43139
[18:19:57 TP6] Exception in run_tp_server:
Traceback (most recent call last):
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
dist.broadcast(tensor_size, src=0, group=dist_group)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:37813
Process Process-1:6:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
dist.broadcast(tensor_size, src=0, group=dist_group)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:37813
[18:19:57 TP3] Exception in run_tp_server:
Traceback (most recent call last):
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
dist.broadcast(tensor_size, src=0, group=dist_group)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:52914
Process Process-1:3:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
dist.broadcast(tensor_size, src=0, group=dist_group)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:52914
[18:19:57 TP5] Exception in run_tp_server:
Traceback (most recent call last):
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
dist.broadcast(tensor_size, src=0, group=dist_group)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:63369
Process Process-1:5:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
dist.broadcast(tensor_size, src=0, group=dist_group)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:63369
>>> [18:19:57 TP7] Exception in run_tp_server:
Traceback (most recent call last):
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
dist.broadcast(tensor_size, src=0, group=dist_group)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:56545
Process Process-1:7:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
dist.broadcast(tensor_size, src=0, group=dist_group)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:56545
@jischein, thanks for testing. I don't have a multi-GPU environment to test on. From my analysis, your error looks like the TP procs are not terminated properly in Engine::shutdown(). I tried to fix it; would you mind testing the new code I just pushed?
This PR raises "AttributeError: 'Engine' object has no attribute 'tp_procs'" when doing inference with one GPU; you need to add self.tp_procs = None in Engine.startup.
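A hedged sketch of that fix (the _launch_tp_workers helper is hypothetical, and the real Engine.startup will differ):

def startup(self):
    # Always define tp_procs so shutdown() can check it on single-GPU runs too.
    self.tp_procs = None
    if self.engine_args.tensor_parallel_size > 1:
        self.tp_procs = self._launch_tp_workers()  # hypothetical helper
    ...

def shutdown(self):
    if self.tp_procs:
        for proc in self.tp_procs:
            proc.terminate()
            proc.join()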
@JianyuZhan unfortunately still running into errors after cleaning up the typos
llm = LLM(model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8)
>>> prompts = ["Hi my name is"]
>>> res=llm.generate(prompts)
[17:53:56 TP0] Prefill batch. #new-seq: 1, #new-token: 5, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
[17:53:57 TP0] Decode batch. #running-req: 1, #token: 45, token usage: 0.00, gen throughput (token/s): 0.49, #queue-req: 0
[17:53:58 TP0] Decode batch. #running-req: 1, #token: 85, token usage: 0.00, gen throughput (token/s): 32.50, #queue-req: 0
[17:54:00 TP0] Decode batch. #running-req: 1, #token: 125, token usage: 0.00, gen throughput (token/s): 32.47, #queue-req: 0
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
[17:54:00 TP1] Exception in run_tp_server:
Traceback (most recent call last):
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
dist.broadcast(tensor_size, src=0, group=dist_group)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:51919: Connection reset by peer
Process Process-1:1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
dist.broadcast(tensor_size, src=0, group=dist_group)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:51919: Connection reset by peer
[17:54:00 TP2] Exception in run_tp_server:
Traceback (most recent call last):
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
dist.broadcast(tensor_size, src=0, group=dist_group)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:2179: Connection reset by peer
https://github.com/JianyuZhan/sglang/pull/1 — @JianyuZhan this compiles / addresses the typo
Changing all occurrences of 'model_overide_args' to 'model_override_args' in the repo makes it work.
@JianyuZhan It ran well on llama3.1-8b before I upgraded sglang to v0.3.0; after that I encounter a confusing error:
10:46:19.553 [10:46:19 TP0] Exception in ControllerSingle:
10:46:19.553 Traceback (most recent call last):
10:46:19.553 File "/github_sglang/python/sglang/srt/managers/controller_single.py", line 157, in start_controller_process
10:46:19.553 controller.loop_for_forward()
10:46:19.553 File "/github_sglang/python/sglang/srt/managers/controller_single.py", line 98, in loop_for_forward
10:46:19.553 out_pyobjs = self.tp_server.exposed_step(recv_reqs)
10:46:19.553 File "/github_sglang/python/sglang/srt/managers/tp_worker.py", line 243, in exposed_step
10:46:19.553 self.forward_step()
10:46:19.553 File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
10:46:19.553 return func(*args, **kwargs)
10:46:19.553 File "/github_sglang/python/sglang/srt/managers/tp_worker.py", line 259, in forward_step
10:46:19.553 self.forward_prefill_batch(new_batch)
10:46:19.553 File "/github_sglang/python/sglang/srt/managers/tp_worker.py", line 506, in forward_prefill_batch
10:46:19.553 sample_output, logits_output = self.model_runner.forward(
10:46:19.553 File "/github_sglang/python/sglang/srt/model_executor/model_runner.py", line 591, in forward
10:46:19.553 return self.forward_extend(batch)
10:46:19.553 File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
10:46:19.553 return func(*args, **kwargs)
10:46:19.553 File "/github_sglang/python/sglang/srt/model_executor/model_runner.py", line 555, in forward_extend
10:46:19.553 return self.model.forward(
10:46:19.553 File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
10:46:19.553 return func(*args, **kwargs)
10:46:19.553 File "/github_sglang/python/sglang/srt/models/llama.py", line 317, in forward
10:46:19.553 hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
10:46:19.553 File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.553 return self._call_impl(*args, **kwargs)
10:46:19.553 File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.553 return forward_call(*args, **kwargs)
10:46:19.553 File "/github_sglang/python/sglang/srt/models/llama.py", line 282, in forward
10:46:19.553 hidden_states, residual = layer(
10:46:19.553 File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.553 return self._call_impl(*args, **kwargs)
10:46:19.553 File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.553 return forward_call(*args, **kwargs)
10:46:19.553 File "/github_sglang/python/sglang/srt/models/llama.py", line 232, in forward
10:46:19.553 hidden_states = self.self_attn(
10:46:19.553 File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.553 return self._call_impl(*args, **kwargs)
10:46:19.553 File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.553 return forward_call(*args, **kwargs)
10:46:19.554 File "/github_sglang/python/sglang/srt/models/llama.py", line 168, in forward
10:46:19.554 q, k = self.rotary_emb(positions, q, k)
10:46:19.554 File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.554 return self._call_impl(*args, **kwargs)
10:46:19.554 File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.554 return forward_call(*args, **kwargs)
10:46:19.554 File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/custom_op.py", line 14, in forward
10:46:19.554 return self._forward_method(*args, **kwargs)
10:46:19.554 File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/layers/rotary_embedding.py", line 216, in forward_cuda
10:46:19.554 ops.rotary_embedding(positions, query, key, self.head_size,
10:46:19.554 File "/usr/local/lib/python3.9/site-packages/vllm/_custom_ops.py", line 37, in wrapper
10:46:19.554 raise e
10:46:19.554 File "/usr/local/lib/python3.9/site-packages/vllm/_custom_ops.py", line 28, in wrapper
10:46:19.554 return fn(*args, **kwargs)
10:46:19.554 File "/usr/local/lib/python3.9/site-packages/vllm/_custom_ops.py", line 138, in rotary_embedding
10:46:19.554 torch.ops._C.rotary_embedding(positions, query, key, head_size,
10:46:19.554 File "/usr/local/lib/python3.9/site-packages/torch/_ops.py", line 1170, in __getattr__
10:46:19.554 raise AttributeError(
10:46:19.554 AttributeError: '_OpNamespace' '_C' object has no attribute 'rotary_embedding'
@DragonFive I think it is because the vllm dependency was upgraded; you should update your local installation as well: pip install --upgrade pip, then pip install -e "python[all]". Check the install section in the README.
Hi, thank you for this PR. I'm looking forward to trying it out. I'm wondering if there is a plan to support asynchronous operations, similar to vLLM's AsyncLLMEngine.
Hi @yangky11, maybe you can try this: https://github.com/sgl-project/sglang/blob/05bea6883c4b3f2fb7f01287cd8dccefeacd545f/python/sglang/srt/server.py#L562
@JianyuZhan @zhyncs is this close to being merged? Would love to start using
moved to #1567
Although this PR was closed, we still appreciate @JianyuZhan 's contribution. Thanks!