[Bug] [SECURITY] Critical security incidents of SGLang

Open avilum opened this issue 8 months ago • 19 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • [x] 5. Please use English, otherwise it will be closed.

Describe the bug

Hey, I have reported a vulnerability to the maintainers. I asked for contact details in the issue more than 2 months ago and followed your official instructions. Communication over email since then has been poor: I shared the report and got confirmation that you were looking into it, but have not received any response since.

Now that SGLang is part of the PyTorch family, who is in charge of this project's security and of handling such reports?

avilum avatar Apr 20 '25 09:04 avilum

@avilum Could you submit a PR to fix it? Thanks!

zhyncs avatar Apr 21 '25 00:04 zhyncs

@zhyncs @zhaochenyang20 @Ying1123 @yings-db I was under the impression you have been fixing it for the past 2 months. You answered my email clarifying you are looking into it. I am very surprised now.

It is uncommon for researchers to write the security patches for vulnerabilities they report themselves. This vulnerability is on X.AI and YOU to solve, not me. It is your responsibility to solve it, not the users'. You are currently placing your users at risk instead of responsibly fixing it like every organization or project does. (Do you know what a CVE means? Did you work on products prior to SGLang?)

I would love to help with the fix and guidelines, but that's everyone's responsibility, not mine. You don't seem to understand the impact of the vulnerability or how responsible vulnerability disclosure works. https://googleprojectzero.blogspot.com/p/vulnerability-disclosure-policy.html

Who is in charge of this project's security? Have you ever fixed a security vulnerability, or do you know what a vulnerability means?

SGLang users are currently at risk of remote code execution (so is X.AI).

avilum avatar Apr 21 '25 10:04 avilum

I have forwarded it to @adarshxs

zhaochenyang20 avatar Apr 21 '25 19:04 zhaochenyang20

Several critical vulnerabilities and around a dozen high-severity vulnerabilities were detected by the Aqua scanner in our production environment. At the request of our security team, we continuously remediate vulnerabilities on our own. However, the delay in updating images in production can be up to 2 months.

Some of the vulnerabilities are related to the ssh server included in the image and the version of Pillow. Here’s how to fix them:

FROM lmsysorg/sglang:v0.4.5.post3-cu125

# Build dependencies for Pillow (JPEG and zlib support)
RUN apt-get update && \
    apt-get install -y libjpeg-dev zlib1g-dev && \
    rm -rf /var/lib/apt/lists/*

RUN pip install sglang-router

# Bump the pinned Pillow version in the bundled HPC-X requirements file
RUN sed -i -e 's/Pillow==8.3.1/Pillow==11.2.1/g' /opt/hpcx/clusterkit/bin/output/requirements.txt

# Remove the SSH host keys baked into the image
RUN rm /etc/ssh/ssh_host_ecdsa_key \
    /etc/ssh/ssh_host_ed25519_key \
    /etc/ssh/ssh_host_rsa_key \
    /etc/ssh/ssh_host_ecdsa_key.pub \
    /etc/ssh/ssh_host_ed25519_key.pub \
    /etc/ssh/ssh_host_rsa_key.pub

Swipe4057 avatar Apr 22 '25 18:04 Swipe4057

Also, I’d like to remind you that using Docker images in a production environment with critical and high-severity vulnerabilities is STRICTLY PROHIBITED! Because of this, we constantly have to fix sglang images ourselves, followed by a lengthy approval process with the information security department, which can take 1-2 months. (And by that time, the sglang version becomes outdated, yes.)

Swipe4057 avatar Apr 22 '25 18:04 Swipe4057

@adarshxs Adarsh is on this. Thanks!

zhaochenyang20 avatar Apr 22 '25 23:04 zhaochenyang20

I will be submitting the relevant fixes. Thank you.

adarshxs avatar Apr 23 '25 02:04 adarshxs

@zhyncs @adarshxs I've conducted numerous security scans on the sglang Docker images and found that many of the vulnerabilities originate from the base Docker image nvcr.io/nvidia/tritonserver:24.04-py3-min, which is used during the build (it has 69 vulnerabilities with known exploits). This is the relevant line of code: https://github.com/sgl-project/sglang/blob/b5be56944b6eb61b44866011f157e8df0e563bd7/docker/Dockerfile#L3 That base image is severely outdated.

The newer Docker image nvcr.io/nvidia/tritonserver:25.03-py3-min has only 17 vulnerabilities, 9 of which can be fixed by upgrading Pillow to version 11.2.1. Therefore, building the sglang Docker image on this base would significantly improve security by changing just one line of code! Could you advise if there are any blockers for this?

Swipe4057 avatar Apr 24 '25 19:04 Swipe4057

@Swipe4057 Thank you for your evaluations. Please feel free to open PRs to support these new images. We will review the same. The current vulnerability @avilum mentions is related to another aspect in the codebase that we hope to fix soon!

adarshxs avatar Apr 24 '25 20:04 adarshxs

@adarshxs @zhaochenyang20 I’ve made a PR, please take a look. It fixes a large number of vulnerabilities in the image, but I don’t have permissions to run CI. Here’s the link: https://github.com/sgl-project/sglang/pull/5744

Swipe4057 avatar Apr 25 '25 10:04 Swipe4057

I am running it. Thanks! @Swipe4057

zhaochenyang20 avatar Apr 25 '25 16:04 zhaochenyang20

Several critical vulnerabilities and around a dozen high-severity vulnerabilities were detected by the Aqua scanner in our production environment. At the request of our security team, we continuously remediate vulnerabilities on our own. However, the delay in updating images in production can be up to 2 months.

Some of the vulnerabilities are related to the ssh server included in the image and the version of Pillow. Here’s how to fix them:

FROM lmsysorg/sglang:v0.4.5.post3-cu125

RUN apt-get update && apt-get install -y libjpeg-dev zlib1g-dev && rm -rf /var/lib/apt/lists/*

RUN pip install sglang-router

RUN sed -i -e 's/Pillow==8.3.1/Pillow==11.2.1/g' /opt/hpcx/clusterkit/bin/output/requirements.txt

RUN rm /etc/ssh/ssh_host_ecdsa_key /etc/ssh/ssh_host_ed25519_key /etc/ssh/ssh_host_rsa_key /etc/ssh/ssh_host_ecdsa_key.pub /etc/ssh/ssh_host_ed25519_key.pub /etc/ssh/ssh_host_rsa_key.pub

I'm not sure if the issue you raised is the same as the one mentioned in the original post. If you're uncertain whether you're referring to the same matter, I don't think it should be discussed under this topic.

junliu-mde avatar Apr 25 '25 20:04 junliu-mde

Hey all, To avoid public disclosure of the vuln I have opened a GHSA ticket: https://github.com/sgl-project/sglang/security/advisories/GHSA-w9wq-8grq-mp55

Please move the discussion there. A fix started at https://github.com/sgl-project/sglang/pull/5752 but has not addressed my report yet.

avioligo avatar Apr 27 '25 12:04 avioligo

Following the merge of PR #5752, I opened another GitHub security advisory. Both are listed here:

  1. Unanswered - https://github.com/sgl-project/sglang/security/advisories/GHSA-w9wq-8grq-mp55 (Pickle issue)
  2. Unanswered - https://github.com/sgl-project/sglang/security/advisories/GHSA-hprm-w67m-xh4w (ZMQ bind address was too permissive)
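For readers unfamiliar with why unpickling data received over a socket is treated as a remote-code-execution risk, here is a minimal Python sketch. It is not SGLang's code: `Payload`, `AllowListUnpickler`, and `safe_loads` are hypothetical names used only for illustration, and `os.getcwd` is a harmless stand-in for whatever callable an attacker would pick. It shows that `pickle.loads` calls the function named by the sender, and that an allow-list unpickler (the "restricting globals" pattern from the Python `pickle` documentation) refuses to do so.

```python
import io
import os
import pickle


class Payload:
    """Hypothetical object whose pickled form tells the loader to call a function."""

    def __reduce__(self):
        # The receiver will call os.getcwd() while unpickling.
        # An attacker would name something like os.system instead.
        return (os.getcwd, ())


class AllowListUnpickler(pickle.Unpickler):
    """Unpickler that rejects every global lookup not explicitly allowed."""

    ALLOWED = set()  # empty: only plain data (dicts, lists, ints, strings) loads

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")


def safe_loads(data: bytes):
    """Load pickled bytes through the allow-list unpickler."""
    return AllowListUnpickler(io.BytesIO(data)).load()


blob = pickle.dumps(Payload())

# Plain pickle.loads runs the sender-chosen callable and returns its result.
result = pickle.loads(blob)

# The allow-list unpickler refuses to resolve the callable at all.
try:
    safe_loads(blob)
    blocked = False
except pickle.UnpicklingError:
    blocked = True

# Plain data needs no global lookups, so it still loads.
plain = safe_loads(pickle.dumps({"tokens": [1, 2, 3]}))
```

An allow-list narrows the attack surface but is easy to get wrong; avoiding `pickle` for untrusted input altogether is the more conservative choice.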

We should issue a CVE for each code change through these advisories. No one has commented on the advisories yet, and I await your response. A reminder that the responsible disclosure window ends tomorrow (May 7th): I reported this 90 days ago and have tried to emphasize the risks to the maintainers and x.ai by all means possible.

@adarshxs @zhaochenyang20 @zhyncs

avilum avatar May 06 '25 08:05 avilum

@adarshxs @junliu-mde @zhaochenyang20 Hey, is there an official response or update on this vulnerability?

You already started fixing some of the issues and merged fixes following my disclosure in https://github.com/sgl-project/sglang/pull/5752

vLLM, TensorRT-LLM, and Meta-Llama fixed them promptly. When should we expect the pickle vulnerability to be fixed in SGLang?

avioligo avatar Jun 19 '25 08:06 avioligo

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

github-actions[bot] avatar Aug 19 '25 00:08 github-actions[bot]

https://www.oligo.security/blog/shadowmq-how-code-reuse-spread-critical-vulnerabilities-across-the-ai-ecosystem @avilum Thanks for the report.

@sundar24295s @adarshxs Any ideas about how to enhance it?

hnyls2002 avatar Nov 21 '25 17:11 hnyls2002

This does not impact isolated clusters yet, but we will have a fix soon using HMAC authentication or a safer deserialization method.
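To make the two options mentioned here concrete, the following is a minimal Python sketch, not the project's actual fix. It assumes both ends of the socket share a secret key distributed out of band, and the names `pack`, `unpack`, and `TAG_SIZE` are hypothetical. Messages are serialized as JSON instead of pickle, and each frame carries an HMAC-SHA256 tag that is verified before any parsing happens.

```python
import hashlib
import hmac
import json

TAG_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def pack(key: bytes, message: dict) -> bytes:
    """Serialize the message as JSON and prepend an HMAC-SHA256 tag."""
    body = json.dumps(message, separators=(",", ":")).encode("utf-8")
    tag = hmac.new(key, body, hashlib.sha256).digest()
    return tag + body


def unpack(key: bytes, frame: bytes) -> dict:
    """Verify the tag first, then parse; raise ValueError on any failure."""
    if len(frame) < TAG_SIZE:
        raise ValueError("frame too short")
    tag, body = frame[:TAG_SIZE], frame[TAG_SIZE:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    # compare_digest avoids leaking the mismatch position through timing.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("bad signature")
    return json.loads(body.decode("utf-8"))
```

A sender would call `pack(key, {"rid": "abc", "ids": [1, 2]})` and write the returned bytes to the socket; the receiver passes the received bytes to `unpack` with the same key. Note that an HMAC tag alone does not stop an attacker from replaying a previously captured valid frame; that would need a nonce or sequence number on top.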

adarshxs avatar Nov 22 '25 09:11 adarshxs

@adarshxs Great, looking forward to the fix

hnyls2002 avatar Nov 22 '25 17:11 hnyls2002