vllm icon indicating copy to clipboard operation
vllm copied to clipboard

Support SSL Key Rotation in HTTP Server

Open youngkent opened this issue 10 months ago • 1 comments

Some production setup requires TLS key/cert rotation in HTTP server. This change use watchfiles to async'ly monitor the updates of ssl key, cert, and CA files, and update the SSLContext when changes are detected.

Test cmd vllm serve /tmp/model -tp 1 --max_num_seqs 32 --ssl-keyfile ~/test_certs/server.key --ssl-certfile ~/test_certs/server.crt --ssl-ca-certs ~/test_certs/rootCA.crt

touch ~/test_certs/server.key

Server output: INFO 02-18 11:40:01 launcher.py:24] Watching files: ['/home/ktong/test_certs/server.key', '/home/ktong/test_certs/server.crt'] INFO 02-18 11:40:01 launcher.py:24] Watching files: ['/home/ktong/test_certs/rootCA.crt'] INFO: Application startup complete. INFO 02-18 11:42:31 launcher.py:28] File change detected: modified - /home/ktong/test_certs/server.key INFO 02-18 11:42:31 launcher.py:57] Reloading SSL certificate chain

youngkent avatar Feb 18 '25 19:02 youngkent

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Feb 18 '25 19:02 github-actions[bot]

@russellb @robertgshaw2-redhat could you please take a look? Unfortunately I have little background on this.

WoosukKwon avatar Feb 19 '25 02:02 WoosukKwon

@russellb @robertgshaw2-redhat could you please take a look? Unfortunately I have little background on this.

Sure.

What are some other examples of services that do dynamic reloading of configuration like this? I would normally expect a configuration rollout to include restart services. Running in an environment like Kubernetes makes this fairly straightforward.

I think it's going to be more complicated to account for all possible cases. What if the files are just deleted? Should the service keep running with what it had previously loaded? Should it exit with an error?

My preference would be to leave it as-is, and leave it to the administrator (or service automation) to decide when the service should restart with new configuration.

russellb avatar Feb 19 '25 16:02 russellb

@russellb For example, a thrift server also supports SSL key/cert rotation: https://github.com/facebook/fbthrift/blob/a3b88c21b4bf382d506922c2d874b21a7c06b821/thrift/lib/cpp2/server/ThriftServer.cpp#L1876-L1881 Restarting a server should work, but it's intrusive. For infrastructure with decoupled key rotation and server rollout, supporting SSL key rotation without requiring restarting is actually needed. For infrastructure that does not do key rotation, the logic here basically have no side effect. Or would you prefer adding a gating flag for the feature?

youngkent avatar Feb 19 '25 17:02 youngkent

Thanks for the example. I've thought about this some more and I'm still not really comfortable with the feature.

Automatic reloading based on files changing strikes me as very surprising behavior. An alternative that some services use is allow you to send them a SIGHUP signal to reload their configuration. This would typically be hidden behind something like systemd, so to an administrator it's systemctl reload <service>.

Whether it was via SIGHUP or not, only dynamically reloading this but not other files seems like surprising behavior. For example, what should happen if a model is updated on disk?

Another reason I'm more on the side of keeping this simple is I don't expect using built-in SSL to be the production SSL endpoint in most cases. When running in Kubernetes, I'd expect a load balancer serving as ingress into the cluster would terminate SSL. In other words, I'd prefer to keep this simple and defer the more complex and dynamic configuration management to systems outside of vllm.

russellb avatar Feb 19 '25 22:02 russellb

I want to clarify one more thing. You should not interpret my comments as a rejection of the PR! I'm not a maintainer and don't have that authority. I'm just stating my gut reaction to the feature and certainly don't mind if the consensus after maintainer review goes toward accepting it!

russellb avatar Feb 19 '25 22:02 russellb

@russellb thanks for reviewing and sharing your opinion. The file monitoring is limited to SSL key rotation at the moment. Generally, I feel people shouldn't mix up the expectation of model file loading with SSL key rotation. (Do we need to make it more clearly stated through documentation?)

About relying on systemctl reload <service>, I feel it might be better to keep vLLM service more self contained without relying assumptions like using systemctl to manage it. vLLM service are used in various infra setup, it would be nice to keep it flexible to support different types of infra setup.

cc: @WoosukKwon @simon-mo to hear about your suggestions.

youngkent avatar Feb 20 '25 00:02 youngkent

About relying on systemctl reload , I feel it might be better to keep vLLM service more self contained without relying assumptions like using systemctl to manage it.

just to be clear, I don't suggest that systemd should be required for this. On the vLLM side, it's handling the SIGHUP signal. There's different ways to send the signal to trigger a reload and systemctl reload ... is just one example of where it can be wrapped up in a nice interface, consistent with triggering reloads across multiple system services.

russellb avatar Feb 20 '25 14:02 russellb

I think there's at least a small race condition here, where either the server cert OR the CA cert will be updated, but not both (but both need to be updated). One or more connections could get handled in between updating each file.

This is well scoped enough to handle a common deployment scenarios. I do agree SIGHUP might be a better option if designed from scratch but this PR offers isolated functionality for certain integrations.

It's a trivial change to use SIGHUP, FWIW.

russellb avatar Feb 23 '25 13:02 russellb

@russellb Once we have a deployment environment that supports SIGHUP signaling when certs are updated, I think we can definitely extend the functionality here to support SIGHUP mode.

youngkent avatar Feb 24 '25 02:02 youngkent