Deploy ToolHive Operator into OpenShift
We want to be able to deploy the ToolHive operator and associated MCP servers into OpenShift.
cc @jhrozek
@ChrisJBurns @jhrozek I recently started looking into this myself via my ootb-on-okd branch. In particular, I generated a Dockerfile for thv-operator based on the Dockerfile for the Apache ActiveMQ Artemis broker operator we had been using, and it follows the same pattern of using an entrypoint script. There's also an additional build task in the TaskFile.yml for the moment to build from the Dockerfile rather than modify the original.
Did you have ideas about how best to proceed in the context of the existing project setup? Not being that familiar with Google ko myself, this seemed like a quick and natural next step: separate out the Dockerfile so it can be customized directly depending on the need. Does that make sense, or is there something simpler that can be done? This allows the operator to run in the test target environment, which is an OKD 4.19.z single-node OKD cluster.
@RoddieKieley are you struggling with the ko-based container? Last I checked it ran fine on OpenShift. The main thing to tweak is the containers it spawns, for which we need to adjust the setup.
We had this thread before https://github.com/stacklok/toolhive/issues/342
I confirm that I could not deploy on OpenShift with the documented Helm install. And, once I worked around it with a local controller (`go run main.go`), the MCPServer deployment failed with some file permission issues (I tried to solve them by injecting the XDG_*** env vars, with no success).
Ah, at some point we fixed it so it wouldn't fail if it couldn't write to the filesystem. Seems like it regressed...
For me, on my OKD 4.19.z SNO, following the deployment instructions out of the box (which work without problem for a minikube deployment), I saw the following in the event log:
and also:
That being said, I haven't tried it this week directly from the existing configuration, as those screenshots were from the 11th.
Will try and update further.
I got the latest source and deployed using the latest tags yesterday, 0.0.11 for the CRDs and 0.1.8 for the operator, and still encountered the operator failing to start on OKD out of the box. However, looking more closely, I saw I could simply add the required values to the Helm chart as per this commit, which adds the required seccompProfile.type: RuntimeDefault to the podSecurityContext and removes the runAsUser: 1000 setting.
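For reference, an OpenShift-friendly values override along these lines might look like the following sketch; the exact key names depend on the chart and are assumptions here, not the chart's actual schema:

```yaml
# Hypothetical values.yaml override for OpenShift/OKD; key names assumed.
operator:
  podSecurityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault   # required by the restricted-v2 SCC
  securityContext:
    allowPrivilegeEscalation: false
    # runAsUser is intentionally omitted so OpenShift can assign a UID
    # from the namespace's allocated range.
```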
At this point the operator itself starts up successfully as per:
Shortly thereafter there is an issue related to the operand and the operator pod is restarted; however, I haven't looked into the details there yet.
Given that some Kubernetes environments will have specific requirements, such as the customized values here for OpenShift, is this something that should be mentioned in the documentation? I'd imagine that for larger changes a separate reference values.yaml might be needed, but that doesn't seem to be the case here.
cc: @danbarr on the doc-related changes
@RoddieKieley
If I remember correctly, OCP can't run a pod with an arbitrary user ID unless the SA is allowed to use the anyuid or privileged SCCs. And actually I don't see a reason for the operator and operand to run with an arbitrary ID; the randomly assigned IDs should work on OCP.
I think the approach of removing the runAsUser setting in the operator deployment seems correct to me.
About the operand failing, I suspect it's related to this code. What we could do is check whether the operator is running on OCP by checking for the presence of one of the proprietary OCP APIs, e.g. like this:
```go
// Check for an OpenShift-specific API group to detect OCP.
_, err := clientset.Discovery().ServerResourcesForGroupVersion("config.openshift.io/v1")
if err == nil {
    // Running on OpenShift
}
```
and then modify the code I linked above to not set UIDs.
@RoddieKieley btw, I tried the OpenShift developer access last week, but that didn't have enough privileges to install CRDs. If we could get access to a cluster, we'd be happy to help ourselves.
Actually, one more side note: there are MCP servers that require a particular UID. They are typically badly written (e.g. the original fetch MCP server requires running as root (!!!!!)). I'm not sure if "badly written MCPs" are something worth supporting, but if so, the operator would have to be allowed to use the anyuid SCC. Maybe that could be a deployment option though.
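If such a deployment option were added, granting the SCC would amount to an RBAC binding like the sketch below. The namespace, service account name, and the pre-created `system:openshift:scc:anyuid` ClusterRole are assumptions here (the same effect is usually achieved with `oc adm policy add-scc-to-user anyuid -z <sa>`):

```yaml
# Hypothetical: allow the workload's service account to use the anyuid SCC.
# Names are illustrative, not actual chart values.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: toolhive-anyuid
  namespace: toolhive-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:scc:anyuid
subjects:
  - kind: ServiceAccount
    name: toolhive-operator
    namespace: toolhive-system
```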
That's why we ditched the fetch MCP in favor of gofetch which we wrote in such a way it doesn't require root.
I think the order of the day at the moment when it comes to MCP servers is flexibility. Ideally it would be great to enable everyone to run whatever MCP server they wanted, but in reality, as you indicate, tool-use safety is a thing too, given the seriously negative consequences that are possible; e.g. Replit AI went rogue and deleted a company's entire database.
For the OCP detection, I'm pretty sure OKD should behave the same as OCP in this case. For the ActiveMQ Artemis Operator we detect OCP by looking for the availability of Routes, and OKD does indeed have Routes, which are an OpenShift addition.
To be fair, for a while installing OKD as per the single-node instructions wasn't completely smooth sailing, but I don't recall the specifics as I last did a clean install some time ago.
I'll have to take a look at the other options while also checking out the code pointed to above.
Ah, it seems like OKD came a long way in terms of usability of installation since the last time I looked. Let me check if we could deploy a small OKD cluster locally to team up on the work.
I started looking at the ensurePodTemplateConfig code, but then decided I should run it up and take a look, as there was a one-line message that flashed in the log as the operator pod crash-looped. It looks like the problem keeping the operator from staying up was straightforward to resolve:
It was being OOMKilled due to the default spec.containers.resources, which were too small for my environment. So I removed the limits and gave requests.cpu a value of '1' and memory a value of 512Mi for the moment, and the operator stabilized.
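As a sketch, the corresponding values override might look something like this (key names assumed from common Helm chart conventions, not verified against this chart):

```yaml
# Hypothetical resources override; key names assumed.
operator:
  resources:
    requests:
      cpu: "1"
      memory: 512Mi
    # limits omitted for now so the operator is not OOMKilled on this
    # OKD SNO environment; revisit once a sensible ceiling is known.
```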
Will see about deploying some MCP Servers now and see what happens. Thanks for the engagement and pointers thus far!
Initially I had the same experience as @dmartinol, as per:
```
10:47PM ERR error loading configuration: unable to fetch config path: could not create any of the following paths: [/.config/toolhive /etc/xdg/toolhive]
```
I tried setting the env vars via the env var array in the MCPServer CR instance, which didn't immediately resolve the problem. The fetch MCP server is spun up in the Deployment but without the env vars set. Manually setting them in the Deployment results in a running fetch MCP server instance that never becomes ready, hence the light blue rather than dark blue circle on the left in this shot:
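For context, the CR-level attempt looked roughly like the sketch below. The API group/version and field names are assumptions from the project's examples, and the `/tmp` paths are just illustrative writable locations for a pod running under a random OpenShift UID with no home directory:

```yaml
# Hypothetical MCPServer manifest; apiVersion and field names assumed.
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPServer
metadata:
  name: fetch
spec:
  image: ghcr.io/stackloklabs/gofetch/server
  env:
    # Point config lookup at a writable path, since the randomly
    # assigned UID has no home directory.
    - name: XDG_CONFIG_HOME
      value: /tmp/.config
    - name: HOME
      value: /tmp
```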
This works; however, the log indicates it's waiting for the statefulset fetch to be ready:
However, the fetch statefulset has zero pods associated with it, though it does appear to have the env vars set:
This is likely an issue around how the MCP server instance deploys the pod: ownership, from what I recall, should rest with the statefulset so that the pod is associated with the headless service.
That being said, the operator does correctly deploy into OpenShift at this point with the small updates above: more resources for my OKD 4.19.z SNO environment, plus the original seccompProfile: RuntimeDefault / runAsUser-related updates.
Some slight updates regarding the screenshot of the fetch-69fc5f4b84-bjvnb pod above.
- It was complaining about security context settings similar to the operator's. I created a new example mcpserver_fetch_with_pod_template.yaml on my branch to resolve that issue.
- The problem mentioned earlier around the XDG env vars, specifically XDG_CONFIG_HOME and HOME, also existed for the MCP server when launched, so I updated the operator source to temporarily hard-code the env var values for the fetch MCP server in the Deployment so that they would be set for any pod in that Deployment.
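For the first point, the pod-template example presumably overrides the pod's security context via the CR; a sketch of what such an override could look like (field names are assumptions, not the actual example file):

```yaml
# Hypothetical podTemplateSpec override in an MCPServer CR; fields assumed.
spec:
  podTemplateSpec:
    spec:
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
        # No runAsUser/runAsGroup: let OpenShift assign IDs from the
        # namespace's allocated range.
      containers:
        - name: mcp
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
```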
Using minikube, a main-branch build, and the published toolhive runner 0.2.0 container image, I see that the statefulset does indeed become ready, which is interesting and a remaining difference in behaviour.
```
10:10AM INF Using target port: 8080
10:10AM INF OIDC validation disabled, using local user authentication
10:10AM INF Using local user authentication for user: nonroot
10:10AM INF MCP parsing middleware enabled for transport
10:10AM INF Saved run configuration for fetch
10:10AM INF Setting up streamable-http transport...
10:10AM INF Deploying workload fetch from image ghcr.io/stackloklabs/gofetch/server...
10:10AM INF Applied statefulset fetch
10:10AM INF Created headless service fetch for SSE transport
10:10AM INF Waiting for statefulset fetch to be ready (0/1 replicas ready)...
10:10AM INF Container created with ID: fetch
10:10AM INF Starting streamable-http transport for fetch...
10:10AM INF Setting up transparent proxy to forward from host port 8080 to http://mcp-fetch-headless:8080
10:10AM INF Applied middleware 2
10:10AM INF Applied middleware 1
10:10AM INF HTTP transport started for container fetch on port 8080
10:10AM INF Transparent proxy started for container fetch on 0.0.0.0:8080 -> http://mcp-fetch-headless:8080
10:10AM INF MCP server fetch started successfully
10:10AM INF No client configuration files found
10:10AM INF Press Ctrl+C to stop or wait for container to exit
10:10AM INF MCP server not initialized yet, skipping health check for fetch
```
There were a couple more places needing the same type of work for the user, group, fsGroup and seccompProfile settings, which the OKD environment required to be set differently from the defaults and which didn't appear to be passed through to all the required places from the initial Helm chart update for the operator deployment.
I temporarily hard-coded updates to them just for OKD here, just to have things working for the moment. I'll need to look at the data flow of those parameters to see where they should come from and where they should (or should not) be set, and also actually detect the OKD environment correctly instead of just assuming it.
But at least the MCPServer instance for fetch is starting up correctly now.
For clarity, I was adjusting the commit message in the 4 listed commits, with ed017de being the most recent and the one that PR #1253 actually references.
On a new clean install of the just-updated Helm charts for the operator, 0.2.6 with operator 0.2.8, upon creating the included fetch MCP server I'm now seeing:
which appears to be due to the hard-coded 1000 for user and group in the MCP server deployment, as per:
I'll take a look and see what is needed to resolve the problem.
@RoddieKieley can we close this one?