mpi-operator
Run as non-root user
Are there any plans for an implementation of mpi-operator that can run as a user other than root?
/cc @terrytangyuan @carmark
No plan on this yet. If you have any suggestions on how to approach this and would like to contribute, please let us know here!
From what I can tell, it seems the only part that requires root is for updating the /etc/hosts file. Please correct me if I'm wrong.
I've been thinking about a few ways to approach it. If this is the only reason, the solution would be to find an alternate method of making the job members resolvable.
What I've attempted is to create a headless service for the stateful set as described here. I can imagine this technique causing a race condition in some clusters, where DNS names are not yet resolvable when mpirun is executed, but this could be scripted around with something that checks name resolution before issuing the exec that launches mpirun. I haven't personally seen any issues from members not being immediately resolvable.
I've been hacking around with the code and have been able to add the creation of a headless service to the stateful set creation. It seems to work, but MPI fails to execute because the launcher, in its current state, is a job member and also needs to be resolvable.
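The "check name resolution before launching mpirun" idea could look roughly like the sketch below. It is only an illustration: the function name `wait_for_hosts`, the timeout values, and the injectable `resolve`/`sleep` parameters are assumptions for this example and are not part of mpi-operator.

```python
import socket
import time


def wait_for_hosts(hostnames, timeout=60.0, interval=2.0,
                   resolve=socket.gethostbyname, sleep=time.sleep):
    """Block until every hostname resolves, or raise TimeoutError.

    `resolve` and `sleep` are injectable only so the loop can be
    exercised without real DNS; the defaults use the standard library.
    """
    deadline = time.monotonic() + timeout
    pending = list(hostnames)
    while True:
        still_pending = []
        for host in pending:
            try:
                resolve(host)
            except OSError:  # socket.gaierror is a subclass of OSError
                still_pending.append(host)
        pending = still_pending
        if not pending:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("unresolved hosts: " + ", ".join(pending))
        sleep(interval)
```

A launcher wrapper could call `wait_for_hosts(worker_names)` and only then hand over to mpirun, for example with `os.execvp("mpirun", argv)`, so that a slow DNS update delays the job instead of failing it.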
I can think of 2 different solutions for making the launcher resolvable.
- Use Kubernetes' built-in HostAlias. I've attempted to add this to the stateful set, but it appears the stateful set is defined by the controller at submission, before the launcher job actually begins running, so the API query for the launcher's hostname and IP to add to the stateful set returns nothing.
- Don't include the launcher as part of the MPI job. Exec into the rank0 worker and execute mpirun from there. I haven't fully fleshed out this idea, and having an inter-pod exec controlling the job (which could last hundreds of hours) might be doomed.
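To make the HostAlias option and its ordering problem concrete, here is a minimal sketch that adds a launcher entry to a pod spec held as a plain dict. The `hostAliases` field with `ip` and `hostnames` is the standard Kubernetes pod spec field; the helper name `with_launcher_alias` and the sample values are hypothetical.

```python
def with_launcher_alias(pod_spec, launcher_hostname, launcher_ip):
    """Return a copy of pod_spec with a hostAliases entry for the launcher.

    Raises ValueError when the launcher IP is not known, which is the
    situation described above: the worker template is built before the
    launcher pod exists, so there is nothing to put in the alias.
    """
    if not launcher_ip:
        raise ValueError("launcher IP is not known yet")
    patched = dict(pod_spec)
    aliases = list(pod_spec.get("hostAliases", []))
    aliases.append({"ip": launcher_ip, "hostnames": [launcher_hostname]})
    patched["hostAliases"] = aliases
    return patched


# Hypothetical values, for illustration only.
spec = with_launcher_alias({"containers": []}, "demo-launcher", "10.1.2.3")
print(spec["hostAliases"])
```

The alias has to be part of the pod template when the workers are created, which is why an empty launcher IP at submission time is a blocker for this approach.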
Potentially, an alternate approach might be an init container that creates a world-writable hosts file in an emptyDir volume that overrides the main container's /etc/hosts.
I'm sure there are others that may have much more elegant solutions for how to approach this. I appreciate you taking the time to entertain my ideas and request.
Hi, I have a stupid question: why does this operator need to change /etc/hosts in the first place? This file is managed by k8s. Thanks
Having all MPI worker names resolvable is an MPI requirement.
@jcatana This is clear, but what information is needed that the service name doesn't contain?
That is sort of the issue. Currently no service is created for the stateful set. To further complicate the issue, kubernetes services cannot address kubernetes Jobs. This is why I talk about utilizing the hostAlias parameter in the statefulSet definition.
A Kubernetes MPIJob is composed of two objects:
- Launcher -> Kind: Job -> this executes the mpirun command.
- Workers -> Kind: StatefulSet -> these are the targets of the mpirun command.
Neither of these Kinds creates integrated Kubernetes DNS entries, which is why, I'm guessing, the MPI operator modifies the local /etc/hosts files in the pods. Modification of this file requires root (usually...). All participating entities in the MPI job must be resolvable, workers and launcher.
My proposal is to use built-in Kubernetes methods of making the objects resolvable, so that modification of /etc/hosts is not necessary and therefore root is not required.
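As a rough illustration of what "built-in methods" could mean for the workers, the sketch below builds a headless Service manifest (a Service with `clusterIP: None`) as a Python dict. The names `demo-worker` and `default` and the `app` label are hypothetical placeholders, not the labels the operator actually sets.

```python
import json


def headless_service(name, namespace, selector):
    """Return a headless Service manifest as a dict.

    With clusterIP set to "None", cluster DNS publishes records for the
    individual pods matched by the selector instead of one virtual IP.
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "clusterIP": "None",
            "selector": dict(selector),
        },
    }


# Hypothetical names and labels, for illustration only.
svc = headless_service("demo-worker", "default", {"app": "demo-worker"})
print(json.dumps(svc, indent=2))
```

For the per-pod names to exist, the StatefulSet's `spec.serviceName` has to match the Service name. This covers the workers only; the launcher Job still needs its own answer.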
@jcatana Kubernetes DNS entries are created by services, and the most common way to define a service is with a pod selector. It doesn't actually matter whether the pod was created by a deployment, stateful set, job, or directly. The only possible problem I see with this is the delay until the job's pods are added to the Endpoints (or EndpointSlice) object that the service relies on.
@jcatana I looked at the code and there are two references to /etc/hosts:
- generateHosts, which adds entries to it. This code can be replaced by generating a service, either by setting a selector or, more directly, by creating an Endpoints object and then a service on top of it.
- newConfigMap, where the code copies the information from the file with kubectl cp. That is a bad idea anyway (the command relies on a lot of other things: the existence of tar, a writable root, etc.) and can be replaced by putting in the services' names.
Note that newConfigMap is for v1alpha2, so I am not sure that it really needs to be fixed, and therefore generateHosts wouldn't be needed (or it could be turned off by some flag).
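"Putting in the services' names" could mean generating the hostfile from predictable DNS names rather than copying /etc/hosts out of a pod. The sketch below assumes an Open MPI-style hostfile (`host slots=N`), a StatefulSet named `<job>-worker`, and a headless service in the same namespace; all three are assumptions for illustration, not what newConfigMap currently does.

```python
def worker_hostfile(job_name, service, replicas, slots=1):
    """Build hostfile text from StatefulSet pod DNS names.

    A StatefulSet pod behind a headless service is addressable as
    <statefulset>-<ordinal>.<service> from within the same namespace.
    """
    lines = []
    for ordinal in range(replicas):
        host = f"{job_name}-worker-{ordinal}.{service}"
        lines.append(f"{host} slots={slots}")
    return "\n".join(lines) + "\n"


# Hypothetical job and service names, for illustration only.
print(worker_hostfile("demo", "demo-worker", replicas=2), end="")
```

Because the names depend only on the job name and replica count, the text can be written into the ConfigMap at submission time, before any pod exists.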
Yes, this solution would also work, but it has the same issue as hostAlias. You do not know the podSelector label until after the Job's pod is created, much like with hostAlias, where you do not know the pod's IP until after it is created.
Here is my code that will create the headless service for the stateful set: https://github.com/jcatana/mpi-operator/ It also modifies the sh script that is run so that it does not attempt to modify the hosts file. I've only been playing with v1alpha2. I'd be happy to collaborate on the last piece of the puzzle.
Just to note: MPIJobs created from my code fail because the main launcher is not resolvable.