mpi-operator
[Question] Best practice for distributing training code to workers
From the examples provided, it seems that the training script each MPI worker executes is bundled into the container image. Is there a better recommended approach for distributing the training code? Ideally I'm looking for something that lets images stay more static and contain only the required dependencies, while the training code is distributed as part of setting up the job.
Feel free to point me to docs or more relevant material that covers this topic. I just could not find anything myself.
I believe the training scripts should ideally be kept separate from the training image. As code, training scripts are better organized and maintained on GitHub/GitLab/etc. Whether the code is hosted on a Git server or in cloud storage, you can add an [initContainer](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) that clones or copies it into an emptyDir volume shared with the training container.
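As a rough sketch of the initContainer approach, the pod spec fragment below clones a repository into an emptyDir volume before the training container starts. The repository URL, image names, container names, and the `/workspace` path are all hypothetical placeholders, and the init image is only assumed to have `git` installed. In an MPIJob, a fragment like this would presumably go into the pod template of the replicas that need the code.

```yaml
# Pod spec fragment (illustrative only; all names and URLs are placeholders).
spec:
  initContainers:
    - name: fetch-code                    # hypothetical name
      image: alpine/git                   # assumption: any image that provides git
      # Runs to completion before the training container starts.
      command: ["git", "clone", "--depth=1",
                "https://github.com/example-org/training-code.git",  # hypothetical repo
                "/workspace/code"]
      volumeMounts:
        - name: code
          mountPath: /workspace
  containers:
    - name: trainer                       # hypothetical name
      image: example.com/training-deps:latest   # hypothetical dependencies-only image
      volumeMounts:
        - name: code                      # same volume, so the cloned code is visible here
          mountPath: /workspace
  volumes:
    - name: code
      emptyDir: {}                        # scratch volume that lives as long as the pod
```

With this layout the image only needs rebuilding when dependencies change; a private repository would additionally need credentials (for example, mounted from a Secret), which this sketch leaves out.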