mpi-operator [Question] Best practice for distributing training code to workers

Open gmatev opened this issue 3 years ago • 1 comments

From the examples provided, it seems that the training scripts that each MPI worker executes are bundled into the container images. Is there a better, recommended approach for distributing the training code? Ideally, I am looking for something that allows images to be more static and contain only the required dependencies, while the training code is distributed as part of setting up the job.

Feel free to point me to docs or more relevant material that covers this topic. I just could not find anything myself.

Jun 28 '21 17:06 gmatev

I believe the training scripts should ideally be kept separate from the training image. As code, training scripts are better organized and maintained on GitHub, GitLab, etc. Whether the code is hosted on a server or in cloud storage, you can add an [initContainer](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) that clones or copies the code into an emptyDir volume shared with the training container.
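A minimal sketch of the initContainer idea, written as a small Python script that builds the manifest and prints it as JSON (which `kubectl apply -f -` accepts). Everything specific here is an assumption for illustration and does not come from the mpi-operator examples: the repository URL, the `example.com/training-deps:latest` image, the `alpine/git` init image, the `train.py` entry point, the `/workspace` mount path, and the `kubeflow.org/v1` API version, which should be matched to the version installed in your cluster.

```python
import copy
import json

CODE_VOLUME = "training-code"  # hypothetical volume name
REPO_URL = "https://example.com/your-org/training-code.git"  # hypothetical repo


def add_code_init_container(pod_spec, repo_url, mount_path="/workspace"):
    """Return a copy of pod_spec in which an init container clones repo_url
    into an emptyDir volume that every main container also mounts."""
    spec = copy.deepcopy(pod_spec)
    mount = {"name": CODE_VOLUME, "mountPath": mount_path}
    # The emptyDir lives as long as the pod, so the clone happens once per pod.
    spec.setdefault("volumes", []).append({"name": CODE_VOLUME, "emptyDir": {}})
    spec.setdefault("initContainers", []).append({
        "name": "fetch-code",
        "image": "alpine/git",  # assumption: any image that ships git works
        "command": ["git", "clone", "--depth=1", repo_url, mount_path + "/code"],
        "volumeMounts": [dict(mount)],
    })
    for container in spec["containers"]:
        container.setdefault("volumeMounts", []).append(dict(mount))
    return spec


def replica(replicas, pod_spec):
    """Wrap a pod spec in the replicas/template structure of a replica spec."""
    return {"replicas": replicas, "template": {"spec": pod_spec}}


# The image holds only dependencies; the code arrives via the init container.
base_container = {"name": "trainer", "image": "example.com/training-deps:latest"}

launcher_spec = add_code_init_container(
    {"containers": [dict(base_container, command=[
        "mpirun", "-np", "2", "python", "/workspace/code/train.py"])]},
    REPO_URL,
)
worker_spec = add_code_init_container({"containers": [dict(base_container)]}, REPO_URL)

manifest = {
    "apiVersion": "kubeflow.org/v1",  # assumption: match your installed CRD
    "kind": "MPIJob",
    "metadata": {"name": "train-with-cloned-code"},
    "spec": {
        "slotsPerWorker": 1,
        "mpiReplicaSpecs": {
            "Launcher": replica(1, launcher_spec),
            "Worker": replica(2, worker_spec),
        },
    },
}

print(json.dumps(manifest, indent=2))
```

The init container is added to both the launcher and the workers because `mpirun` starts the script on the workers, so each pod needs its own copy of the code. Pinning the clone to a specific commit or tag, rather than the default branch, would keep all pods on the same revision.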

Jun 29 '21 01:06 zw0610
