training-operator
Add Env Var in InitContainer
https://github.com/kubeflow/training-operator/blob/05badc6ee8a071400efe9019d8d60fc242818589/pkg/controller.v1/xgboost/xgboost.go#L74
Can we add env vars in the initContainer? For example, with data parallelism, each pod could use its rank to fetch its own part of the training dataset during pod initialization, so the app container has no need to care where the dataset comes from.
Let me restate my situation. I want to pass some environment variables such as RANK to the init container so it can do things like dataset download. That would decouple the business code from the rest. Do you have any suggestions?
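For context, here is a minimal Go sketch of the idea being requested: copying selected env vars (e.g. RANK) from the main containers onto every init container. The helper name and the simplified types are illustrative, not the actual operator code (the real controller uses `k8s.io/api/core/v1` types):

```go
package main

import "fmt"

// Minimal stand-ins for the corev1 types involved; the real operator
// uses k8s.io/api/core/v1. This is a sketch, not the controller code.
type EnvVar struct{ Name, Value string }
type Container struct {
	Name string
	Env  []EnvVar
}
type PodSpec struct {
	InitContainers []Container
	Containers     []Container
}

// copyEnvToInitContainers (hypothetical helper) copies the named env
// vars from the main containers onto every init container, so an init
// container can use e.g. RANK to decide which dataset shard to fetch.
func copyEnvToInitContainers(spec *PodSpec, names ...string) {
	wanted := map[string]bool{}
	for _, n := range names {
		wanted[n] = true
	}
	seen := map[string]bool{} // don't duplicate a var found in several containers
	for _, c := range spec.Containers {
		for _, e := range c.Env {
			if wanted[e.Name] && !seen[e.Name] {
				seen[e.Name] = true
				for i := range spec.InitContainers {
					spec.InitContainers[i].Env = append(spec.InitContainers[i].Env, e)
				}
			}
		}
	}
}

func main() {
	spec := &PodSpec{
		InitContainers: []Container{{Name: "downloader"}},
		Containers: []Container{{
			Name: "xgboost",
			Env:  []EnvVar{{"RANK", "2"}, {"WORLD_SIZE", "4"}},
		}},
	}
	copyEnvToInitContainers(spec, "RANK", "WORLD_SIZE")
	fmt.Println(spec.InitContainers[0].Env) // → [{RANK 2} {WORLD_SIZE 4}]
}
```

In the real controller this logic would run where the pod env is already being set (e.g. in SetClusterSpec), after both the env vars and the init container have been populated.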
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/lifecycle frozen
/good-first-issue
/good-first-issue
@johnugeorge: This request has been marked as suitable for new contributors.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue
command.
In response to this:
/good-first-issue
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I am interested in working on this ticket.
/assign
I'm glad to know someone is working on this. So you want to move these env variables into the initContainer?
Yes, I could do that.
I presume the requirement here is to move environment variables such as RANK from the application container to an init container for the XGBoost training operator so that the code for dataset download is decoupled from the actual business logic in the application container.
I was wondering if you have any specific use case in the context of which you were thinking about the problem.
This is a case I ran into a long time ago. I used XGBoostJob to do distributed learning with LightGBM. The data was stored on S3, and I had to download a different part of the training data to each worker. My solution was to exec a shell script that downloaded the S3 data into the pod before starting the training Python process. The data preparation should be decoupled; otherwise I have to copy the same download code into the Dockerfile of every other job like XGBoostJob. I think an init container could help here: use a common image that downloads data driven by a few env variables, like
```yaml
image: download-img:1.1
env:
  - name: s3_input_path
    value: s3://bucket/prefix
  - name: partition_strategy
    value: even          # or "all"
  - name: local_output_path
    value: /data
```
It's just a simple proposal, so please tell me if you have other solutions. BTW, even if there were no such case, it would be weird from a design perspective that the init container doesn't see these env variables.
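To make the `partition_strategy: even` idea concrete, here is one possible interpretation sketched in Go (the function name and behavior are assumptions for illustration, not an existing implementation): worker `RANK` takes every `WORLD_SIZE`-th file starting at its own offset.

```go
package main

import "fmt"

// shardFiles illustrates one possible "even" partition strategy an init
// container could apply: the worker with the given rank keeps every
// worldSize-th file, starting at index rank.
func shardFiles(files []string, rank, worldSize int) []string {
	var mine []string
	for i, f := range files {
		if i%worldSize == rank {
			mine = append(mine, f)
		}
	}
	return mine
}

func main() {
	files := []string{"part-0", "part-1", "part-2", "part-3", "part-4"}
	// Worker with RANK=1 out of WORLD_SIZE=2 downloads only its shard.
	fmt.Println(shardFiles(files, 1, 2)) // → [part-1 part-3]
}
```

The "all" strategy would simply return the full file list, which is why having RANK and WORLD_SIZE visible to the init container matters only for the "even" case.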
I didn't have much time before. If you're busy now, I could help with this.
@namespace-io Thank you for the explanation. It would be good to collaborate to get this done.
A follow-up question that I have, and one that @johnugeorge and @tenzen-y may have more intuition on, is whether we should implement the same approach across the other training job controllers as well: PyTorch, TensorFlow, etc.
@namespace-io Also wondering if you have any thoughts on the specifics of how to implement your proposal.
To paint the picture: I worked through the code for the PyTorchJob controller and noticed that the SetClusterSpec function sets up the pod environment variables and the init container separately. This raised the question for @johnugeorge and @tenzen-y of whether we should implement @namespace-io's proposal in the other controllers as well.
The setPodEnv function in the PyTorchJob controller sets the rank and world size, while setInitContainer simply sets up the expected init container.
Please let me know your thoughts on the expected design for the code and whether all controllers are expected to adhere to a common set of design patterns.