training-operator Add Env Var in InitContainer

https://github.com/kubeflow/training-operator/blob/05badc6ee8a071400efe9019d8d60fc242818589/pkg/controller.v1/xgboost/xgboost.go#L74

Can we add Env var in initContainer? for example, if we use data parallel, we may expect every pod can use it's rank to get their own training dataset and we can download dataset when pod init, app container has no need to care where dataset comes from.

Nov 11 '22 05:11 namespace-io

Let me redescribe my situation. I want to pass some environment variables like RANK to init container to do something like dataset download. then It can decouple business code and other code. Do you have any suggestions?

Feb 16 '23 09:02 namespace-io

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Aug 24 '23 15:08 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

Sep 13 '23 20:09 github-actions[bot]

/lifecycle frozen /good-first-issue

Oct 08 '23 17:10 johnugeorge

/good-first-issue

Oct 08 '23 17:10 johnugeorge

@johnugeorge: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Oct 08 '23 17:10 google-oss-prow[bot]

I am interested to work on this ticket.

/assign

Mar 10 '24 21:03 tariq-hasan

I'm glad to know someone is working on this. So you want to move these env variables into initContainer?

Mar 19 '24 03:03 namespace-io

Yes I could do that.

I presume the requirement here is to move environment variables such as RANK from the application container to an init container for the XGBoost training operator so that the code for dataset download is decoupled from the actual business logic in the application container.

I was wondering if you have any specific use case in the context of which you were thinking about the problem.

Apr 01 '24 05:04 tariq-hasan

This is case i met long time ago. I used xgboostjob to do distributed learning in LightGBM. the data was stored on s3, I had to download different part of train data to different worker. My solution is to exec shell to download s3 data into pod before start the training python progress. The data preparation should be decoupled otherwise i must copy the same download code in dockerfile in other job like xgboostjob. I think the init container may help this. use a common image to download data by using some env variables like

image: download-img:1.1
env:
   s3_input_path: s3://bucket/prefix
   partition_strategy: even/all
   local_output_path: /data

It's just my simple proposal. And please tell me if you have any other solutions. BTW, Even there is no such case, It may be wired that the init container doesn't know these env variables from the design perspective.

I didn't have much time before. If you're busy now. I could help this.

Apr 02 '24 08:04 namespace-io

@namespace-io Thank you for the explanation. It would be good to collaborate to get this done.

A follow-up question that I have and one that @johnugeorge and @tenzen-y may also have more intuition to qualify is whether we should also implement the same approach across other training operators - PyTorch, TensorFlow, etc.

Apr 03 '24 00:04 tariq-hasan

@namespace-io Also wondering if you have any thoughts on the specifics of how to implement your proposal.

To paint the picture I worked through the code for the PyTorchJob controller and I noticed that the SetClusterSpec function sets up the pod environment variables and the init container separately. This begged the question for @johnugeorge and @tenzen-y on whether we implement @namespace-io's proposal to other controllers as well.

The setPodEnv function in the PyTorchJob controller sets the rank and world size while setInitContainer simply sets up the expected init container.

Please let me know your thoughts on the expected design for the code and whether all controllers are expected to adhere to a common set of design patterns.

Apr 08 '24 22:04 tariq-hasan

training-operator training-operator copied to clipboard

Add Env Var in InitContainer

training-operator
training-operator copied to clipboard