Not getting Kubeflow Training SDK v1.7 when installing `kubeflow-training`

Open JamesKunstle opened this issue 10 months ago • 14 comments

In a new virtual environment, I'm installing kubeflow-training only.

This is the freeze I get:

cachetools==5.3.3
certifi==2024.2.2
charset-normalizer==3.3.2
google-auth==2.29.0
idna==3.7
kubeflow-training==1.7.0
kubernetes==29.0.0
oauthlib==3.2.2
pyasn1==0.6.0
pyasn1_modules==0.4.0
python-dateutil==2.9.0.post0
PyYAML==6.0.1
requests==2.31.0
requests-oauthlib==2.0.0
retrying==1.3.4
rsa==4.9
setuptools==69.5.1
six==1.16.0
urllib3==2.2.1
websocket-client==1.8.0

However, when I inspect the code that's been installed at new_venv/lib/python3.12/site-packages/kubeflow/training/api/training_client.py the code isn't up to date with the 1.7 SDK release that I can see here on GitHub.

Specifically, I see that the function get_job_logs is different. I need the most up-to-date one.
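One way to confirm which SDK build is actually on disk is to ask Python directly instead of reading files under site-packages. The sketch below is only an illustration and not part of the SDK: `installed_version` and `describe_method` are hypothetical helper names, and it assumes the client class is exposed as `TrainingClient` in `kubeflow.training`. It reports the installed distribution version plus the source file and signature of `get_job_logs`, which can then be compared against the branch on GitHub.

```python
import importlib
import importlib.metadata
import inspect


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return None


def describe_method(module_name, class_name, method_name):
    """Return (source_file, signature) for a method, or None if the module is missing."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    method = getattr(getattr(module, class_name), method_name)
    return inspect.getsourcefile(method), str(inspect.signature(method))


if __name__ == "__main__":
    # Distribution name as published on PyPI.
    print(installed_version("kubeflow-training"))
    # Assumption: the class is importable as kubeflow.training.TrainingClient.
    print(describe_method("kubeflow.training", "TrainingClient", "get_job_logs"))
```

If the printed signature differs from the one on the branch you are reading on GitHub, the installed package was built from a different revision.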

JamesKunstle avatar Apr 24 '24 18:04 JamesKunstle

Thank you for creating this @JamesKunstle. We publish the SDK with each Training Operator release: https://pypi.org/project/kubeflow-training/. E.g. the latest version is 1.7, so to see the changes for that SDK, you need to check the 1.7 release branch: https://github.com/kubeflow/training-operator/blob/v1.7-branch/sdk/python/kubeflow/training/api/training_client.py

andreyvelich avatar Apr 24 '24 19:04 andreyvelich

What would be the supported path to get the most up-to-date SDK code? The main-branch code does what I want, but the code that gets pulled when I install the kubeflow-training library does not.

JamesKunstle avatar Apr 24 '24 19:04 JamesKunstle

@andreyvelich how do you publish releases to PyPI? I took a look at the code and I didn't see any actions doing a release automatically. I reached out to @tenzen-y about this as well.

franciscojavierarceo avatar Apr 24 '24 20:04 franciscojavierarceo

FWIW @andreyvelich, for Feast we have the release process fully automated and deployed to PyPI with this action: https://github.com/feast-dev/feast/actions/workflows/release.yml

franciscojavierarceo avatar Apr 24 '24 20:04 franciscojavierarceo

Happy to help out and replicate the same here if that would be desirable.

franciscojavierarceo avatar Apr 24 '24 20:04 franciscojavierarceo

Could you try something like this?

pip install git+https://github.com/kubeflow/training-operator.git@master#subdirectory=sdk/python

I've never installed from a subdirectory before, but I think this should work.

anishasthana avatar Apr 24 '24 20:04 anishasthana

@JamesKunstle If you want to get the latest changes for SDK, I added the scripts in this PR: https://github.com/kubeflow/website/pull/3719. Similar to @anishasthana's comment, you can do this:

pip install git+https://github.com/kubeflow/training-operator.git@7345e33b333ba5084127efe027774dd7bed8f6e6#subdirectory=sdk/python
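The command above follows pip's VCS URL pattern: the repository URL, then `@` and a branch, tag, or commit, then a `#subdirectory=` fragment pointing at the directory that contains the Python package. As a minimal sketch, the hypothetical helper `vcs_requirement` below only assembles that string, so the same pinned commit can be reused in a script or written to a requirements file. It does not run pip itself.

```python
def vcs_requirement(repo_url, ref, subdirectory):
    """Build a pip-installable VCS URL pinned to a branch, tag, or commit."""
    return f"git+{repo_url}@{ref}#subdirectory={subdirectory}"


if __name__ == "__main__":
    requirement = vcs_requirement(
        "https://github.com/kubeflow/training-operator.git",
        "7345e33b333ba5084127efe027774dd7bed8f6e6",  # commit from the command above
        "sdk/python",
    )
    # Print the command rather than executing it, so nothing is installed by accident.
    print(f"pip install {requirement}")
```

Pinning to a commit hash rather than `master` keeps the install reproducible, at the cost of having to bump the hash by hand to pick up newer changes.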

andreyvelich avatar Apr 24 '24 21:04 andreyvelich

> @andreyvelich how do you publish releases to PyPI? I took a look at the code and I didn't see any actions doing a release automatically. I reached out to @tenzen-y about this as well.

Currently, we don't have a script to automate the release process for Training Operator, so @johnugeorge publishes the SDK manually after we cut the release. However, for Katib SDK we have this script that we run to publish Images + SDK after the release: https://github.com/kubeflow/katib/blob/master/scripts/v1beta1/release.sh#L85-L97.

> Happy to help out and replicate the same here if that would be desirable.

It would be awesome if you could help us automate releases for Training Operator/Katib. We created this issue a while ago: https://github.com/kubeflow/katib/issues/2049.

andreyvelich avatar Apr 24 '24 21:04 andreyvelich

On a similar note: we have a ton of GitHub Actions we built to automate releases for codeflare. Some links:

  1. https://github.com/project-codeflare/codeflare-sdk/blob/main/.github/workflows/release.yaml
  2. https://github.com/project-codeflare/codeflare-operator/blob/main/.github/workflows/project-codeflare-release.yml

anishasthana avatar Apr 24 '24 22:04 anishasthana

> However, for Katib SDK we have this script that we run to publish Images + SDK after the release: https://github.com/kubeflow/katib/blob/master/scripts/v1beta1/release.sh#L85-L97.

So is publishing the image also manual?

franciscojavierarceo avatar Apr 25 '24 03:04 franciscojavierarceo

> > However, for Katib SDK we have this script that we run to publish Images + SDK after the release: https://github.com/kubeflow/katib/blob/master/scripts/v1beta1/release.sh#L85-L97.
>
> So is publishing the image also manual?

We usually publish the operator image via this workflow: https://github.com/kubeflow/training-operator/blob/86e0df17db715543b366e885c9ae659aa1342c8e/.github/workflows/publish-core-images.yaml#L24-L26.

tenzen-y avatar Apr 25 '24 04:04 tenzen-y

@andreyvelich @anishasthana Okay, yeah, that works now. I can see the most recent changes. I would really appreciate a more "pypi"-y way of installing the latest release. I think I was getting a fairly old package when I was installing by name from PyPI.

JamesKunstle avatar Apr 25 '24 14:04 JamesKunstle

> @andreyvelich @anishasthana Okay, yeah, that works now. I can see the most recent changes. I would really appreciate a more "pypi"-y way of installing the latest release. I think I was getting a fairly old package when I was installing by name from PyPI.

Basically, we release the SDK whenever we make a new release of Training Operator, to keep all component versions (Controller + SDK) consistent. That helps us keep versions stable. Any thoughts, @JamesKunstle?

andreyvelich avatar Apr 26 '24 18:04 andreyvelich

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jul 25 '24 20:07 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot] avatar Aug 15 '24 00:08 github-actions[bot]