trainer: User guide for PyTorch Training

Open izuku-sds opened this issue 9 months ago • 3 comments

Checklist:

  • [x] You have signed off your commits
  • [x] Ensure you follow best practices from our guide: Contributing.
  • [ ] You have included screenshots when changing the website style or adding a new page.

Description of your changes: User guide for PyTorch Training

Issue

Closes: kubeflow/trainer#2543

Labels

/area trainer


izuku-sds avatar Mar 22 '25 02:03 izuku-sds

Hi @izuku-sds. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Mar 22 '25 02:03 google-oss-prow[bot]

PTAL @andreyvelich. Thank you.

izuku-sds avatar Mar 22 '25 03:03 izuku-sds

@andreyvelich is this good to merge?

juliusvonkohout avatar Jun 17 '25 14:06 juliusvonkohout

@andreyvelich is this good to merge?

Not yet. @izuku-sds, maybe in the PyTorch guide we can reduce the content of the training function to high-level items (e.g. define the dataset, model, and training loop). We should also explain that Kubeflow Trainer automatically sets the correct environment for PyTorch Distributed: https://docs.pytorch.org/tutorials/beginner/dist_overview.html For example, WORLD_SIZE, RANK, and LOCAL_RANK are available inside the training function.
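
For illustration, a minimal sketch of how a training function might read these variables. The names `read_distributed_env` and `train_func` are hypothetical and not taken from the PR, and the single-process defaults are an assumption so the code also runs outside a distributed launcher.

```python
import os


def read_distributed_env(env=None):
    """Return the rank information exported for PyTorch Distributed.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    # Assumed defaults: behave like a single-process run when the variables are unset.
    return {
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
    }


def train_func():
    # Hypothetical training function: the environment is expected to be
    # populated already, so the function only reads it.
    dist_env = read_distributed_env()
    print(
        f"rank {dist_env['rank']} of {dist_env['world_size']} "
        f"(local rank {dist_env['local_rank']})"
    )
    # High-level steps would follow here: define the dataset, the model,
    # and the training loop.
    return dist_env
```

Keeping the environment lookup in a small helper means the training function itself can stay at the high level suggested above (dataset, model, training loop).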

andreyvelich avatar Jun 26 '25 11:06 andreyvelich

cc @kramaranya @szaher @eoinfennessy

andreyvelich avatar Jul 22 '25 01:07 andreyvelich

@kramaranya @astefanutti Thanks for the review! Please let me know if we should update any other docs before we can merge these initial changes to the PyTorch guide.

andreyvelich avatar Jul 22 '25 11:07 andreyvelich

@Electronic-Waste @eoinfennessy @kramaranya @astefanutti Any other suggestions for the PR before we move forward?

andreyvelich avatar Jul 22 '25 16:07 andreyvelich

@andreyvelich Just one concern: #4053 (comment). But it should not block this PR:)

/lgtm

Sure, I will create the tracking issue in Kubeflow SDK

andreyvelich avatar Jul 22 '25 16:07 andreyvelich

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

google-oss-prow[bot] avatar Jul 22 '25 16:07 google-oss-prow[bot]