deepanker13 comments

Results 16 comments of deepanker13

DataLoader time is always 0

Issue → The PyTorch profiler is not capturing DataLoader time and runtime; it always shows 0. Code used → I have used the code given in the official PyTorch profiler documentation ([PyTorch documentation](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html))...
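One way to cross-check a reported DataLoader time of 0 is to time batch fetching by hand, independently of the profiler. The following is only a minimal sketch under that assumption: `time_batch_fetch` is a hypothetical helper name, it is not part of PyTorch, and it does not explain or fix the profiler behaviour described in the issue. It accepts any iterable, so a `DataLoader` could be passed in the same way as the plain generator used here.

```python
import time
from typing import Iterable, Tuple


def time_batch_fetch(loader: Iterable) -> Tuple[float, int]:
    """Return (seconds spent waiting for batches, number of batches).

    Hypothetical helper: it measures only the time spent inside next(),
    i.e. how long the loop waits for the loader to hand over each batch.
    """
    fetch_seconds = 0.0
    batches = 0
    iterator = iter(loader)
    while True:
        start = time.perf_counter()
        try:
            next(iterator)  # blocks until the next batch is ready
        except StopIteration:
            break
        fetch_seconds += time.perf_counter() - start
        batches += 1
        # A real training step would run here, outside the timed region.
    return fetch_seconds, batches
```

If this manual measurement is clearly non-zero while the profiler still reports 0, that would suggest the problem lies in how the profiler attributes the time rather than in the loader itself.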

Training: Add Fine-Tune API Docs

@andreyvelich the links in fine-tuning.md are giving 404 page not found. Am I missing something?

Training: Add Fine-Tune API Docs

> > @andreyvelich the links in fine-tuning.md are giving 404 page not found. Am I missing something?
>
> @deepanker13 Did you check these links via Website preview: https://deploy-preview-3718--competent-brattain-de2d6d.netlify.app/ ?...

[Release] Training Operator 1.8 Roadmap

> @johnugeorge @deepanker13 Do we need to create tracking issue with remaining items for Train/Fine-tune API for LLMs ?

Okay, I will create one.

PVC creation as part of PyTorch job spec

/reopen

Fine-Tune APIs for LLM Documentation

@StefanoFioravanzo I can help with the tutorial. Also, do you have any reference for the API documentation?

[SDK] Use Elastic Policy and torchrun as a Default Entrypoint for PyTorchJob

@tenzen-y I think environment variables like PET_RDZV_ENDPOINT, PET_RDZV_BACKEND, etc. get set for the containers only when we pass the elastic policy spec (https://github.com/kubeflow/training-operator/blob/0b6a30cd348e101506b53a1a176e4a7aec6e9f09/pkg/controller.v1/pytorch/envvar.go#L109). And the above-mentioned environment variables are...
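The claim above can be checked from inside a running container by looking at whether the two variables named in the comment are present. This is a minimal sketch, not code from the training operator: `rdzv_env` and `elastic_env_present` are hypothetical helper names, and only the two variable names quoted in the comment are inspected.

```python
import os
from typing import Dict, Mapping, Optional

# Variable names taken from the comment; other PET_* variables are not covered.
RDZV_VARS = ("PET_RDZV_ENDPOINT", "PET_RDZV_BACKEND")


def rdzv_env(environ: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Return the rendezvous variables, with None for any that are unset."""
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in RDZV_VARS}


def elastic_env_present(environ: Optional[Mapping[str, str]] = None) -> bool:
    """True only when every variable in RDZV_VARS is set."""
    return all(value is not None for value in rdzv_env(environ).values())


if __name__ == "__main__":
    # Inside a container, this prints what the process actually received.
    print(rdzv_env())
    print("elastic env present:", elastic_env_present())
```

Running this in a PyTorchJob pod created with and without an elastic policy would show whether the variables appear only in the first case, as the comment suggests.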

Remaining items for Train/Fine-tune sdk

Consider container image rename of `kubeflow/storage-initializer`

KEP-2170: Create PyTorch multi-node distributed training runtime