feat: Add LoRA fine-tuning optimum-neuron example for slurm
Issue #, if available:
Description of changes:
This example uses Slurm as the orchestrator for the Optimum Neuron LoRA fine-tuning example. It also targets the workshop at https://catalog.workshops.aws/sagemaker-hyperpod/en-US.
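For context, below is a minimal sketch of the LoRA-wrapping pattern such a fine-tuning script typically builds on (the model choice, hyperparameters, and target modules are illustrative assumptions, not taken from this PR):

```python
# Minimal sketch: wrap a causal LM with LoRA adapters via peft,
# the pattern the Optimum Neuron fine-tuning script builds on.
# Model and hyperparameter values here are placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

lora_config = LoraConfig(
    r=16,                       # adapter rank
    lora_alpha=32,              # scaling factor
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```

On Slurm, the actual training run would typically be launched per node via an sbatch script (e.g. with srun/torchrun) against the Optimum Neuron trainer; the exact launch commands are defined by the example scripts in this PR.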
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Hi Captainia, thank you for the submission! Is this ready for review?
Please connect with me on Slack if you need permissions for the workshop. My Amazon alias is the same as my GitHub alias.
Let's use x.filename instead of x_filename to align with the naming convention of the other test cases (e.g. https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/pytorch/cpu-ddp/slurm).
Could you please consolidate all the Python files for the k8s/slurm optimum test case into a single directory? I see some duplicates with https://github.com/aws-samples/awsome-distributed-training/pull/631/files.
Thank you, that sounds good. I will rebase and refactor the Python dependencies once the EKS example is merged. Currently it is hard to reuse them across PRs, and I don't think we should merge the two into a single giant PR.
Sounds good! Thank you @Captainia for working on it.