feat: Add LoRA fine-tuning optimum-neuron example for slurm
Issue #, if available:
Description of changes:
This example uses Slurm as the orchestrator for the Optimum Neuron LoRA fine-tuning example. It also targets the workshop at https://catalog.workshops.aws/sagemaker-hyperpod/en-US.
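For context, below is a minimal sketch of the LoRA-wrapping pattern such a fine-tuning script typically builds on (the model choice, hyperparameters, and target modules are illustrative assumptions, not taken from this PR):

```python
# Minimal sketch: wrap a causal LM with LoRA adapters via peft,
# the pattern the Optimum Neuron fine-tuning script builds on.
# Model and hyperparameter values here are placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

lora_config = LoraConfig(
    r=16,                       # adapter rank
    lora_alpha=32,              # scaling factor
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```

On Slurm, the actual training run would typically be launched per node via an sbatch script (e.g. with srun/torchrun) against the Optimum Neuron trainer; the exact launch commands are defined by the example scripts in this PR.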
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Hi Captainia, thank you for the submission! Is this ready for review?
Please connect with me on Slack if you need permissions for the workshop. My Amazon alias is the same as my GitHub alias.
Let's use x.filename instead of x_filename to align with the naming convention of the other test cases (e.g. https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/pytorch/cpu-ddp/slurm).
Could you please consolidate all the Python files for the k8s/slurm optimum test case into a single directory? I see some duplicates with https://github.com/aws-samples/awsome-distributed-training/pull/631/files.
Thank you, that sounds good. I will rebase and refactor the Python dependencies once the EKS example is merged. Currently it is hard to reuse them across PRs, and I don't think we should merge the two into a single giant PR.
Sounds good! Thank you @Captainia for working on it.