Victor Zhu issues

Results: 7 issues by Victor Zhu

Add GPT2 model train script with HuggingFace Trainer and SageMaker Model Parallel

*Description of changes:* Add GPT2 model train script with HuggingFace Trainer and SageMaker Model Parallel ## Merge Checklist - [x] I have read the [CONTRIBUTING](https://github.com/aws/amazon-sagemaker-examples/blob/master/CONTRIBUTING.md) doc and adhered to the...

SageMaker Sharded Data Parallel Support for Trainer

# What does this PR do? This PR adds support for SageMaker Sharded Data Parallel with SMP version >= 1.15. We mainly follow DeepSpeed's checkpointing logic in our integration. When...

Add Bloom 560m parameter model example using SageMaker Model Parallel…

… with Sharded Data Parallelism through a custom SMP Trainer. This example shows you how to use SMP Trainer as a drop-in replacement for HuggingFace Trainer to enable Sharded Data...

Replacing nn.Linear w/ te.Linear FP8 convergence issue

Hi, I'm seeing higher losses using `te.Linear` over `nn.Linear` directly in transformer models such as Llama which I assume is expected due to the nature of FP8. However, I don't...

Update SMPv2 conda setup script with latest PT2.3.1 TSM2.4.0

*Issue #, if available:* *Description of changes:* Update conda environment setup to install latest PT2.3.1 TSM2.4.0 conda package and relevant dependencies. By submitting this pull request, I confirm that you...

Update model parallel v2 example notebooks for latest PT-2.3-TSM-2.5 release

*Issue #, if available:* *Description of changes:* *Testing done:* ## Merge Checklist _Put an `x` in the boxes that apply. You can also fill these out after creating the PR....

[BUG] Loss difference when training with FP8 vs. BF16 MoE

**Describe the bug** When enabling FP8 mixed precision during training of a Mixtral model (`SequentialMLP` expert layer), we are observing that training and validation losses differ more than expected. **To...