
Microsoft Automatic Mixed Precision Library

31 MS-AMP issues

@tocean @wkcn In line with the investigation in https://github.com/NVIDIA/TransformerEngine/issues/424, it would be great to get insights from the team at Microsoft on using FP8 in aspects of training besides...

**What would you like to be added**: Tune the scaling factor automatically for FP8 collective communication. **Why is this needed**: Reducing the scaling factor to the minimum value across all GPUs may...
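The overflow concern behind this request can be sketched numerically: before an FP8 all-reduce, every rank must quantize with a common scaling factor, and the minimum scale across ranks (equivalently, the scale derived from the global amax) is the safe choice. A minimal illustration, assuming E4M3 FP8 with a maximum representable magnitude of 448; the per-GPU amax values and the `scale_for` helper are hypothetical:

```python
E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def scale_for(amax):
    # Per-tensor scaling factor: map the observed amax onto the FP8 range.
    return E4M3_MAX / amax

# Hypothetical per-GPU amax values for the same gradient tensor.
amaxes = [1.5, 12.0, 3.2, 0.7]
scales = [scale_for(a) for a in amaxes]

# Before an FP8 all-reduce, every rank must quantize with the SAME scale.
# The minimum scale (i.e. the one derived from the global amax) is the
# only choice that guarantees no rank overflows:
safe_scale = min(scales)

for a in amaxes:
    quantized = a * safe_scale   # quantized magnitude on each rank
    assert quantized <= E4M3_MAX  # no overflow anywhere
```

In a real setup this `min` would be computed with a collective (e.g. a MIN all-reduce over the local scales) before the FP8 communication itself, which is the automation this issue asks for.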

**What would you like to be added**: Move extension installation from the post-install step to setup.py under the project root folder. **Why is this needed**: Extensions are part of MS-AMP and should...

**What's the issue, what's expected?**: Error when using MS-AMP to do LLM SFT. MS-AMP DeepSpeed config: "msamp": { "enabled": true, "opt_level": "O1|O2|O3", # all tried "use_te": false } **How to...
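For context, the `"O1|O2|O3"` value in the snippet above is the reporter's shorthand for having tried each level in turn; a valid config sets a single optimization level. A minimal sketch of the relevant DeepSpeed config fragment, assuming the `msamp` section layout shown in the issue (the `train_batch_size` value is a placeholder):

```json
{
  "train_batch_size": 32,
  "msamp": {
    "enabled": true,
    "opt_level": "O2",
    "use_te": false
  }
}
```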

**What would you like to be added**: Integrate MS-AMP with PyTorch Lightning **Why is this needed**: MS-AMP shows huge gains in throughput when training in FP8. That's very exciting. Adoption...

Hello, I hope this message finds you well. I am a user of MS-AMP and have found it to be incredibly useful in my work with large-scale language...

**Description** Avoid running workflows on self-hosted nodes: 1. switch image builds to GitHub runners; 2. remove the UT workflow.

**What's the issue, what's expected?**: The compilation gets stuck at `./MS-AMP/third_party/msccl/build/obj/collectives/device/msccl_kernel.o` **How to reproduce it?**: `do the steps from the doc` **Log message or snapshot?**: ``` Compiling msccl_kernel.cu > .../MS-AMP/third_party/msccl/build/obj/collectives/device/msccl_kernel.o ```...

**Description** This PR makes it easier for users to use FSDP with MS-AMP from their existing optimizers. This is especially beneficial for library authors, as currently we need to go...

**What's the issue, what's expected?**: There are attributes inside regular `deepspeed.runtime` that are missing in this repo and that the monkey-patch doesn't cover, such as: ```python from deepspeed.runtime.lr_schedules import VALID_LR_SCHEDULES...