[Docs] Update low-precision training docs for MS-AMP

Open shimizust opened this issue 5 months ago • 3 comments

Hi team, I wanted to suggest updating the low-precision training docs, especially the parts related to MS-AMP, which no longer seems to be maintained. When I was trying to get this working, I ran into a few issues:

  • MS-AMP requires a container with it pre-installed. While they provide base images, these are built on much older CUDA versions (11.8/12.1), so using a newer CUDA version requires building MS-AMP from source
  • MS-AMP requires MSCCL, a fork of NCCL, for multi-GPU communication with FP8. MSCCL is based on a much older version of NCCL and hasn't been updated in 2-3 years, which leads to symbol conflicts when using more recent versions of PyTorch built against newer NCCL

Are these valid? If so, I would recommend flagging MS-AMP as deprecated and directing users to TE or other approaches.

shimizust avatar Jun 19 '25 16:06 shimizust

cc @muellerzr in case you have any insights, since you added this. A good first step is probably to advise users to try TE or torchao for FP8. Indeed, it looks like MS-AMP is not maintained anymore, which can make it hard to work with, as you experienced.

SunMarc avatar Jun 20 '25 12:06 SunMarc

Yeah, last I checked I had wanted to deprecate/remove MS-AMP since it's no longer maintained. If you want to get on that before I'm back, @SunMarc, feel free :D (~6 versions/mo out I think is fine, since it breaks for users anyway)

muellerzr avatar Jun 22 '25 20:06 muellerzr

Perfect, thanks for the confirmation! I'll do that soon.

SunMarc avatar Jun 23 '25 09:06 SunMarc

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 20 '25 15:07 github-actions[bot]

Hi @SunMarc, I would like to update the docs for this. I will leave the details on MS-AMP for historical context but highlight the challenges, recommend TE, and remove the language on using both in combination. Please assign the ticket to me and let me know if any related code changes are needed. Thanks!

subhash686 avatar Aug 15 '25 06:08 subhash686