
[REQUEST] parallelize zero_to_fp32.py to use multiple cpu-cores and threads

Open stas00 opened this issue 1 year ago • 4 comments

When https://github.com/microsoft/DeepSpeed/blob/c27483933d50a693fef9c48418d2664cf6a6a6f8/deepspeed/utils/zero_to_fp32.py was written 3 years ago, models were small and converted quickly. Now, with 70B+ models, the conversion can take hours.

The original script uses a single CPU core.

Here is a possible implementation algorithm:

The way I was thinking multiple cores could be utilized: load all shards into CPU memory, then fire off multiple threads, each re-composing a single layer. The user could specify how many cores to use; by default all cores would be used, so that n_threads == cores. I think the total memory usage here will still be 2x model size * dtype, just like in the original script.
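A minimal sketch of that idea, not the actual zero_to_fp32.py internals: `load_all_shards`, `list_layer_names`, and `unshard_layer` are hypothetical helpers standing in for the existing shard-loading and per-layer reconstruction logic.

```python
# Hypothetical sketch of per-layer parallel recomposition.
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_unshard(checkpoint_dir, num_workers=None):
    # Default to all available cores, as suggested above.
    num_workers = num_workers or os.cpu_count()

    # Load every ZeRO shard into CPU memory once (same ~2x footprint as today).
    shards = load_all_shards(checkpoint_dir)        # hypothetical helper
    layer_names = list_layer_names(shards)          # hypothetical helper

    state_dict = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Each worker re-composes a single layer from its partitions.
        futures = {pool.submit(unshard_layer, shards, name): name
                   for name in layer_names}
        for fut, name in futures.items():
            state_dict[name] = fut.result()
    return state_dict
```

Threads (rather than processes) may already help here, since large torch tensor copies and concatenations release the GIL; a process pool would sidestep the GIL entirely but duplicates shard memory unless shared memory is used.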

Possible additional changes:

  • Using safetensors would be a bonus because then each tensor could be written separately, with no need to wait for the whole model to be unsharded before writing a single torch tensor. This could also become an option for low-RAM nodes, where each layer is unsharded sequentially and total memory usage would be 1x model size * dtype + max layer size * dtype, which for a large model would be a huge memory saving, at the cost of not parallelizing - or perhaps using just 1-2 threads, which would already speed things up. (See the sketch after this list.)
  • Switching to the universal checkpoint API would be another bonus, because the original is very clunky and very difficult to understand/maintain.
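A rough sketch of the low-RAM safetensors path, assuming the same hypothetical `list_layer_names` / `unshard_layer` helpers as above: each layer is unsharded sequentially and flushed to a safetensors shard right away, so only a small buffer of recomposed tensors is resident at any time.

```python
# Hypothetical sketch: sequential unsharding with incremental safetensors output.
import json
from safetensors.torch import save_file


def sequential_unshard_to_safetensors(shards, output_dir, layers_per_file=8):
    layer_names = list_layer_names(shards)          # hypothetical helper
    weight_map, buffer, file_idx = {}, {}, 0

    def flush(buffer, file_idx):
        fname = f"model-{file_idx:05d}.safetensors"
        save_file(buffer, f"{output_dir}/{fname}")
        for name in buffer:
            weight_map[name] = fname

    for i, name in enumerate(layer_names, 1):
        buffer[name] = unshard_layer(shards, name)  # hypothetical helper
        if i % layers_per_file == 0:
            flush(buffer, file_idx)
            buffer, file_idx = {}, file_idx + 1
    if buffer:
        flush(buffer, file_idx)

    # Hugging Face-style index so downstream loaders can locate each tensor.
    with open(f"{output_dir}/model.safetensors.index.json", "w") as f:
        json.dump({"weight_map": weight_map}, f)
```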

cc: @tjruwase

stas00 avatar Sep 11 '24 22:09 stas00

@tjruwase, has the work started on this? Thank you!

stas00 avatar Sep 30 '24 02:09 stas00

@stas00, yes work has started thanks to @xylian86 and @minjiazhang.

tjruwase avatar Oct 04 '24 13:10 tjruwase

@stas00 Hi Stas, I am working on it. My idea is to switch to the universal checkpoint API so that the new version of the script can support a broader range of parallelism strategies, including PP, TP, and ZeRO-DP (the current version only supports ZeRO-DP).

Here's an overview of the planned improvements. Please let me know if you have additional questions regarding these updates.

  • [ ] Switch to universal checkpoint API
  • [ ] Support Frozen Parameters
  • [ ] Support Shared Parameters
  • [ ] Add support for output to SafeTensors (refer to PR #6579)
  • [ ] Add support for output to FP16/BF16

xylian86 avatar Oct 14 '24 03:10 xylian86

That's great news, @xylian86 - there are quite a few folks hoping to speed up their large checkpoint conversion. So thank you for working on that!

Your plan looks great to me!

stas00 avatar Oct 14 '24 03:10 stas00