[REQUEST] parallelize zero_to_fp32.py to use multiple cpu-cores and threads
When https://github.com/microsoft/DeepSpeed/blob/c27483933d50a693fef9c48418d2664cf6a6a6f8/deepspeed/utils/zero_to_fp32.py was written three years ago, models were small and converted quickly. Now, with 70B+ models, the conversion can take hours.
The original script uses a single CPU core.
Here is a possible implementation algorithm:
The way I was thinking multiple cores could be utilized: load all shards into CPU memory and then fire off multiple threads, each re-composing a single layer. The user could specify how many cores to use, or by default all cores would be used, so that n_threads == cores. I think the total memory usage here would still be 2x model size * dtype, just like in the original script.
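A minimal sketch of that thread-pool approach, assuming the shards are already loaded as a list of per-rank state dicts (`unshard_layer` and `parallel_unshard` are hypothetical names for illustration, not functions from the script):

```python
import os
from concurrent.futures import ThreadPoolExecutor

import torch

def unshard_layer(layer_name, shards):
    # Hypothetical re-composition: concatenate a layer's flat fp32
    # fragments from every ZeRO shard back into one tensor. PyTorch
    # ops generally release the GIL, so threads can run the heavy
    # concatenation work in parallel.
    return torch.cat([shard[layer_name] for shard in shards], dim=0)

def parallel_unshard(shards, layer_names, n_threads=None):
    # Default: n_threads == cores, as proposed above.
    n_threads = n_threads or os.cpu_count()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = {name: pool.submit(unshard_layer, name, shards)
                   for name in layer_names}
    return {name: f.result() for name, f in futures.items()}
```

Peak memory stays at roughly 2x model size * dtype, since the loaded shards and the re-composed tensors coexist until the output is written.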
Possible additional changes:
- Using `safetensors` would be a bonus, because then each tensor could be written separately and there would be no need to wait for the whole model to be unsharded before writing a single torch tensor. This could also become an option for low-RAM nodes, where each layer is unsharded sequentially and total memory usage would be 1x model size * dtype + max layer size * dtype, which for a large model would be a huge memory saving, at the cost of not parallelizing - or perhaps using just 1-2 threads, which would already speed things up (see the sketch after this list).
- Switching to the universal checkpoint API would be another bonus, because the original is very clunky and very difficult to understand/maintain.
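A rough sketch of that low-RAM sequential mode, using the real `safetensors.torch.save_file` API but the same hypothetical `unshard_layer` helper as above, with one output file per layer:

```python
from safetensors.torch import save_file

def sequential_unshard_to_safetensors(shards, layer_names, out_dir):
    # Low-RAM path: re-compose one layer at a time and flush it to
    # disk immediately, so only the loaded shards (1x model size)
    # plus the single largest layer live in memory at once.
    for i, name in enumerate(layer_names):
        tensor = unshard_layer(name, shards)  # placeholder from the sketch above
        save_file({name: tensor}, f"{out_dir}/layer_{i:05d}.safetensors")
        del tensor  # free the layer before re-composing the next one
```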
cc: @tjruwase
@tjruwase, has the work started on this? Thank you!
@stas00, yes work has started thanks to @xylian86 and @minjiazhang.
@stas00 Hi Stas, I am working on it. My idea is to switch to the universal checkpoint API so that the new version of the script can support a broader range of parallelism strategies, including PP, TP, and ZeRO-DP (the current version only supports ZeRO-DP).
Here's an overview of the planned improvements. Please let me know if you have additional questions regarding these updates.
- [ ] Switch to universal checkpoint API
- [ ] Support Frozen Parameters
- [ ] Support Shared Parameters
- [ ] Add support for SafeTensors output (refer to PR #6579)
- [ ] Add support for FP16/BF16 output (see the dtype sketch below)
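For the FP16/BF16 item, the conversion itself could be a small cast over the recovered fp32 state dict (a sketch only; the `--dtype` flag is an assumption, not a committed interface):

```python
import torch

def cast_state_dict(state_dict, dtype=torch.bfloat16):
    # Cast the recovered fp32 state dict to a lower-precision dtype
    # before saving, e.g. behind a hypothetical --dtype flag.
    return {k: v.to(dtype) for k, v in state_dict.items()}
```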
That's great news, @xylian86 - there are quite a few folks hoping to speed up their large checkpoint conversion. So thank you for working on that!
Your plan looks great to me!