Stas Bekman
Related: https://github.com/microsoft/DeepSpeed/issues/2811 https://github.com/microsoft/DeepSpeed/issues/2812
Looking at the rendering - the source formatting appears to be borked. It has `:param:` showing up literally, and the last section doesn't show up. And the doc is hard to read as...
Looking good now!
@tjruwase, thinking aloud here - also, what's the point of saving tensor placeholders if they are stripped of their ds metadata? This is again a source of confusion and...
You can see the original model weights here: https://huggingface.co/decapoda-research/llama-7b-hf/tree/main The weights total about 13GB in float16 (split across 33 files of roughly 400MB each) - but the code above...
We discussed that, and so the tentative solution is to provide a helper util that will clone the z1 and z2 models to CPU, which will undo the bloating. The...
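Roughly, such a helper might look something like this (a minimal sketch; the function name and interface are my assumptions, not the actual DeepSpeed util):

```python
import torch

def clone_state_dict_to_cpu(state_dict):
    """Return a copy of state_dict where every tensor is detached, cloned and
    moved to CPU, so each tensor owns its own right-sized storage before
    torch.save (instead of dragging along the big shared flat buffer)."""
    return {
        k: v.detach().clone().cpu() if torch.is_tensor(v) else v
        for k, v in state_dict.items()
    }

# hypothetical usage:
# torch.save(clone_state_dict_to_cpu(model.state_dict()), "pytorch_model.bin")
```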
Now to explain what happened. The problem happens because torch saves the full data storage, and if multiple tensors share it via `view()`, it has to save the full storage. So...
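Here is a quick self-contained repro of that behaviour (sizes are arbitrary, just for illustration):

```python
import io
import torch

big = torch.zeros(10_000_000)   # ~40MB of float32 storage
view = big[:10]                 # a 10-element view sharing big's storage

def saved_bytes(obj):
    buf = io.BytesIO()
    torch.save(obj, buf)
    return buf.tell()

print(saved_bytes(view))          # ~40MB - the whole shared storage gets serialized
print(saved_bytes(view.clone()))  # tiny - only the 10 elements plus metadata
```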
I tried it out - and when the checkpoint is saved, I get almost all frozen weights saved with `size[0]`:
```
python tools/convert_checkpoint/inspect_checkpoint.py /hf/m4-master-3/save_dir/opt_step-10/accelerator_state/pytorch_model/zero_pp_rank_0_mp_rank_00_model_states.pt
loading checkpoint file: /hf/m4-master-3/save_dir/opt_step-10/accelerator_state/pytorch_model/zero_pp_rank_0_mp_rank_00_model_states.pt
[tensor] module.lm_head.weight...
```
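For reference, this is roughly what that inspection boils down to (a sketch; the checkpoint layout with a `module` key is an assumption based on the file above):

```python
import torch

ckpt = torch.load("zero_pp_rank_0_mp_rank_00_model_states.pt", map_location="cpu")
for name, t in ckpt["module"].items():
    if torch.is_tensor(t):
        # frozen weights show up here with shape torch.Size([0])
        print(f"[tensor] {name}: {tuple(t.shape)}")
```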
I'm also wondering: would this even work for a huge model with a lot of frozen params? There might not be enough memory to gather them all. Perhaps...
Thank you for the quick fix and merge, Tunji and the team!