Stas Bekman
Related: https://github.com/microsoft/DeepSpeed/issues/2811 https://github.com/microsoft/DeepSpeed/issues/2812
Looking at the rendering - the source formatting appears to be borked. It has `:param:` showing up literally, and the last section doesn't show up. And the doc is hard to read as...
Looking good now!
@tjruwase, thinking aloud here - also, what's the point of saving tensor placeholders if they are stripped of their ds metadata? This is again a source of confusion and...
You can see the original model weights here: https://huggingface.co/decapoda-research/llama-7b-hf/tree/main The weights total about 13GB in float16 (split across 33 files of roughly 400MB each) - but the code above...
We discussed that, and so the tentative solution is to provide a helper util that will clone the z1 and z2 models to CPU, which will undo the bloating. The...
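Roughly, such a helper might look something like this (a minimal sketch; the function name and interface are my assumptions, not the actual DeepSpeed util):

```python
import torch

def clone_state_dict_to_cpu(state_dict):
    """Return a copy of state_dict where every tensor is detached, cloned and
    moved to CPU, so each tensor owns its own right-sized storage before
    torch.save (instead of dragging along the big shared flat buffer)."""
    return {
        k: v.detach().clone().cpu() if torch.is_tensor(v) else v
        for k, v in state_dict.items()
    }

# hypothetical usage:
# torch.save(clone_state_dict_to_cpu(model.state_dict()), "pytorch_model.bin")
```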
Now to explain what happened. The problem happens because torch saves the full data storage, and if multiple tensors share it via `view()`, it has to save the full storage. So...
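Here is a quick self-contained repro of that behaviour (sizes are arbitrary, just for illustration):

```python
import io
import torch

big = torch.zeros(10_000_000)   # ~40MB of float32 storage
view = big[:10]                 # a 10-element view sharing big's storage

def saved_bytes(obj):
    buf = io.BytesIO()
    torch.save(obj, buf)
    return buf.tell()

print(saved_bytes(view))          # ~40MB - the whole shared storage gets serialized
print(saved_bytes(view.clone()))  # tiny - only the 10 elements plus metadata
```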
I tried it out - and when the checkpoint is saved, I get almost all frozen weights saved with `size[0]`:
```
python tools/convert_checkpoint/inspect_checkpoint.py /hf/m4-master-3/save_dir/opt_step-10/accelerator_state/pytorch_model/zero_pp_rank_0_mp_rank_00_model_states.pt
loading checkpoint file: /hf/m4-master-3/save_dir/opt_step-10/accelerator_state/pytorch_model/zero_pp_rank_0_mp_rank_00_model_states.pt
[tensor] module.lm_head.weight...
```
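For reference, this is roughly what that inspection boils down to (a sketch; the checkpoint layout with a `module` key is an assumption based on the file above):

```python
import torch

ckpt = torch.load("zero_pp_rank_0_mp_rank_00_model_states.pt", map_location="cpu")
for name, t in ckpt["module"].items():
    if torch.is_tensor(t):
        # frozen weights show up here with shape torch.Size([0])
        print(f"[tensor] {name}: {tuple(t.shape)}")
```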
I'm also wondering: would this even work for a huge model with a lot of frozen params? There might not be enough memory to gather them all. Perhaps...
Thank you for the quick fix and merge, Tunji and the team!