Nadav Elyahu
Allow large tensor serialization > 4B. Can reproduce this by running the attached files: 1. put both files in the same directory; 2. change the .txt extension to .py; 3. run...
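A minimal repro sketch of the kind described (not the attached files, which are elided here), assuming "> 4B" refers to tensors whose storage crosses the 2**32 boundary:

    import torch

    # A tensor with just over 2**32 one-byte elements (~4.3 GB of storage).
    big = torch.zeros(2**32 + 1, dtype=torch.uint8)
    torch.save(big, "big_tensor.pt")        # fails without large-tensor support
    restored = torch.load("big_tensor.pt")
    assert restored.numel() == big.numel()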
When running evaluation, overall memory consumption is reduced, mainly due to the absence of gradients and of retained FWD activations. This allows increasing the micro batch size and improving evaluation performance...
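A minimal sketch of the evaluation setup this refers to, assuming a plain PyTorch model and eval_loader (the DeepSpeed engine path may differ):

    import torch

    def evaluate(model, eval_loader):
        model.eval()                    # inference-mode layers (dropout off, etc.)
        with torch.no_grad():           # no gradient buffers; FWD activations are
            for batch in eval_loader:   # freed right away instead of kept for BWD,
                _ = model(batch)        # so the micro batch size can be increased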
The approach until today used the practice of replacing the torch.nn.Parameter data with new CPU data storage, to offload device memory. All params are flattened on...
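A hedged sketch of the practice described; offload_param_to_cpu is a hypothetical helper, not DeepSpeed's actual routine:

    import torch

    def offload_param_to_cpu(param: torch.nn.Parameter) -> None:
        # Swap the parameter's storage for a pinned CPU copy; the original
        # device storage is freed once no references to it remain.
        param.data = param.data.to("cpu").pin_memory()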
Was considering 4 bytes per model param and 4 bytes per gradient; fixed it to 2 bytes, under the assumption of FP16/BF16.
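A small sketch of the corrected arithmetic; the function name and signature are illustrative only, not the code touched by this change:

    def estimate_params_and_grads_bytes(num_params: int,
                                        bytes_per_param: int = 2,
                                        bytes_per_grad: int = 2) -> int:
        # FP16/BF16 assumption: 2 bytes per param and per gradient
        # (the previous estimate used 4 + 4, i.e. FP32 sizes).
        return num_params * (bytes_per_param + bytes_per_grad)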
When instantiating torch.device for HPU, it cannot be fed the "HPU:1" annotation, only "HPU". Moving the logic to the accelerator will solve this issue with a single-line change.
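A sketch of the accelerator-based pattern; get_accelerator().device_name() is DeepSpeed's accelerator abstraction, and the exact string each backend returns for a given index is up to its implementation:

    import torch
    from deepspeed.accelerator import get_accelerator

    # Instead of hard-coding torch.device(f"cuda:{idx}"), which breaks on HPU
    # (only "hpu" is accepted, not "hpu:1"), ask the accelerator for the string.
    device = torch.device(get_accelerator().device_name(1))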
Reverting the previous revert of this feature: https://github.com/nelyahu/DeepSpeed/commit/bc48371c5e1fb8fd70fc79285e66201dbb65679b. In addition, a bug fix for offload mode.
Till today only the last layer (idx=-1) was considered, using FINAL_LAYER_NORM_INDEX, which is set to -1. This PR allows the user to pass a custom value for models where this default value...
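A hedged sketch of the parameterization; get_final_norm_layer is a hypothetical accessor, not the PR's actual code:

    FINAL_LAYER_NORM_INDEX = -1   # previous hard-coded default

    def get_final_norm_layer(layers, final_norm_idx=FINAL_LAYER_NORM_INDEX):
        # Callers can now override the index for models whose final
        # layer norm is not the last entry in the layer list.
        return layers[final_norm_idx]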