Nadav Elyahu
Allow large tensor serialization > 4B. Can reproduce this by running the attached files: 1. put both files in the same directory; 2. change the .txt extension to .py; 3. run...
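A minimal repro sketch of the kind described (not the attached files, which are elided here), assuming "> 4B" refers to tensors whose storage crosses the 2**32 boundary:

    import torch

    # A tensor with just over 2**32 one-byte elements (~4.3 GB of storage).
    big = torch.zeros(2**32 + 1, dtype=torch.uint8)
    torch.save(big, "big_tensor.pt")        # fails without large-tensor support
    restored = torch.load("big_tensor.pt")
    assert restored.numel() == big.numel()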
When running evaluation, overall memory consumption is reduced, mainly due to the absence of gradients and of retained FWD activations. This allows increasing the micro batch size and improving evaluation performance...
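A minimal sketch of the evaluation setup this refers to, assuming a plain PyTorch model and eval_loader (the DeepSpeed engine path may differ):

    import torch

    def evaluate(model, eval_loader):
        model.eval()                    # inference-mode layers (dropout off, etc.)
        with torch.no_grad():           # no gradient buffers; FWD activations are
            for batch in eval_loader:   # freed right away instead of kept for BWD,
                _ = model(batch)        # so the micro batch size can be increased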
The approach until today used the practice of replacing the torch.nn.Parameter data with new CPU data storage, to offload device memory. All params are flattened on...
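A hedged sketch of the practice described; offload_param_to_cpu is a hypothetical helper, not DeepSpeed's actual routine:

    import torch

    def offload_param_to_cpu(param: torch.nn.Parameter) -> None:
        # Swap the parameter's storage for a pinned CPU copy; the original
        # device storage is freed once no references to it remain.
        param.data = param.data.to("cpu").pin_memory()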
Was considering 4 bytes per model param and 4 bytes per gradient; fixed it to 2 bytes, under the assumption of FP16/BF16.
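A small sketch of the corrected arithmetic; the function name and signature are illustrative only, not the code touched by this change:

    def estimate_params_and_grads_bytes(num_params: int,
                                        bytes_per_param: int = 2,
                                        bytes_per_grad: int = 2) -> int:
        # FP16/BF16 assumption: 2 bytes per param and per gradient
        # (the previous estimate used 4 + 4, i.e. FP32 sizes).
        return num_params * (bytes_per_param + bytes_per_grad)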
When instantiating torch.device for HPU, it cannot be fed the "HPU:1" annotation, only "HPU". Moving the logic to the accelerator will solve this issue with a single-line change.
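A sketch of the accelerator-based pattern; get_accelerator().device_name() is DeepSpeed's accelerator abstraction, and the exact string each backend returns for a given index is up to its implementation:

    import torch
    from deepspeed.accelerator import get_accelerator

    # Instead of hard-coding torch.device(f"cuda:{idx}"), which breaks on HPU
    # (only "hpu" is accepted, not "hpu:1"), ask the accelerator for the string.
    device = torch.device(get_accelerator().device_name(1))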
Reverting the previous revert of this feature: https://github.com/nelyahu/DeepSpeed/commit/bc48371c5e1fb8fd70fc79285e66201dbb65679b. In addition, a bug fix for offload mode.
Till today only the last layer (idx=-1) was considered, using FINAL_LAYER_NORM_INDEX, which is set to -1. This PR allows the user to pass a custom value for models where this default value...
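A hedged sketch of the parameterization; get_final_norm_layer is a hypothetical accessor, not the PR's actual code:

    FINAL_LAYER_NORM_INDEX = -1   # previous hard-coded default

    def get_final_norm_layer(layers, final_norm_idx=FINAL_LAYER_NORM_INDEX):
        # Callers can now override the index for models whose final
        # layer norm is not the last entry in the layer list.
        return layers[final_norm_idx]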