DeepSpeed
DeepSpeedZeroOptimizer: refactor bit16 flattening to support more accelerators
Until now, the approach replaced each torch.nn.Parameter's data with new CPU storage in order to offload device memory: all params were flattened on the host and then moved to the device. On some accelerators, however, a torch.nn.Parameter that lives on the device cannot be assigned CPU storage. This PR instead copies the param data into a new CPU tensor and shrinks the device storage. Later, when the flat buffer is moved to the device, param.data becomes a view into the flat buffer.
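
A minimal sketch of the refactored flow, assuming torch's internal flattening helpers; the function name `flatten_params_via_cpu` and its structure are illustrative, not DeepSpeed's actual code:

```python
import torch
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def flatten_params_via_cpu(params, device):
    # 1. Copy each param's data into a fresh CPU tensor instead of
    #    reassigning param.data to CPU storage (which some accelerators
    #    reject for device parameters).
    cpu_copies = [p.data.to("cpu", copy=True) for p in params]

    # 2. Shrink the original device storage to free accelerator memory.
    for p in params:
        p.data = torch.empty(0, dtype=p.dtype, device=p.device)

    # 3. Flatten on the host, then move the single flat buffer to the device.
    flat_buffer = _flatten_dense_tensors(cpu_copies).to(device)

    # 4. Point each param.data at its view into the device flat buffer
    #    (cpu_copies supply the original shapes for unflattening).
    views = _unflatten_dense_tensors(flat_buffer, cpu_copies)
    for p, view in zip(params, views):
        p.data = view

    return flat_buffer
```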