DeepSpeed
[REQUEST] DeepSpeed Zero3 swaps out gradients unnecessarily when swap_optimizer is True
According to the function `DeepSpeedZeroOptimizer_Stage3.partition_grads` in https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L1194, gradients are accumulated into tensors held by `self.__param_id_to_grad_partition`, so why bother swapping gradients out at the gradient accumulation boundary? https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L1258
```python
if self.offload_optimizer and self.swap_optimizer:
    for i in offload_fp32_gradients.keys():
        self.optimizer_swapper.swap_out_gradients(parameter=self.fp32_partitioned_groups_flat[i],
                                                  gradient_offsets=offload_fp32_offsets[i],
                                                  gradient_tensors=offload_fp32_gradients[i])
```
I'm quite confused by this; please help me out. Thanks!
@tjruwase
The `offload_fp32_gradients` that are swapped out are the fp32 versions of the gradients needed for the `optimizer.step()` computation. On the other hand, `self.__param_id_to_grad_partition` holds 16-bit gradients.
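To make the distinction concrete, here is a toy sketch (plain PyTorch, not DeepSpeed code; the buffer names are only meant to mirror the ones discussed above) of the two gradient buffers in question:

```python
import torch

# Resident 16-bit gradient partition: backward() accumulates into this buffer
# across micro-batches (standing in for self.__param_id_to_grad_partition[ds_id]).
grad16 = torch.zeros(1024, dtype=torch.bfloat16)
grad16 += torch.randn(1024, dtype=torch.bfloat16)  # one accumulation step

# At the gradient accumulation boundary an fp32 copy is produced for
# optimizer.step(); with swap_optimizer=True this copy is what gets swapped
# out to NVMe (standing in for an entry of offload_fp32_gradients[i]).
grad32 = grad16.to(torch.float32)
```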
I mean, getting the fp16/bf16 gradients from `self.__param_id_to_grad_partition` and copying them into `self.fp32_partitioned_groups_flat[i].grad` should work well, with no need to swap gradients out:
```python
self._prepare_sub_group(sub_group_id, timer_names)
...
# prepare self.fp32_partitioned_groups_flat[i].grad from the resident 16-bit gradient partition
src_grad = self.__param_id_to_grad_partition[param.ds_id].narrow(0, 0, param.partition_numel)
dest_offset = ...  # offset of this param's partition within the flat fp32 group
dest_grad = self.fp32_partitioned_groups_flat[i].grad.narrow(0, dest_offset, param.partition_numel)
dest_grad.copy_(src_grad)  # upcasts the 16-bit gradient to fp32 during the copy
...
self._optimizer_step(sub_group_id)
```
The swap mode is enabled when optimizer state is offloaded to nvme because both GPU and CPU memory are too small. In that case, there is little benefit to keeping fp32 gradients in GPU/CPU to avoid swap overhead since:
- fp32 gradients account for only 1/4 of optimizer footprint and swap traffic.
- Keeping them resident could result in OOM and restrict the supported model size.
Offloading to NVMe is designed for extreme model scaling on a limited hardware budget, and in such cases gradient swapping is not a significant perf bottleneck.
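For concreteness, a quick back-of-the-envelope for the 1/4 figure, assuming Adam with fp32 master weights (the usual ZeRO mixed-precision setup):

```python
# Per-parameter bytes in the swapped optimizer state (Adam, fp32 master weights).
fp32_param = 4      # master copy of the weight
fp32_momentum = 4   # Adam exp_avg
fp32_variance = 4   # Adam exp_avg_sq
fp32_grad = 4       # fp32 gradient consumed by optimizer.step()

total = fp32_param + fp32_momentum + fp32_variance + fp32_grad  # 16 bytes/param
print(fp32_grad / total)  # 0.25 -> fp32 gradients are ~1/4 of the footprint and swap traffic
```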
I didn't make myself clear before. I'm not trying to keep fp32 gradients in GPU/CPU to avoid swap overhead. In fact, the fp32 gradients are not necessary at all, because they are essentially identical to the 16-bit gradients, and the 16-bit gradients are always kept in GPU/CPU. So the fp32 gradients act as a temporary variable around `optimizer.step()`: they could be created by converting the corresponding 16-bit gradients, and released immediately after calling `optimizer.step()`.
In other words, swapping the gradients out and then swapping them back in with the optimizer states is almost identical to `self.__param_id_to_grad_partition[param.ds_id].narrow(0, offset, numel).to(torch.float)`.
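As a rough sketch of that idea (a hypothetical helper, not the DeepSpeed API; it only illustrates the create-upcast-step-release pattern being proposed):

```python
import torch

def step_with_on_demand_fp32_grad(optimizer, fp32_flat_param, grad16_partition, offset, numel):
    # Upcast the resident 16-bit gradient slice into a temporary fp32 gradient.
    fp32_flat_param.grad = grad16_partition.narrow(0, offset, numel).to(torch.float32)
    optimizer.step()
    # Release the temporary fp32 gradient right after the step instead of
    # swapping it out to / back in from NVMe.
    fp32_flat_param.grad = None
```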
@platoonpluto, thanks for the clarification. Yes, your observation is correct: we could avoid the swap overhead since the 16-bit gradients are always available in `self.__param_id_to_grad_partition`.
One reason for my initial confusion is that the design was not intended to persist 16-bit gradients in CPU memory when optimizer swapping is enabled, so this is a deviation in the implementation that I had overlooked/forgotten. What to do about this is a little tricky, because we do want to have an offloading mode where the entire training state (parameters, gradients, and optimizer) is persisted on nvme and swapped in on demand. Your observation shows that this is not true for 16-bit gradients. I will consult with the team and update this thread asap. Thanks for catching this.
@tjruwase thanks for your explanation. By the way, is there any detailed design documentation for DeepSpeed? It's quite difficult to follow all the details; there is a gap between the DeepSpeed implementation and the paper "ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning".