[BUG] ZeRO++ sharding of a small parameter raises IndexError
Describe the bug
Our model has a small parameter with shape torch.Size([32]). When ZeRO++ is enabled, the following error is raised:
- world_size: 2048
- zero_hpz_partition_size: 16
File "/usr/local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1344, in prepare
result = self._prepare_deepspeed(*args)
File "/usr/local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1851, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = ds_initialize(**kwargs)
File "/usr/local/lib/python3.8/site-packages/deepspeed/__init__.py", line 181, in initialize
engine = DeepSpeedEngine(args=args,
File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 313, in __init__
self.optimizer = self._configure_zero_optimizer(optimizer=None)
File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1590, in _configure_zero_optimizer
optimizer = DeepSpeedZeRoOffload(
File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 119, in __init__
self._convert_to_zero_parameters(ds_config, module, mpu)
File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 194, in _convert_to_zero_parameters
Init(module=module,
File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1017, in __init__
self._convert_to_zero_parameters(module.parameters(recurse=True))
File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1048, in _convert_to_zero_parameters
self._zero_init_param(param)
File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1040, in _zero_init_param
param.partition()
File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1375, in partition
self._partition(param_list, has_been_updated=has_been_updated)
File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1523, in _partition
self._partition_param_sec(param)
File "/usr/local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1696, in _partition_param_sec
sec_numel).copy_(one_dim_param.narrow(0, secondary_start, sec_numel))
IndexError: start out of range (expected to be in range of [-32, 32], but got 1920)
...
IndexError: start out of range (expected to be in range of [-32, 32], but got 1152)
IndexError: start out of range (expected to be in range of [-32, 32], but got 640)
IndexError: start out of range (expected to be in range of [-32, 32], but got 128)
IndexError: start out of range (expected to be in range of [-32, 32], but got 384)
IndexError: start out of range (expected to be in range of [-32, 32], but got 896)
IndexError: start out of range (expected to be in range of [-32, 32], but got 1792)
IndexError: start out of range (expected to be in range of [-32, 32], but got 1280)
IndexError: start out of range (expected to be in range of [-32, 32], but got 1536)
IndexError: start out of range (expected to be in range of [-32, 32], but got 1408)
IndexError: start out of range (expected to be in range of [-32, 32], but got 256)
IndexError: start out of range (expected to be in range of [-32, 32], but got 1024)
IndexError: start out of range (expected to be in range of [-32, 32], but got 768)
Here is our DeepSpeed config:
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "communication_data_type": "fp32",
  "gradient_clipping": 1.0,
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "zero_hpz_partition_size": 16,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
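For completeness, here is a minimal repro sketch. It is an assumption-laden simplification of our real setup (which goes through accelerate with the config above): the TinyModel, the concrete config values, and the 32-element parameter are placeholders, and it has to be launched with the deepspeed launcher on a world size that is a multiple of zero_hpz_partition_size and larger than the parameter's numel.

```python
# Hypothetical minimal repro sketch; model and config values are placeholders,
# not taken from our actual training job.
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "zero_hpz_partition_size": 16,  # ZeRO++ hierarchical (secondary) partition size
    },
}

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(1024, 1024)
        # The problematic small parameter: numel (32) < dp world size.
        self.scale = torch.nn.Parameter(torch.ones(32))

    def forward(self, x):
        return self.proj(x) * self.scale.mean()

model = TinyModel()
# With a large enough world size, initialize() triggers ZeRO-3 parameter
# partitioning, and the secondary (hpZ) partition of `scale` hits the
# IndexError shown in the traceback above.
engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
```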
I think the problem is that when partitioning the secondary param copy, the param is aligned to dp_world_size instead of zero_hpz_partition_size, which then causes torch.narrow to raise an IndexError.
https://github.com/microsoft/DeepSpeed/blob/v0.14.5/deepspeed/runtime/zero/partition_parameters.py#L1662
In my case, torch.Size([32]) is aligned to 2048, so secondary_partition_size is 2048 / 16 = 128, and one_dim_param.narrow(0, 128, 128) is out of range.
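To illustrate the arithmetic, here is a minimal sketch that mirrors the numbers above; the alignment step is a simplified stand-in for DeepSpeed's internal padding, not the actual implementation:

```python
import torch

# Numbers from the report above.
param_numel = 32           # torch.Size([32])
dp_world_size = 2048
hpz_partition_size = 16    # zero_hpz_partition_size

# The flat parameter is padded so its aligned size is a multiple of dp_world_size
# (simplified here; DeepSpeed computes this internally).
aligned_numel = ((param_numel + dp_world_size - 1) // dp_world_size) * dp_world_size  # 2048
secondary_partition_size = aligned_numel // hpz_partition_size                        # 128

# The narrow() in _partition_param_sec runs against the *unpadded* parameter,
# so for any rank_in_group > 0 the start offset already exceeds the real numel.
one_dim_param = torch.zeros(param_numel)
rank_in_group = 1
secondary_start = rank_in_group * secondary_partition_size                            # 128
one_dim_param.narrow(0, secondary_start, secondary_partition_size)
# IndexError: start out of range (expected to be in range of [-32, 32], but got 128)
```

The different offsets in the traceback (128, 256, ..., 1920) are just this same start value on the other ranks in the secondary group.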
@HeyangQin
Hi, have you solved this problem?
Hi, did you find a solution to this? @GuanhuaWang, does the ZeRO++ team have any advice for this issue?