[BUG] (NVMe Offload with Zero3) Not enough buffers 0 for swapping 1
Hi, I am currently trying the off-the-shelf transformers example with DeepSpeed:
BS=4; PYTHONPATH=src USE_TF=0 deepspeed examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-11b --output_dir /tmp/zero3 --overwrite_output_dir --max_train_samples 64 \
--max_eval_samples 64 --max_source_length 128 --max_target_length 128 --val_max_target_length 128 \
--do_train --num_train_epochs 8 --per_device_train_batch_size $BS --per_device_eval_batch_size $BS \
--learning_rate 3e-3 --warmup_steps 500 --predict_with_generate --logging_steps 10 --save_steps 0 \
--eval_steps 5 --group_by_length --dataset_name wmt16 --dataset_config ro-en --source_lang en \
--target_lang ro --source_prefix "translate English to Romanian: " \
--deepspeed tests/deepspeed/ds_config_zero3_nvme_offload.json
The config file ds_config_zero3_nvme_offload.json contains the ZeRO-3 NVMe-offload parameters from the main documentation (https://huggingface.co/docs/transformers/main_classes/deepspeed#zero3-example):
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": true,
            "buffer_count": 4,
            "fast_init": false
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": true,
            "buffer_count": 5,
            "buffer_size": 1e8,
            "max_in_cpu": 1e9
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "aio": {
        "block_size": 262144,
        "queue_depth": 32,
        "thread_count": 1,
        "single_submit": false,
        "overlap_events": true
    }
}
I get the following error:
Not enough swap in buffers 0 for 1 params, ids = [258]
Num inflight: params 0, buffers 0, numel = 0
Num available params: count = 5, ids = {259, 233, 207, 246, 220}, numel = 167772160
.
.
.
File "/home/xxx/anaconda3/envs/profiler/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 813, in all_gather_coalesced
AssertionError: Not enough buffers 0 for swapping 1
self._ensure_availability_of_partitioned_params(params)
File "/home/xxx/anaconda3/envs/profiler/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 999, in _ensure_availability_of_partitioned_params
swap_in_list[0].nvme_swapper.swap_in(swap_in_list, async_op=False)
File "/home/xxx/anaconda3/envs/profiler/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/partitioned_param_swapper.py", line 308, in swap_in
assert len(swap_in_paths) <= len(self.available_buffer_ids), f"Not enough buffers {len(self.available_buffer_ids)} for swapping {len(swap_in_paths)}"
AssertionError: Not enough buffers 0 for swapping 1
I don't get this error if the offload_param device is set to cpu instead of nvme. I am curious why this is happening and how to fix it. Also, this happens regardless of whether I include the aio params or remove them entirely. Please let me know.
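For reference, this is roughly what the working CPU variant of that block looks like (a sketch based on the standard ZeRO-3 CPU-offload example; only offload_param is shown, everything else unchanged):

"offload_param": {
    "device": "cpu",
    "pin_memory": true
}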
Thank you!
Hi @HeyangQin and @tjruwase, could you help me resolve this issue? It happens irrespective of training or inference, but only when offloading parameters to NVMe.
@srikanthmalla, how many GPUs are you running on? Also, please share the impact of the following ds_config adjustments (sketched below):
- Set stage3_max_reuse_distance to 0.
- Increase buffer_count in offload_param to 10, 15, and 20.
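For illustration, a sketch of those adjustments applied to the config above (buffer_count: 20 is just the largest of the suggested values; the remaining keys stay as they were):

"offload_param": {
    "device": "nvme",
    "nvme_path": "/local_nvme",
    "pin_memory": true,
    "buffer_count": 20,
    "buffer_size": 1e8,
    "max_in_cpu": 1e9
},
"stage3_max_reuse_distance": 0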
@srikanthmalla, did either of these changes help?
Hi @tjruwase, neither of them helped.
Thanks for the update. I was able to reproduce the problem. My initial look suggests that this is due to the optimization of prefetching and caching layer parameters to reduce offload overheads. The error message shows that the buffer_count: 5 setting of offload_param is eventually exceeded: we are unable to add param 258 to the offload cache because it is full, already containing params 259, 233, 207, 246, and 220.
I was able to work around this issue by disabling caching (i.e., "stage3_max_reuse_distance": 0). But since that did not work for you, I am concerned that I may have reproduced a different problem. Can you please try the following to help this investigation?
- Share how many GPUs you are using. I reproduced on 4x V100-16GB.
- Use this branch: https://github.com/microsoft/DeepSpeed/tree/olruwase/issue_3062. Please note that this branch includes offload debug prints that will increase log size.
- Run your original failing configuration and share the log.
- Run with disabled caching ("stage3_max_reuse_distance": 0) and disabled prefetching ("stage3_prefetch_bucket_size": 0) to see if that combination avoids the error (see the sketch after this list).
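A sketch of that last combination in the zero_optimization section (only the changed keys are shown; everything else stays as in the original config):

"zero_optimization": {
    "stage": 3,
    "stage3_prefetch_bucket_size": 0,
    "stage3_max_reuse_distance": 0
}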
@srikanthmalla, just to clarify, we recognize that there is an underlying issue that needs to be fixed. My requests above are just to help further understand this issue. Thanks!
Hi, I'm getting the same issue when using DeepSpeed 0.10.0 with Hugging Face transformers.
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/swap_tensor/partitioned_param_swapper.py", line 297, in swap_in
    assert len(swap_in_paths) <= len(
AssertionError: Not enough buffers 0 for swapping 1
Deepspeed config:
zero_optimization:
  stage: 3
  offload_optimizer:
    device: nvme
    nvme_path: /tmp/nvme_offoad
  offload_param:
    device: nvme
    nvme_path: /tmp/nvme_offoad