DeepSpeed
[BUG] ValueError: max() arg is an empty sequence using bf16 zero stage3
/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py:307 in <listcomp>

  304             max([
  305                 max(tensor.numel(),
  306                     tensor.ds_numel) for tensor in fp16_partitioned_g
❱ 307             ]) for fp16_partitioned_group in self.fp16_partitioned_gr
  308         ])
  309         print_rank_0(
  310             f'Largest partitioned param numel = {largest_partitioned_

ValueError: max() arg is an empty sequence
To Reproduce
Steps to reproduce the behavior: the error happened during fine-tuning of a flan-t5-11b model. The full error trace is in this gist: https://gist.github.com/sujithjoseph/c410514acfccc76974a8130a8afd2169
Here is the DeepSpeed config: https://gist.github.com/sujithjoseph/92bf27de6bba704b57c3b9eb7aa00365
ds_report output: https://gist.github.com/sujithjoseph/c725de5fb38bb3c20e4fb6fd55f63848
System info (please complete the following information):
- OS: Debian GNU/Linux 10 (buster)
- GPU count and types: 1 machine with 4 A100s (40 GB each)
- Python version: 3.7
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else? Accelerate + PEFT:
deepspeed_config:
  deepspeed_config_file: zero_stage3_offload_config.json
  zero3_init_flag: true
Additional context
I assume that bf16 configs and fp16 configs are interchangeable
"bf16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
}
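(Side note on that assumption: per the documented DeepSpeed config schema, the bf16 section only needs "enabled", while the loss-scaling keys belong under fp16, because bf16 training does not use loss scaling. Below is a minimal sketch of the two sections as a Python dict, not the poster's actual file.)

# Illustrative config fragment only (assumed shape, not the full config used here).
ds_config = {
    "bf16": {"enabled": True},            # bf16: no loss-scaling knobs needed
    # fp16 equivalent, if fp16 were used instead of bf16:
    # "fp16": {
    #     "enabled": True,
    #     "loss_scale": 0,                # 0 means dynamic loss scaling
    #     "loss_scale_window": 1000,
    #     "initial_scale_power": 16,
    #     "hysteresis": 2,
    #     "min_loss_scale": 1,
    # },
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 1,  # placeholder value
}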
The error also appears with fp16 instead of bf16 in the DeepSpeed config, and with zero3_init_flag: false in the Accelerate config as well.
With stage 2 and no offloading, I get a different error:
/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:323 in __init__

  320                 self.flatten_dense_tensors_aligned(
  321                     self.round_robin_bit16_groups[i],
  322                     self.nccl_start_alignment_factor *
❱ 323                     dist.get_world_size(group=self.real_dp_process_gr
  324                         torch.cuda.current_device()))
  325             see_memory_usage(f"After flattening and moving param grou
  326                              force=False)

/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:862 in flatten_dense_tensors_aligned

  859
  860     # create a flat tensor aligned at the alignment boundary
  861     def flatten_dense_tensors_aligned(self, tensor_list, alignment):
❱ 862         return self.flatten(align_dense_tensors(tensor_list, alignmen
  863
  864     ############### Independent Partition Gradient ##################
  865     def reduce_independent_p_g_buckets_and_remove_grads(self, param,
RuntimeError: torch.cat(): expected a non-empty list of Tensors
Could this be an issue with the dataset?
I was able to sort it out using the Accelerate + DeepSpeed config below. Now I'm dealing with an OOM issue, but I'm not sure why the previous DeepSpeed config didn't work.
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: true
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'bf16'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false
How can we estimate the number of GPUs needed (each with 40 GB) for flan-t5-11b with CPU param/optimizer offloading? The estimate is 0.49GB per GPU with offload_param=cpu, offload_optimizer=cpu, zero_init=1:
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 4 GPUs per node.
SW: Model with 11003M total params, 131M largest layer params.

 per CPU  | per GPU | Options
 276.70GB |  0.49GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
 276.70GB |  0.49GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
 245.95GB |  5.61GB | offload_param=none, offload_optimizer=cpu , zero_init=1
 245.95GB |  5.61GB | offload_param=none, offload_optimizer=cpu , zero_init=0
   2.94GB | 46.61GB | offload_param=none, offload_optimizer=none, zero_init=1
 245.95GB | 46.61GB | offload_param=none, offload_optimizer=none, zero_init=0
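(For reference, the table above is the output of DeepSpeed's ZeRO-3 memory estimator; the sketch below shows how such an estimate can be produced, assuming the documented estimator API and plugging in the parameter counts reported above.)

# Sketch only: DeepSpeed's "cold" ZeRO-3 estimator, fed with the counts shown
# above (11003M total params, 131M largest layer params, 1 node x 4 GPUs).
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_cold

estimate_zero3_model_states_mem_needs_all_cold(
    total_params=11_003 * 10**6,
    largest_layer_params=131 * 10**6,
    num_gpus_per_node=4,
    num_nodes=1,
)

Note that this estimator only covers params, optimizer states, and gradients; activation memory, which grows with batch size, is not included.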
With batch size 1, it works without OOM. How can we estimate the number of GPUs needed for a batch size of 4 or 8, without trial and error? With batch size 1, I see only 26903MiB used per GPU at most. With batch size 2, it works for some time (3-4 hours) with 8 40 GB GPUs, with almost all 40 GB utilized, and then goes OOM. How can I cap the GPU memory used?
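(On capping GPU memory: DeepSpeed itself does not expose a hard cap, but PyTorch can bound what the caching allocator may use per process; a hedged sketch follows. This only makes allocations beyond the cap fail earlier, it does not reduce the memory the model actually needs, and activation memory still grows roughly linearly with batch size.)

import torch

# Assumption-labelled sketch: limit this process to ~90% of each visible GPU.
for device_id in range(torch.cuda.device_count()):
    torch.cuda.set_per_process_memory_fraction(0.9, device=device_id)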
With the following deepspeed config
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
  bf16:
    enabled: true
and with torch.backends.cuda.matmul.allow_tf32 = True and torch.backends.cudnn.allow_tf32 = True set in the code, would DeepSpeed use tf32 or bf16?
@sujithjoseph, deepspeed should use bf16. Are you observing something different?
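(For context, a short sketch of why the two settings are independent; this describes standard PyTorch behavior and is not specific to this setup.)

import torch

# TF32 only changes how float32 matmuls/convolutions execute on Ampere GPUs;
# it does not change parameter or activation dtypes.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# bf16 training is enabled separately (here via the DeepSpeed "bf16" section),
# so the forward/backward math runs in bfloat16; the TF32 flags would only
# affect ops that still execute in float32.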
@tjruwase, it did work with bf16. The only question I have is: can I use max_memory to restrict the memory used by the model during fine-tuning, like the snippet below used for inference?
from accelerate import load_checkpoint_and_dispatch

max_memory = {0: "25GIB", "cpu": "120GB"}
model = load_checkpoint_and_dispatch(
    model, model_id, device_map="auto", max_memory=max_memory,
    no_split_module_classes=["T5Block"],
)
Got it. I don't have experience with those memory restriction flags, which seem to be Accelerate flags. I don't think those flags are hooked into deepspeed. Can you please pose this question on their forum? I think we can work with them to enable the desired feature.
@sujithjoseph I faced the same issues as you mentioned above; both the stage 2 and stage 3 errors were the same as yours. Did you find any workaround for this?
Ran into exact same error when running DeepSpeed on Ray. Following this thread.
Same error, RuntimeError: torch.cat(): expected a non-empty list of Tensors, during accelerate.prepare. How can this be solved?
@zhenlohuang, @shaowei-su, @SupetZYK, it seems that @sujithjoseph resolved the original issue with the following https://github.com/microsoft/DeepSpeed/issues/2820#issuecomment-1427255633
If the workaround does not work for you, please open a new issue and share details to help us repro. Thanks!
@tjruwase I was able to run DS + stage 3 + fp16 by disabling the optimizer section in the DS config, which I found negatively impacts model quality.
If I switch to DS + stage 2, then it's the same runtime error @SupetZYK posted above:
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1298, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1547, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 324, in __init__
self.flatten_dense_tensors_aligned(
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 867, in flatten_dense_tensors_aligned
return self.flatten(align_dense_tensors(tensor_list, alignment))
RuntimeError: torch.cat(): expected a non-empty list of Tensors
@shaowei-su and @SupetZYK, it seems you are both seeing a different error from the original posting. Can you please open a new issue and share details for repro? I will close this in the meantime. Thanks!
Same error here, any update?
FWIW, I got this error when I accidentally put my model in inference mode: my PEFT config had inference_mode: True.
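(One plausible explanation, offered as an assumption rather than a confirmed root cause: with inference_mode: True, PEFT freezes the adapter weights, so the optimizer sees no trainable parameters and DeepSpeed ends up trying to flatten/partition empty parameter groups. A minimal sketch of a trainable LoRA setup for a seq2seq model such as flan-t5; hyperparameters are illustrative only.)

from peft import LoraConfig, TaskType, get_peft_model

# Sketch only: inference_mode=False keeps the LoRA weights trainable.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)
# model = get_peft_model(base_model, lora_config)  # base_model loaded via transformers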
Same error when using loralib with ZeRO stage 2 & 3.
@bestpredicts and @Wesley-Jzy, are you able to provide repro steps?
Can people in this thread please downgrade to HF transformers 4.31.0 and try?