
[BUG] ValueError: max() arg is an empty sequence using bf16 zero stage3

Open sujithjoseph opened this issue 2 years ago • 14 comments

│ /opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py:307  │
│ in <listcomp>                                                                │
│                                                                              │
│    304 │   │   │   max([                                                     │
│    305 │   │   │   │   max(tensor.numel(),                                   │
│    306 │   │   │   │   │   tensor.ds_numel) for tensor in fp16_partitioned_g │
│ ❱  307 │   │   │   ]) for fp16_partitioned_group in self.fp16_partitioned_gr │
│    308 │   │   ])                                                            │
│    309 │   │   print_rank_0(                                                 │
│    310 │   │   │   f'Largest partitioned param numel = {largest_partitioned_ │
╰──────────────────────────────────────────────────────────────────────────────╯
ValueError: max() arg is an empty sequence

To Reproduce Steps to reproduce the behavior: The error happened while fine-tuning the flan-t5-11b model. Here is the entire error gist: https://gist.github.com/sujithjoseph/c410514acfccc76974a8130a8afd2169

Here is the deepspeed config https://gist.github.com/sujithjoseph/92bf27de6bba704b57c3b9eb7aa00365

ds_report output: https://gist.github.com/sujithjoseph/c725de5fb38bb3c20e4fb6fd55f63848

System info (please complete the following information):

  • OS: Debian GNU/Linux 10 (buster)
  • GPU count and types: 1 machine with 4 A100s (40 GB each)
  • Python version: 3.7

Launcher context (Are you launching your experiment with the deepspeed launcher, MPI, or something else?): Accelerate + PEFT

deepspeed_config:
  deepspeed_config_file: zero_stage3_offload_config.json
  zero3_init_flag: true

Additional context

I assumed that the bf16 and fp16 config sections are interchangeable:

    "bf16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    }
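
That assumption may be wrong, though. As far as I can tell from the DeepSpeed docs, the bf16 section only takes "enabled"; the loss-scaling keys (loss_scale, loss_scale_window, initial_scale_power, hysteresis, min_loss_scale) belong to fp16, since bf16 training does not use loss scaling. A minimal sketch, written here as a Python dict (whether the extra keys are rejected or silently ignored under bf16 is my assumption):

import json

# Minimal bf16 section -- only "enabled" is documented; the fp16-style
# loss-scaling keys above are assumed to be unnecessary for bf16.
ds_config = {
    "bf16": {"enabled": True},
    # ... keep the rest of the ZeRO stage 3 / offload settings unchanged ...
}
print(json.dumps(ds_config, indent=4))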

sujithjoseph avatar Feb 12 '23 23:02 sujithjoseph

The error also appears with fp16 instead of bf16 in the DeepSpeed config, and with zero3_init_flag: false in the Accelerate config as well.

sujithjoseph avatar Feb 13 '23 00:02 sujithjoseph

With Stage 2 and no offloading, I get a different error:

 /opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage_1_and_2. │
│ py:323 in __init__                                                           │
│                                                                              │
│    320 │   │   │   │   self.flatten_dense_tensors_aligned(                   │
│    321 │   │   │   │   │   self.round_robin_bit16_groups[i],                 │
│    322 │   │   │   │   │   self.nccl_start_alignment_factor *                │
│ ❱  323 │   │   │   │   │   dist.get_world_size(group=self.real_dp_process_gr │
│    324 │   │   │   │   │   │   torch.cuda.current_device()))                 │
│    325 │   │   │   see_memory_usage(f"After flattening and moving param grou │
│    326 │   │   │   │   │   │   │    force=False)                             │
│                                                                              │
│ /opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage_1_and_2. │
│ py:862 in flatten_dense_tensors_aligned                                      │
│                                                                              │
│    859 │                                                                     │
│    860 │   # create a flat tensor aligned at the alignment boundary          │
│    861 │   def flatten_dense_tensors_aligned(self, tensor_list, alignment):  │
│ ❱  862 │   │   return self.flatten(align_dense_tensors(tensor_list, alignmen │
│    863 │                                                                     │
│    864 │   ############### Independent Partition Gradient ################## │
│    865 │   def reduce_independent_p_g_buckets_and_remove_grads(self, param,  │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: torch.cat(): expected a non-empty list of Tensors

Could this be an issue with the dataset?

sujithjoseph avatar Feb 13 '23 01:02 sujithjoseph

I was able to sort it out using the Accelerate + DeepSpeed config below. Now I'm dealing with an OOM issue, but I'm not sure why the previous DeepSpeed config didn't work.

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: true
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'bf16'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false

sujithjoseph avatar Feb 13 '23 02:02 sujithjoseph

How can we estimate the # of GPUs needed (each with 40 GB) for flan-t5-11b with CPU param/optimizer offloading? The estimate is 0.49GB per GPU with offload_param=cpu, offload_optimizer=cpu, zero_init=1.

Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 4 GPUs per node.
SW: Model with 11003M total params, 131M largest layer params.

  per CPU  | per GPU | Options
  276.70GB |  0.49GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
  276.70GB |  0.49GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
  245.95GB |  5.61GB | offload_param=none, offload_optimizer=cpu , zero_init=1
  245.95GB |  5.61GB | offload_param=none, offload_optimizer=cpu , zero_init=0
    2.94GB | 46.61GB | offload_param=none, offload_optimizer=none, zero_init=1
  245.95GB | 46.61GB | offload_param=none, offload_optimizer=none, zero_init=0
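
This table is the output format of DeepSpeed's ZeRO-3 memory estimator; a minimal sketch of one way to produce it (assuming the estimator's usual module path, which may differ across DeepSpeed versions):

from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_cold

# flan-t5-11b: ~11003M total params, ~131M params in the largest layer,
# running on one node with 4 GPUs.
estimate_zero3_model_states_mem_needs_all_cold(
    total_params=11003 * 10**6,
    largest_layer_params=131 * 10**6,
    num_gpus_per_node=4,
    num_nodes=1,
)

Note that these estimates only cover model states (params, gradients, optimizer states), not activations.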

sujithjoseph avatar Feb 13 '23 03:02 sujithjoseph

With batch size 1, it works without OOM. How can we estimate the # of GPUs needed for a batch size of 4 or 8, without trial and error? With batch size 1, I see at most 26903MiB used per GPU. With batch size 2, it works for some time (3-4 hours) on 8 40 GB GPUs with almost all 40 GB utilized, and then goes OOM. How can I cap the GPU memory used?

sujithjoseph avatar Feb 13 '23 03:02 sujithjoseph

With the following deepspeed config

deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
  bf16:enabled: true

and with torch.backends.cuda.matmul.allow_tf32 = True and torch.backends.cudnn.allow_tf32 = True set in the code, would DeepSpeed use tf32 or bf16?
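
For reference, these are the exact lines in the training script (as I understand it, these switches only apply to float32 matmuls and convolutions on Ampere GPUs; whether they interact with DeepSpeed's bf16 path is exactly what I'm unsure about):

import torch

# TF32 only changes how float32 matmuls / cuDNN convolutions are executed on
# Ampere tensor cores; tensors that are already bfloat16 are not affected.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True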

sujithjoseph avatar Feb 13 '23 18:02 sujithjoseph

@sujithjoseph, deepspeed should use bf16. Are you observing something different?

tjruwase avatar Feb 14 '23 22:02 tjruwase

@tjruwase, it did work with bf16. The only question I have is: can I use max_memory to restrict the memory used by the model during fine-tuning, like the snippet below that I use for inference?

from accelerate import load_checkpoint_and_dispatch

# cap GPU 0 at 25 GiB and CPU at 120 GB when dispatching the checkpoint
max_memory = {0: "25GIB", "cpu": "120GB"}
model = load_checkpoint_and_dispatch(model, model_id, device_map="auto",
                                     max_memory=max_memory, no_split_module_classes=["T5Block"])

sujithjoseph avatar Feb 16 '23 00:02 sujithjoseph

Got it. I don't have experience with those memory restriction flags, which seem to be Accelerate flags. I don't think those flags are hooked into deepspeed. Can you please pose this question on their forum? I think we can work with them to enable the desired feature.

tjruwase avatar Feb 16 '23 02:02 tjruwase

@sujithjoseph I faced the same issues as you mentioned above; both the stage 2 and stage 3 errors were the same as yours. Did you find any workaround for this?

zhenlohuang avatar Mar 16 '23 07:03 zhenlohuang

Ran into exact same error when running DeepSpeed on Ray. Following this thread.

shaowei-su avatar Apr 01 '23 20:04 shaowei-su

Same error, RuntimeError: torch.cat(): expected a non-empty list of Tensors, during accelerate.prepare. How can it be solved?

SupetZYK avatar Apr 13 '23 15:04 SupetZYK

@zhenlohuang, @shaowei-su, @SupetZYK, it seems that @sujithjoseph resolved the original issue with the following https://github.com/microsoft/DeepSpeed/issues/2820#issuecomment-1427255633

If the workaround does not work for you, please open a new issue and share details to help us repro. Thanks!

tjruwase avatar Apr 13 '23 15:04 tjruwase

@tjruwase I was able to run DS + stage 3 + fp16 by disabling the optimizer section in the DS config, which I found negatively impacts the model quality.

If I switch to DS + stage 2, then it's the same runtime error @SupetZYK posted above.

  File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1298, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1547, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 324, in __init__
    self.flatten_dense_tensors_aligned(
  File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 867, in flatten_dense_tensors_aligned
    return self.flatten(align_dense_tensors(tensor_list, alignment))
RuntimeError: torch.cat(): expected a non-empty list of Tensors

shaowei-su avatar Apr 15 '23 21:04 shaowei-su

@shaowei-su and @SupetZYK, it seems you are both seeing a different error from the original posting. Can you please open a new issue and share details for repro? I will close this in the meantime. Thanks!

tjruwase avatar May 15 '23 14:05 tjruwase

Same error here, any update?

bestpredicts avatar May 23 '23 06:05 bestpredicts

FWIW, I got this error when I accidentally put my model in inference mode. My PEFT config had inference_mode: True.
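
For anyone hitting the same thing, a minimal sketch of the fix (the task type and LoRA hyperparameters are placeholders for my actual setup, and base_model stands for the already-loaded transformers model):

from peft import LoraConfig, TaskType, get_peft_model

# With inference_mode=True the LoRA weights are frozen, leaving no trainable
# parameters -- which is likely why DeepSpeed ends up partitioning empty groups.
peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    inference_mode=False,  # must be False for fine-tuning
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)
model = get_peft_model(base_model, peft_config)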

seongminp avatar Jun 08 '23 06:06 seongminp

Same error when using loralib with zero2 & 3

Wesley-Jzy avatar Jun 09 '23 08:06 Wesley-Jzy

@bestpredicts and @Wesley-Jzy, are you able to provide repro steps?

tjruwase avatar Sep 06 '23 13:09 tjruwase

Can people in this thread please downgrade to HF transformers 4.31.0 and try?

awan-10 avatar Sep 08 '23 21:09 awan-10