lm-evaluation-harness
[big-refactor] Accelerate launch FSDP Runtime Error
Hi when running accelerate launch with FSDP I run into the following error:
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: 'weight' must be 2-D
I am running eval on 2 GPUs; the error message is replicated on both. Typically one batch completes on one of the GPUs before erroring out.
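For context on what this error usually means: under FSDP each rank holds a flattened 1-D shard of the parameters, and if a forward pass runs against that shard instead of the gathered 2-D embedding matrix (e.g. because the module was not wrapped as expected), `torch.embedding` rejects it. A minimal sketch of the failure mode, outside of FSDP:

```python
import torch
import torch.nn.functional as F

# Normal case: the embedding weight is a 2-D [vocab_size, hidden_dim] matrix.
weight_2d = torch.randn(10, 4)
out = F.embedding(torch.tensor([1, 3]), weight_2d)
print(out.shape)  # torch.Size([2, 4])

# FSDP stores parameters as a flattened 1-D shard; calling forward against
# that flat view reproduces the exact error from the traceback above.
weight_1d = weight_2d.flatten()
try:
    F.embedding(torch.tensor([1, 3]), weight_1d)
except RuntimeError as e:
    print(e)  # 'weight' must be 2-D
```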
What is the exact command you are running?
Same issue on Nvidia L4 x 2
Command:
accelerate launch -m lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks arc_challenge --batch_size 1 --num_fewshot=25
Accelerate conf:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Error:
......
File "/opt/conda/envs/eval/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: 'weight' must be 2-D
When I change num_processes to 1, it works.
Thanks
Using SIZE_BASED_WRAP it works, though the memory allocated on each GPU is higher; is that normal?
I thought it was possible to use Llama 2 with TRANSFORMER_BASED_WRAP.
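For anyone wanting to try the same workaround: switching to a size-based policy means replacing the wrap-policy key and adding a parameter-count threshold in the accelerate config. A sketch of the relevant fragment (the 1e8 threshold below is an illustrative value, not one reported in this thread):

```yaml
fsdp_config:
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  # Modules with at least this many parameters get their own FSDP unit;
  # tune this value for your model and GPU memory.
  fsdp_min_num_params: 100000000
```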
Facing the same issue with 8x A6000.
I have this issue on the main branch (release 4.0) on 8x A100 40GB when trying to eval 70B models.
Trying SIZE_BASED_WRAP gets me another error:
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/lib/python3/dist-packages/transformers/models/llama/modeling_llama.py", line 107, in forward
return self.weight * hidden_states.to(input_dtype)
RuntimeError: The size of tensor a (0) must match the size of tensor b (8192) at non-singleton dimension 2
With no wrapping at all it just OOMs.
My config is almost identical to the one above, just with a different GPU count.
Same issue here as well with TRANSFORMER_BASED_WRAP: RuntimeError: 'weight' must be 2-D.
SIZE_BASED_WRAP seems to work, but then NCCL times out (30 minutes) on the last request batch; it hangs on some processing.
We now recommend using vLLM instead of FSDP for fast / big model generation where possible.
#1520 may also fix the NCCL timeouts due to a padding bug?
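For those switching off FSDP: assuming the vLLM extra is installed (e.g. `pip install lm_eval[vllm]`), an equivalent tensor-parallel run of the command from this thread might look something like the sketch below; exact `model_args` options depend on your lm-eval and vLLM versions.

```shell
# Illustrative only: vLLM handles multi-GPU sharding itself, so no
# `accelerate launch` is needed; tensor_parallel_size replaces num_processes.
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-2-7b-chat-hf,tensor_parallel_size=2,dtype=auto \
  --tasks arc_challenge \
  --num_fewshot 25 \
  --batch_size auto
```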