
Running accelerate test after setting up FSDP returns an error

[Open] AntreasAntoniou opened this issue 2 years ago • 3 comments

System Info

- `Accelerate` version: 0.19.0
- Platform: Linux-5.4.0-148-generic-x86_64-with-glibc2.35
- Python version: 3.10.11
- Numpy version: 1.24.3
- PyTorch version (GPU?): 2.0.1 (True)
- System RAM: 448.57 GB
- GPU type: NVIDIA A100-SXM4-80GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: FSDP
        - mixed_precision: bf16
        - use_cpu: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - fsdp_config: {'fsdp_auto_wrap_policy': 'SIZE_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_min_num_params': 100000000, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'FULL_STATE_DICT'}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [ ] My own task or dataset (give details below)

Reproduction

  1. pip install accelerate
  2. accelerate config -> choose FSDP, fully sharded [1], SIZE_BASED_WRAP, BACKWARD_PRE
  3. accelerate test
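
For reference, `accelerate config` saves these answers to a YAML file (by default at `~/.cache/huggingface/accelerate/default_config.yaml`). Reconstructed from the System Info section above, the file should look roughly like:

    compute_environment: LOCAL_MACHINE
    distributed_type: FSDP
    downcast_bf16: 'no'
    fsdp_config:
      fsdp_auto_wrap_policy: SIZE_BASED_WRAP
      fsdp_backward_prefetch_policy: BACKWARD_PRE
      fsdp_min_num_params: 100000000
      fsdp_offload_params: false
      fsdp_sharding_strategy: 1
      fsdp_state_dict_type: FULL_STATE_DICT
    machine_rank: 0
    main_training_function: main
    mixed_precision: bf16
    num_machines: 1
    num_processes: 4
    rdzv_backend: static
    same_network: true
    use_cpu: false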

Output:

RuntimeError: 'accelerate-launch /opt/conda/envs/main/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py' failed with returncode 1      
                                                                                                                                                               
The combined stderr from workers follows:                                                                                                                      
FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer                                     
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮                                                           
│ /opt/conda/envs/main/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py:4 │
│ 61 in <module>                                                                                   │
│                                                                                                  │
│   458                                                                                            │
│   459                                                                                            │
│   460 if __name__ == "__main__":                                                                 │
│ ❱ 461 │   main()                                                                                 │
│   462                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/main/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py:4 │
│ 52 in main                                                                                       │
│                                                                                                  │
│   449 │                                                                                          │
│   450 │   if state.local_process_index == 0:                                                     │
│   451 │   │   print("\n**Training integration test**")                                           │
│ ❱ 452 │   training_check()                                                                       │
│   453                                                                                            │
│   454                                                                                            │
│   455 def _mp_fn(index):                                                                         │
│                                                                                                  │
│ /opt/conda/envs/main/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py:3 │
│ 22 in training_check                                                                             │
│                                                                                                  │
│   319 │   │   │   optimizer.step()                                                               │
│   320 │                                                                                          │
│   321 │   model = accelerator.unwrap_model(model).cpu()                                          │
│ ❱ 322 │   assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU o   │
│   323 │   assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU o   │
│   324 │                                                                                          │
│   325 │   accelerator.print("Training yielded the same results on one CPU or distributed setup   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Float did not match BFloat16
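
As an aside, the `FSDP Warning` at the top of the stderr refers to the call order Accelerate recommends under FSDP: prepare the model first, so the optimizer is constructed over the FSDP-flattened parameters. A minimal sketch of that ordering (the toy model is a hypothetical stand-in; under an FSDP launch config, `prepare` wraps it with FSDP):

    import torch
    from accelerate import Accelerator

    accelerator = Accelerator()
    model = torch.nn.Linear(8, 8)  # hypothetical stand-in for a real model

    # Prepare (i.e. FSDP-wrap) the model first ...
    model = accelerator.prepare(model)
    # ... then build the optimizer over the now-flattened/sharded parameters,
    # and prepare it as well.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    optimizer = accelerator.prepare(optimizer)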

Expected behavior

The test completes successfully.

AntreasAntoniou · May 26 '23 08:05

cc @pacman100

sgugger · May 26 '23 12:05

Any updates?

AntreasAntoniou · Jun 06 '23 22:06

Looking into it, give me a couple of days.

pacman100 · Jun 07 '23 10:06

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Jul 01 '23 15:07

Any updates? @pacman100

AntreasAntoniou · Jul 03 '23 14:07

Hello, you are running the tests incorrectly.

First, you set `mixed_precision="bf16"`. The `training_check` function compares a mocked single-GPU training run (without `accelerator.prepare`), which stays in FP32, against the model trained after `accelerator.prepare`; hence the `Float did not match BFloat16` error you are observing. And even if you change the mixed precision to "no", the comparison still cannot work, because FSDP flattens the parameters, so they are no longer accessible as `model.a`.
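
A standalone sketch of both failure modes (assumes only PyTorch; the tensors are illustrative stand-ins for model parameters):

    import torch

    # 1) torch.allclose refuses to compare tensors of different dtypes; this is
    #    the "Float did not match BFloat16" in the traceback: the CPU reference
    #    model stays in float32 while the prepared model trained in bf16.
    fp32 = torch.randn(3)
    bf16 = fp32.to(torch.bfloat16)
    try:
        torch.allclose(fp32, bf16)
    except RuntimeError as err:
        print(err)  # -> Float did not match BFloat16

    # Casting to a common dtype makes the comparison legal (up to bf16 rounding):
    print(torch.allclose(fp32, bf16.float(), rtol=1e-2))  # True

    # 2) Even with mixed_precision="no", FSDP flattens each wrapped module's
    #    parameters into a single FlatParameter, so an attribute like `model.a`
    #    no longer points at a gathered, comparable tensor; the full weights
    #    would have to be materialized first (e.g. via FSDP.summon_full_params).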

pacman100 · Jul 04 '23 11:07

So how do I go about getting the test to pass?

AntreasAntoniou · Jul 04 '23 16:07

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Jul 29 '23 15:07