
Running accelerate test after setting up FSDP returns an error

[Open] AntreasAntoniou opened this issue 2 years ago • 3 comments

System Info

- `Accelerate` version: 0.19.0
- Platform: Linux-5.4.0-148-generic-x86_64-with-glibc2.35
- Python version: 3.10.11
- Numpy version: 1.24.3
- PyTorch version (GPU?): 2.0.1 (True)
- System RAM: 448.57 GB
- GPU type: NVIDIA A100-SXM4-80GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: FSDP
        - mixed_precision: bf16
        - use_cpu: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - fsdp_config: {'fsdp_auto_wrap_policy': 'SIZE_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_min_num_params': 100000000, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'FULL_STATE_DICT'}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [ ] My own task or dataset (give details below)

Reproduction

  1. pip install accelerate
  2. accelerate config -> choose FSDP, fully sharded [1], SIZE_BASED_WRAP, BACKWARD_PRE
  3. accelerate test
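
For reference, `accelerate config` saves these answers to a YAML file (by default at `~/.cache/huggingface/accelerate/default_config.yaml`). Reconstructed from the System Info section above, the file should look roughly like:

    compute_environment: LOCAL_MACHINE
    distributed_type: FSDP
    downcast_bf16: 'no'
    fsdp_config:
      fsdp_auto_wrap_policy: SIZE_BASED_WRAP
      fsdp_backward_prefetch_policy: BACKWARD_PRE
      fsdp_min_num_params: 100000000
      fsdp_offload_params: false
      fsdp_sharding_strategy: 1
      fsdp_state_dict_type: FULL_STATE_DICT
    machine_rank: 0
    main_training_function: main
    mixed_precision: bf16
    num_machines: 1
    num_processes: 4
    rdzv_backend: static
    same_network: true
    use_cpu: false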

Output:

RuntimeError: 'accelerate-launch /opt/conda/envs/main/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py' failed with returncode 1      
                                                                                                                                                               
The combined stderr from workers follows:                                                                                                                      
FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer                                     
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮                                                           
│ /opt/conda/envs/main/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py:4 │
│ 61 in <module>                                                                                   │
│                                                                                                  │
│   458                                                                                            │
│   459                                                                                            │
│   460 if __name__ == "__main__":                                                                 │
│ ❱ 461 │   main()                                                                                 │
│   462                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/main/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py:4 │
│ 52 in main                                                                                       │
│                                                                                                  │
│   449 │                                                                                          │
│   450 │   if state.local_process_index == 0:                                                     │
│   451 │   │   print("\n**Training integration test**")                                           │
│ ❱ 452 │   training_check()                                                                       │
│   453                                                                                            │
│   454                                                                                            │
│   455 def _mp_fn(index):                                                                         │
│                                                                                                  │
│ /opt/conda/envs/main/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py:3 │
│ 22 in training_check                                                                             │
│                                                                                                  │
│   319 │   │   │   optimizer.step()                                                               │
│   320 │                                                                                          │
│   321 │   model = accelerator.unwrap_model(model).cpu()                                          │
│ ❱ 322 │   assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU o   │
│   323 │   assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU o   │
│   324 │                                                                                          │
│   325 │   accelerator.print("Training yielded the same results on one CPU or distributed setup   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Float did not match BFloat16
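
As an aside, the `FSDP Warning` at the top of the stderr refers to the call order Accelerate recommends under FSDP: prepare the model first, so the optimizer is constructed over the FSDP-flattened parameters. A minimal sketch of that ordering (the toy model is a hypothetical stand-in; under an FSDP launch config, `prepare` wraps it with FSDP):

    import torch
    from accelerate import Accelerator

    accelerator = Accelerator()
    model = torch.nn.Linear(8, 8)  # hypothetical stand-in for a real model

    # Prepare (i.e. FSDP-wrap) the model first ...
    model = accelerator.prepare(model)
    # ... then build the optimizer over the now-flattened/sharded parameters,
    # and prepare it as well.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    optimizer = accelerator.prepare(optimizer)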

Expected behavior

The test completes successfully.

AntreasAntoniou · May 26 '23 08:05

cc @pacman100

sgugger · May 26 '23 12:05

Any updates?

AntreasAntoniou · Jun 06 '23 22:06

Looking into it, give me a couple of days.

pacman100 · Jun 07 '23 10:06

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Jul 01 '23 15:07

Any updates? @pacman100

AntreasAntoniou · Jul 03 '23 14:07

Hello, you are running the tests incorrectly.

First, you set `mixed_precision="bf16"`. The `training_check` function compares a mocked single-GPU training run (without `accelerator.prepare`), which stays in FP32, against the model trained after `accelerator.prepare`; hence the `Float did not match BFloat16` error you are observing. And even if you change the mixed precision to "no", the comparison still cannot work, because FSDP flattens the parameters, so they are no longer accessible as `model.a`.
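
A standalone sketch of both failure modes (assumes only PyTorch; the tensors are illustrative stand-ins for model parameters):

    import torch

    # 1) torch.allclose refuses to compare tensors of different dtypes; this is
    #    the "Float did not match BFloat16" in the traceback: the CPU reference
    #    model stays in float32 while the prepared model trained in bf16.
    fp32 = torch.randn(3)
    bf16 = fp32.to(torch.bfloat16)
    try:
        torch.allclose(fp32, bf16)
    except RuntimeError as err:
        print(err)  # -> Float did not match BFloat16

    # Casting to a common dtype makes the comparison legal (up to bf16 rounding):
    print(torch.allclose(fp32, bf16.float(), rtol=1e-2))  # True

    # 2) Even with mixed_precision="no", FSDP flattens each wrapped module's
    #    parameters into a single FlatParameter, so an attribute like `model.a`
    #    no longer points at a gathered, comparable tensor; the full weights
    #    would have to be materialized first (e.g. via FSDP.summon_full_params).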

pacman100 · Jul 04 '23 11:07

So how do I go about getting the test to pass?

AntreasAntoniou · Jul 04 '23 16:07

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Jul 29 '23 15:07