Running accelerate test after setting up FSDP returns an error
System Info
- `Accelerate` version: 0.19.0
- Platform: Linux-5.4.0-148-generic-x86_64-with-glibc2.35
- Python version: 3.10.11
- Numpy version: 1.24.3
- PyTorch version (GPU?): 2.0.1 (True)
- System RAM: 448.57 GB
- GPU type: NVIDIA A100-SXM4-80GB
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: FSDP
- mixed_precision: bf16
- use_cpu: False
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- fsdp_config: {'fsdp_auto_wrap_policy': 'SIZE_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_min_num_params': 100000000, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'FULL_STATE_DICT'}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- [ ] My own task or dataset (give details below)
Reproduction
- pip install accelerate
- accelerate config -> choose FSDP, fully sharded [1], SIZE_BASED_WRAP, BACKWARD_PRE (the resulting config file is sketched below)
- accelerate test
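For reference, these choices should produce a config file roughly like the following, reconstructed from the config dump above (the default file location is an assumption and may differ on your machine):

```yaml
# ~/.cache/huggingface/accelerate/default_config.yaml (assumed default location)
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
num_machines: 1
num_processes: 4
fsdp_config:
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_min_num_params: 100000000
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
```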
Output:
RuntimeError: 'accelerate-launch /opt/conda/envs/main/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py' failed with returncode 1
The combined stderr from workers follows:
FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /opt/conda/envs/main/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py:4 │
│ 61 in <module> │
│ │
│ 458 │
│ 459 │
│ 460 if __name__ == "__main__": │
│ ❱ 461 │ main() │
│ 462 │
│ │
│ /opt/conda/envs/main/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py:4 │
│ 52 in main │
│ │
│ 449 │ │
│ 450 │ if state.local_process_index == 0: │
│ 451 │ │ print("\n**Training integration test**") │
│ ❱ 452 │ training_check() │
│ 453 │
│ 454 │
│ 455 def _mp_fn(index): │
│ │
│ /opt/conda/envs/main/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py:3 │
│ 22 in training_check │
│ │
│ 319 │ │ │ optimizer.step() │
│ 320 │ │
│ 321 │ model = accelerator.unwrap_model(model).cpu() │
│ ❱ 322 │ assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU o │
│ 323 │ assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU o │
│ 324 │ │
│ 325 │ accelerator.print("Training yielded the same results on one CPU or distributed setup │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Float did not match BFloat16
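As an aside, the `FSDP Warning` line above refers to the recommended call order when using FSDP with Accelerate: prepare the model first, then construct the optimizer from the wrapped model's parameters. A minimal sketch of that ordering (the model and hyperparameters here are hypothetical stand-ins):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the FSDP settings from `accelerate config`

model = torch.nn.Linear(8, 8)       # hypothetical stand-in model
model = accelerator.prepare(model)  # wrap with FSDP first...

# ...then build the optimizer on the flattened FSDP parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
optimizer = accelerator.prepare(optimizer)
```

(Run under `accelerate launch` for the FSDP wrapping to actually take effect.)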
Expected behavior
The test completes successfully.
cc @pacman100
any updates?
Looking into it, give me a couple days
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Any updates? @pacman100
Hello, you are running the tests incorrectly.
First, you set `mixed_precision="bf16"`. The `training_check` function compares a mocked single-process training run (done without `accelerator.prepare`, so in FP32) against the model trained after `accelerator.prepare`, whose parameters are in BF16; hence the `Float did not match BFloat16` error you are observing. Even if you change the mixed precision to "no", the comparison still fails, because FSDP flattens the parameters, so they are no longer accessible as `model.a`.
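For illustration, the dtype mismatch alone reproduces the same error outside of Accelerate (on PyTorch 2.0.1; the exact message text may vary across versions):

```python
import torch

baseline = torch.ones(3)                        # FP32, like the un-prepared reference model
prepared = torch.ones(3, dtype=torch.bfloat16)  # BF16, like the params after bf16 mixed-precision training
torch.allclose(baseline, prepared)
# RuntimeError: Float did not match BFloat16
```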
So how do I go about getting the test to pass?