accelerate
Setting .train()/.eval() on unwrapped DDP model after prepare() causes increased memory usage during training.
System Info
- `Accelerate` version: 0.28.0
- Platform: Linux-3.10.0-957.1.3.el7.x86_64-x86_64-with-glibc2.17
- Python version: 3.10.13
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.2.0+cu118 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 503.62 GB
- `Accelerate` default config:
Not found
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- [X] My own task or dataset (give details below)
Reproduction
The following causes this issue:

```python
from accelerate.utils import extract_model_from_parallel

model_ = extract_model_from_parallel(self.model)
model_.first_child_model.eval()
model_.second_child_model.train()
```
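To make the mode-switching pattern concrete, here is a minimal CPU-only sketch in plain PyTorch (no DDP or `prepare()`), using hypothetical child modules named after the ones in the snippet above:

```python
import torch
from torch import nn

# Hypothetical two-child model mirroring the report:
# one submodule is frozen, the other is trained.
class Parent(nn.Module):
    def __init__(self):
        super().__init__()
        self.first_child_model = nn.Linear(4, 4)   # frozen
        self.second_child_model = nn.Linear(4, 4)  # trained

model = Parent()

# Before prepare(): set the per-submodule modes for training.
model.first_child_model.eval()
model.second_child_model.train()

# During evaluation the whole model is switched to eval mode...
model.eval()

# ...and afterwards the per-submodule modes are restored,
# which is the step that triggers the extra memory usage.
model.first_child_model.eval()
model.second_child_model.train()

print(model.first_child_model.training)   # False
print(model.second_child_model.training)  # True
```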
Expected behavior
I am training some submodules and freezing others, so before prepare() I set eval()/train() on the appropriate submodules. During evaluation I set model.eval(), and after evaluation I restore the submodules to the modes they had at the start of training.
For an unknown reason, this increases memory usage on the following training step. I only run evaluation after a complete forward/backward pass.
If I simply set model.train() after my evaluation, the issue does not occur. I also verified that there is no direct memory leak from evaluation itself (both with the PyTorch memory profiler and by checking the current/max reserved CUDA memory).
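As an aside, since .eval()/.train() only changes the behavior of layers like dropout and batch norm, freezing submodules is often done instead with requires_grad_(False), which also prevents gradients (and the associated memory) from being allocated for the frozen part. A minimal sketch, reusing the same hypothetical child names:

```python
import torch
from torch import nn

class Parent(nn.Module):
    def __init__(self):
        super().__init__()
        self.first_child_model = nn.Linear(4, 4)
        self.second_child_model = nn.Linear(4, 4)

model = Parent()

# Freeze the first child: no gradients are computed or stored for it.
model.first_child_model.requires_grad_(False)

x = torch.randn(2, 4)
loss = model.second_child_model(model.first_child_model(x)).sum()
loss.backward()

# Only the trained child accumulates gradients.
print(model.first_child_model.weight.grad)             # None
print(model.second_child_model.weight.grad is not None)  # True
```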