
Setting .train()/.eval() on an unwrapped DDP model after prepare() causes increased memory usage during training.

Open • alexanderswerdlow opened this issue 3 months ago • 0 comments

System Info

- `Accelerate` version: 0.28.0
- Platform: Linux-3.10.0-957.1.3.el7.x86_64-x86_64-with-glibc2.17
- Python version: 3.10.13
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.2.0+cu118 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 503.62 GB
- `Accelerate` default config:
	Not found

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

The following causes this issue:

from accelerate.utils import extract_model_from_parallel

# Unwrap the DDP-wrapped model returned by prepare(), then set per-submodule modes.
model_ = extract_model_from_parallel(self.model)
model_.first_child_model.eval()
model_.second_child_model.train()

Expected behavior

I am training some submodules and freezing others, so before prepare() I set eval()/train() on the appropriate submodules. Then, during evaluation, I set model.eval() on the whole model, and after evaluation I restore the submodules to the modes they had at the start of training (as in the reproduction snippet above).

For an unknown reason, this increases memory usage on the following training step. I only run evaluation after a complete forward/backward pass.

If I instead simply call model.train() on the wrapped model after evaluation, the issue does not occur. I also made sure that there is no direct memory leak from evaluation itself (both with the PyTorch memory profiler and by checking the CUDA max/current reserved memory).
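To make the workflow concrete, here is a minimal, self-contained sketch of the loop described above. The Parent module, layer sizes, optimizer, and random batch are placeholders invented for illustration (only the first_child_model/second_child_model names come from the reproduction snippet), and on a single process without DDP the memory increase may not reproduce; the sketch only shows where the mode switches happen.

import torch
import torch.nn as nn
from accelerate import Accelerator


class Parent(nn.Module):
    # Stand-in for the real model: one frozen submodule, one trainable submodule.
    def __init__(self):
        super().__init__()
        self.first_child_model = nn.Linear(16, 16)
        self.second_child_model = nn.Linear(16, 16)

    def forward(self, x):
        return self.second_child_model(self.first_child_model(x))


accelerator = Accelerator()
model = Parent()

# Before prepare(): freeze one submodule and train the other.
model.first_child_model.eval()
model.first_child_model.requires_grad_(False)
model.second_child_model.train()

optimizer = torch.optim.AdamW(model.second_child_model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

for step in range(100):
    batch = torch.randn(8, 16, device=accelerator.device)
    loss = model(batch).pow(2).mean()
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()

    if (step + 1) % 10 == 0:
        # Whole-model eval() for the evaluation pass.
        model.eval()
        with torch.no_grad():
            model(batch)

        # Restoring per-submodule modes on the unwrapped model (the reproduction
        # snippet above) is what triggers the extra memory use on the next step:
        #   unwrapped = extract_model_from_parallel(model)
        #   unwrapped.first_child_model.eval()
        #   unwrapped.second_child_model.train()

        # Workaround from the report: calling train() on the wrapped model
        # instead avoids the increased memory usage.
        model.train()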

alexanderswerdlow • Mar 15 '24 18:03