Renate
Make experiment evaluation a separate process
The current experimentation code in benchmarking runs evaluation in the same thread as the subsequent training jobs. This is a problem when using DDP: the first evaluation (Line 295) creates several processes (one per GPU), and each of them tries to spawn training processes, which causes DDP port clashes.
Describe the solution you'd like
The evaluation/testing should run in a separate process, à la run_training_job, so that the DDP processes it creates are gone before training starts and this issue does not occur.
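A minimal sketch of the idea, not Renate's actual implementation: the evaluation is launched in a fresh Python interpreter and its result is read back as JSON. The names run_evaluation_job and EVAL_SCRIPT, the config keys, and the placeholder metrics are all hypothetical; a real version would point at an evaluation script that builds the trainer and runs the test loop.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for an evaluation script. In practice this would be a
# file on disk that loads the model and runs the test loop (possibly with DDP).
EVAL_SCRIPT = """
import json, os, sys
config = json.loads(sys.argv[1])
# Placeholder for the real evaluation; reports the child's PID as evidence
# that the work happened outside the parent process.
result = {"pid": os.getpid(), "num_tasks": len(config["task_ids"])}
print(json.dumps(result))
"""


def run_evaluation_job(config: dict, timeout: float = 60.0) -> dict:
    """Run evaluation in a separate interpreter and return its JSON result."""
    completed = subprocess.run(
        [sys.executable, "-c", EVAL_SCRIPT, json.dumps(config)],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the evaluation fails
        timeout=timeout,
    )
    # By convention in this sketch, the last line of stdout carries the result.
    return json.loads(completed.stdout.strip().splitlines()[-1])


if __name__ == "__main__":
    print(run_evaluation_job({"task_ids": [0, 1, 2]}))
```

Because subprocess.run blocks until the child exits, any worker processes created during evaluation have terminated before the caller goes on to launch training.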
Additional info: See the discussion in the Lightning issue tracker: https://github.com/Lightning-AI/lightning/issues/2537