
Support Asynchronous Evaluation on Separate GPU in `Trainer`

AmitMY opened this issue 5 months ago

Feature request

Add support for asynchronous evaluation in transformers.Trainer, ideally enabling evaluation to run in parallel with training (potentially on a separate GPU) without blocking the training loop.

Ideally, to fully utilize the hardware, one GPU should be training while another is constantly evaluating. Evaluation (and perhaps saving the checkpoint to disk) might take a few minutes, but afterwards a new evaluation should start immediately with the latest model weights, in an infinite loop. A rough sketch of such a loop is shown below.
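
A minimal sketch of that continuous-eval loop, assuming the training process writes checkpoints in the usual `checkpoint-<step>` layout under `output_dir`; `evaluate_fn` and the model class here are placeholders for whatever the user's task needs:

```python
import os
import time

from transformers import AutoModelForSequenceClassification

def newest_checkpoint(output_dir):
    """Return the most recently written checkpoint-* directory, or None."""
    ckpts = [
        os.path.join(output_dir, d)
        for d in os.listdir(output_dir)
        if d.startswith("checkpoint-")
    ]
    return max(ckpts, key=os.path.getmtime) if ckpts else None

def eval_forever(output_dir, evaluate_fn, device="cuda:1"):
    last = None
    while True:  # infinite loop: always evaluate the newest weights
        ckpt = newest_checkpoint(output_dir)
        if ckpt is None or ckpt == last:
            time.sleep(10)  # nothing new yet; poll again
            continue
        model = AutoModelForSequenceClassification.from_pretrained(ckpt).to(device)
        print(ckpt, evaluate_fn(model))  # evaluate_fn: user-supplied eval on `device`
        last = ckpt
```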

Motivation

In my training scenarios, especially those with slow checkpointing (e.g. a network-mounted disk, NFS, or HDD) and multi-GPU machines with underutilized resources, the current blocking evaluation step can create a significant bottleneck: it halts the training loop and prevents full utilization of the hardware.


I do not want to patch this on top with workarounds such as (a sketch of the subprocess-style hack follows the list):

  • Forking Trainer manually and running evaluation with a subprocess
  • Spawning external watchdog scripts for async eval
  • Deep copying the model to avoid disk writes
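
For illustration, this is roughly what such an external patch looks like. It is only a sketch of the workaround, not a proposal: `TrainerCallback.on_save` is the real hook, while `eval_script.py` is a hypothetical user script that loads a checkpoint and evaluates it:

```python
import subprocess

from transformers import TrainerCallback

class AsyncEvalCallback(TrainerCallback):
    """Launch an eval subprocess whenever a checkpoint is saved (workaround, not a proposal)."""

    def on_save(self, args, state, control, **kwargs):
        ckpt = f"{args.output_dir}/checkpoint-{state.global_step}"
        # Fire and forget: the training loop is not blocked while eval runs elsewhere.
        subprocess.Popen(
            ["python", "eval_script.py", "--checkpoint", ckpt, "--device", "cuda:1"]
        )
```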

This seems like a natural fit for Trainer, similar to how save_strategy and logging_strategy are already modular.

Your contribution

Some ideas (not prescriptive; a usage sketch follows the list):

  • eval_async=True: runs evaluation in a forked process/thread
  • eval_device='cuda:1': specify a device for async eval
  • eval_strategy="async_steps": triggers parallel eval on step intervals
  • Provide a callback hook or scheduler for async checkpoint evaluation
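
Taken together, user code might look like the following. To be clear, none of eval_async, eval_device, or the "async_steps" strategy exist in TrainingArguments today; this only illustrates what the proposal could look like:

```python
from transformers import TrainingArguments

# Hypothetical: eval_async, eval_device, and "async_steps" are proposed, not real.
args = TrainingArguments(
    output_dir="out",
    eval_strategy="async_steps",  # proposed: trigger non-blocking eval on step intervals
    eval_steps=500,
    eval_async=True,              # proposed: run evaluation in a forked process/thread
    eval_device="cuda:1",         # proposed: dedicated device for async eval
)
```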

Implementation Ideas

  • Use torch.multiprocessing or concurrent.futures.ProcessPoolExecutor to fork an evaluation subprocess (a minimal sketch follows this list)
  • Snapshot the model state (via state_dict() or full checkpoint) and transfer it (memory, pipe, or fast serialization)
  • Respect user-specified eval_steps, metric_for_best_model, and load_best_model_at_end behavior
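
A minimal sketch of the snapshot-and-transfer idea, assuming hypothetical build_model() and run_eval() helpers; the training process moves a CPU copy of the state_dict through a queue, so the worker on the second GPU never touches the training GPU:

```python
import torch.multiprocessing as mp

def eval_worker(queue, device="cuda:1"):
    model = build_model().to(device)  # hypothetical: rebuilds the same architecture
    while True:
        item = queue.get()
        if item is None:  # sentinel sent when training finishes
            break
        step, state = item
        model.load_state_dict(state)
        print(f"step {step}:", run_eval(model))  # hypothetical: eval loop on `device`

# In the training process (e.g. from a callback respecting eval_steps):
#   ctx = mp.get_context("spawn")
#   queue = ctx.Queue(maxsize=1)
#   ctx.Process(target=eval_worker, args=(queue,), daemon=True).start()
#   ...
#   snapshot = {k: v.detach().cpu() for k, v in model.state_dict().items()}
#   queue.put((trainer.state.global_step, snapshot))  # CPU tensors avoid CUDA IPC pitfalls
```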

AmitMY · Jun 15 '25