Support Asynchronous Evaluation on Separate GPU in `Trainer`
Feature request
Add support for asynchronous evaluation in `transformers.Trainer`, ideally allowing evaluation to run in parallel with training, potentially on a separate GPU, without blocking the training loop.
Ideally, to utilize the hardware fully, one GPU would be training while the other is constantly evaluating. An evaluation pass, plus saving the checkpoint to disk, might take a few minutes, but afterwards a new evaluation should start immediately with the latest model weights, in a continuous loop.
Motivation
In my training scenarios, especially with slow checkpointing (e.g. a network-mounted disk, NFS, or HDD) and multi-GPU machines with underutilized resources, the current blocking evaluation step creates a significant bottleneck: it halts the training loop and prevents full utilization of the hardware.
I would prefer not to patch this on top of `Trainer` myself via workarounds such as:
- Forking Trainer manually and running evaluation with a subprocess
- Spawning external watchdog scripts for async eval
- Deep copying the model to avoid disk writes
This seems like a natural fit for `Trainer`, similar to how `save_strategy` and `logging_strategy` are already modular.
Your contribution
Some ideas (not prescriptive):
- `eval_async=True`: runs evaluation in a forked process/thread
- `eval_device='cuda:1'`: specify a device for async eval
- `eval_strategy="async_steps"`: triggers parallel eval on step intervals
- Provide a callback hook or scheduler for async checkpoint evaluation
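A hypothetical sketch of how these proposed arguments could look in `TrainingArguments` (note that `eval_async`, `eval_device`, and the `"async_steps"` strategy do not exist today; they are only placeholders for the idea above):

```python
from transformers import TrainingArguments

# Hypothetical API: eval_async, eval_device, and eval_strategy="async_steps"
# do not exist yet; they only illustrate the proposal.
args = TrainingArguments(
    output_dir="out",
    eval_strategy="async_steps",  # proposed: trigger parallel eval at step intervals
    eval_steps=500,
    eval_async=True,              # proposed: run evaluation in a forked process/thread
    eval_device="cuda:1",         # proposed: GPU reserved for async evaluation
)
```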
Implementation Ideas
- Use `torch.multiprocessing` or `concurrent.futures.ProcessPoolExecutor` to fork an evaluation subprocess
- Snapshot the model state (via `state_dict()` or a full checkpoint) and transfer it (memory, pipe, or fast serialization)
- Respect user-specified `eval_steps`, `metric_for_best_model`, and `load_best_model_at_end` behavior
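To make the idea concrete, here is a minimal sketch, assuming a `TrainerCallback`-based approach: snapshot the weights on checkpoint save and evaluate them in a spawned process on a second GPU. `build_model` and `run_eval` are hypothetical helpers; a real implementation would reuse the `Trainer`'s eval dataset, metrics, and `load_best_model_at_end` logic.

```python
import torch.multiprocessing as mp
from transformers import TrainerCallback


def eval_worker(cpu_state_dict, step, device):
    # Runs in a separate process: rebuild the model, move it to the spare GPU, evaluate.
    model = build_model()                 # hypothetical helper that recreates the architecture
    model.load_state_dict(cpu_state_dict)
    model.to(device).eval()
    metrics = run_eval(model, device)     # hypothetical helper that loops over the eval set
    print(f"[async eval @ step {step}] {metrics}")


class AsyncEvalCallback(TrainerCallback):
    def __init__(self, eval_device="cuda:1"):
        self.eval_device = eval_device
        self.proc = None
        self.ctx = mp.get_context("spawn")  # "spawn" is required to use CUDA in a child process

    def on_save(self, args, state, control, model=None, **kwargs):
        # If the previous evaluation is still running, skip this one so training never blocks.
        if self.proc is not None and self.proc.is_alive():
            return
        # Snapshot the weights on CPU so the training GPU is released immediately.
        cpu_state = {k: v.detach().to("cpu", copy=True) for k, v in model.state_dict().items()}
        self.proc = self.ctx.Process(
            target=eval_worker, args=(cpu_state, state.global_step, self.eval_device)
        )
        self.proc.start()
```

Such a callback could then be registered with the existing API, e.g. `Trainer(..., callbacks=[AsyncEvalCallback()])`, leaving the training loop itself untouched.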