Support Asynchronous Evaluation on Separate GPU in `Trainer`
Feature request
Add support for asynchronous evaluation in `transformers.Trainer`, ideally allowing evaluation to run in parallel with training, potentially on a separate GPU, without blocking the training loop.
Ideally, to utilize the hardware fully, one GPU would be training while the other is constantly evaluating. An evaluation pass, plus saving the checkpoint to disk, might take a few minutes, but afterwards a new evaluation should start immediately with the latest model weights, in a continuous loop.
Motivation
In my training scenarios, especially with slow checkpointing (e.g. a network-mounted disk, NFS, or HDD) and multi-GPU machines with underutilized resources, the current blocking evaluation step creates a significant bottleneck: it halts the training loop and prevents full utilization of the hardware.
I would prefer not to patch this on top of `Trainer` myself via workarounds such as:
- Forking Trainer manually and running evaluation with a subprocess
- Spawning external watchdog scripts for async eval
- Deep copying the model to avoid disk writes
This seems like a natural fit for `Trainer`, similar to how `save_strategy` and `logging_strategy` are already modular.
Your contribution
Some ideas (not prescriptive):
- `eval_async=True`: runs evaluation in a forked process/thread
- `eval_device='cuda:1'`: specify a device for async eval
- `eval_strategy="async_steps"`: triggers parallel eval on step intervals
- Provide a callback hook or scheduler for async checkpoint evaluation
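A hypothetical sketch of how these proposed arguments could look in `TrainingArguments` (note that `eval_async`, `eval_device`, and the `"async_steps"` strategy do not exist today; they are only placeholders for the idea above):

```python
from transformers import TrainingArguments

# Hypothetical API: eval_async, eval_device, and eval_strategy="async_steps"
# do not exist yet; they only illustrate the proposal.
args = TrainingArguments(
    output_dir="out",
    eval_strategy="async_steps",  # proposed: trigger parallel eval at step intervals
    eval_steps=500,
    eval_async=True,              # proposed: run evaluation in a forked process/thread
    eval_device="cuda:1",         # proposed: GPU reserved for async evaluation
)
```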
Implementation Ideas
- Use `torch.multiprocessing` or `concurrent.futures.ProcessPoolExecutor` to fork an evaluation subprocess
- Snapshot the model state (via `state_dict()` or a full checkpoint) and transfer it (memory, pipe, or fast serialization)
- Respect user-specified `eval_steps`, `metric_for_best_model`, and `load_best_model_at_end` behavior
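To make the idea concrete, here is a minimal sketch, assuming a `TrainerCallback`-based approach: snapshot the weights on checkpoint save and evaluate them in a spawned process on a second GPU. `build_model` and `run_eval` are hypothetical helpers; a real implementation would reuse the `Trainer`'s eval dataset, metrics, and `load_best_model_at_end` logic.

```python
import torch.multiprocessing as mp
from transformers import TrainerCallback


def eval_worker(cpu_state_dict, step, device):
    # Runs in a separate process: rebuild the model, move it to the spare GPU, evaluate.
    model = build_model()                 # hypothetical helper that recreates the architecture
    model.load_state_dict(cpu_state_dict)
    model.to(device).eval()
    metrics = run_eval(model, device)     # hypothetical helper that loops over the eval set
    print(f"[async eval @ step {step}] {metrics}")


class AsyncEvalCallback(TrainerCallback):
    def __init__(self, eval_device="cuda:1"):
        self.eval_device = eval_device
        self.proc = None
        self.ctx = mp.get_context("spawn")  # "spawn" is required to use CUDA in a child process

    def on_save(self, args, state, control, model=None, **kwargs):
        # If the previous evaluation is still running, skip this one so training never blocks.
        if self.proc is not None and self.proc.is_alive():
            return
        # Snapshot the weights on CPU so the training GPU is released immediately.
        cpu_state = {k: v.detach().to("cpu", copy=True) for k, v in model.state_dict().items()}
        self.proc = self.ctx.Process(
            target=eval_worker, args=(cpu_state, state.global_step, self.eval_device)
        )
        self.proc.start()
```

Such a callback could then be registered with the existing API, e.g. `Trainer(..., callbacks=[AsyncEvalCallback()])`, leaving the training loop itself untouched.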