[Feature] Support validation
For some workloads, it is important to run validation on a held-out dataset every n iterations.
This seems reasonably straightforward to add to the training loop and training specs, while keeping it optional.
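For illustration, a minimal sketch of what an optional every-n-steps validation hook in a training loop could look like. All names here (`TrainConfig`, `run_training`, `eval_every_n_steps`) are hypothetical, not torchtitan APIs:

```python
# Hypothetical sketch: optional periodic validation in a generic training loop.
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class TrainConfig:
    max_steps: int = 10
    eval_every_n_steps: Optional[int] = None  # None disables validation entirely

def run_training(
    cfg: TrainConfig,
    train_step: Callable[[int], None],
    eval_step: Optional[Callable[[int], float]] = None,
) -> List[Tuple[int, float]]:
    """Run train_step every step; run eval_step every n steps if configured."""
    eval_history = []
    for step in range(1, cfg.max_steps + 1):
        train_step(step)
        if (
            cfg.eval_every_n_steps is not None
            and eval_step is not None
            and step % cfg.eval_every_n_steps == 0
        ):
            eval_history.append((step, eval_step(step)))
    return eval_history

# Usage with toy stand-ins: validation fires at steps 4 and 8.
cfg = TrainConfig(max_steps=10, eval_every_n_steps=4)
history = run_training(cfg, train_step=lambda s: None, eval_step=lambda s: s)
```

Keeping `eval_every_n_steps` as `None` by default is one way to make the feature opt-in without touching existing configs.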
Is there any plan to support this functionality in the near future?
Generally speaking, yes, we would love to support a generalized validation function. The idea is to add functions in train.py, e.g. eval_step() and eval(). However, this work may require more refactoring and might take some time.
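To make the split concrete, a hedged sketch of how an `eval_step()` / `eval()` pair could be factored (the model/dataloader interfaces below are toy stand-ins, not torchtitan's):

```python
# Hypothetical sketch of eval_step()/eval() helpers as discussed for train.py.
from typing import Any, Callable, Dict, Iterable

def eval_step(model: Callable, batch: Dict[str, Any], loss_fn: Callable) -> float:
    """Compute the loss on one validation batch, with no weight update."""
    prediction = model(batch["input"])
    return loss_fn(prediction, batch["target"])

def evaluate(model: Callable, val_loader: Iterable, loss_fn: Callable) -> float:
    """Average the eval_step() loss over the whole validation loader."""
    total, count = 0.0, 0
    for batch in val_loader:
        total += eval_step(model, batch, loss_fn)
        count += 1
    return total / max(count, 1)

# Usage with toy stand-ins for model, loss, and dataloader.
model = lambda x: 2 * x
loss_fn = lambda pred, tgt: abs(pred - tgt)
val_loader = [{"input": 1, "target": 2}, {"input": 2, "target": 5}]
avg = evaluate(model, val_loader, loss_fn)  # (0 + 1) / 2 = 0.5
```

Separating the per-batch step from the full-loader loop mirrors the existing train_step/training-loop split, which is what would make it reusable across models.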
For the Flux model specifically, we currently have a first version of validation (an eval function that generates images every few steps). The next step is to add numerical evaluation (computing the loss at a fixed noise level on a validation dataset, and generating images from prompts in the validation dataset). I'm working on this now and may support Flux validation soon.
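The point of fixing the noise level is that the metric becomes comparable across checkpoints. A toy, torch-free sketch of the idea (the interpolation, seed, and `denoiser` signature are all illustrative, not the Flux implementation):

```python
# Illustrative sketch: validation loss at a fixed noise level/timestep, using
# a fixed seed so every evaluation sees the same noise. Names are hypothetical.
import random

FIXED_SEED = 1234  # fixes the sampled noise across eval runs
FIXED_T = 0.5      # fixed noise level / timestep

def fixed_noise_val_loss(denoiser, val_samples):
    """MSE between predicted and true noise, at one fixed noise level."""
    rng = random.Random(FIXED_SEED)  # reseed so noise is identical every call
    total = 0.0
    for x in val_samples:
        noise = rng.gauss(0.0, 1.0)
        noisy = (1 - FIXED_T) * x + FIXED_T * noise  # toy noising scheme
        pred_noise = denoiser(noisy, FIXED_T)
        total += (pred_noise - noise) ** 2
    return total / len(val_samples)

# Two calls on the same data give the same loss, so checkpoints can be
# compared apples-to-apples.
loss_a = fixed_noise_val_loss(lambda x, t: 0.0, [1.0, 2.0])
loss_b = fixed_noise_val_loss(lambda x, t: 0.0, [1.0, 2.0])
```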
@wwwjn Do you think your implementation is generalized enough to put it to the core train.py?
Currently we only run eval on rank 0 (which does not generalize to running eval in parallel). I'm still implementing the eval functionality for the Flux model, and I will try to generalize it as soon as possible.
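For reference, generalizing beyond rank 0 usually means each rank evaluates its shard of the validation set and the partial sums are reduced across ranks (e.g. with torch.distributed.all_reduce). A minimal sketch with the reduction simulated in-process (no actual process group; all names are illustrative):

```python
# Sketch: per-rank partial eval + global reduction to get one average loss.
# In a real setup the two sums would be all-reduced across ranks; here the
# "ranks" are just lists and the reduction is a plain sum.
from typing import List, Tuple

def local_eval(shard: List[float]) -> Tuple[float, int]:
    """Per-rank partial result: (sum of per-sample losses, sample count)."""
    return sum(shard), len(shard)

def global_eval(shards_per_rank: List[List[float]]) -> float:
    """Simulated all_reduce(SUM) over every rank's partial result."""
    partials = [local_eval(shard) for shard in shards_per_rank]
    total = sum(loss_sum for loss_sum, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

# Three "ranks" with uneven shards still produce the correct global average.
avg_loss = global_eval([[1.0, 2.0], [3.0], [4.0, 5.0]])  # 15 / 5 = 3.0
```

Reducing the (sum, count) pair rather than per-rank averages is what keeps the result correct when shards are unevenly sized.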
IMO we should unify this work with the general eval functionality requested e.g. in https://github.com/pytorch/torchtitan/issues/883
There are two modes of eval: outside train.py and inside train.py. For the in-train.py version, we may add another torchtitan/components file, and maybe even another TrainSpec item if it needs to be customized per model. It is important that we figure out the right level of abstraction.
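One possible shape for that abstraction, sketched with hypothetical names (the field `build_validator_fn` and this `TrainSpec` layout are assumptions for illustration, not torchtitan's actual spec):

```python
# Hypothetical sketch: a pluggable validator referenced from the TrainSpec,
# so most models share a default while e.g. Flux can override it.
from dataclasses import dataclass
from typing import Callable, Optional

def default_validator(model, val_loader) -> float:
    """Generic average-loss validator used when a model registers nothing."""
    losses = [model(batch) for batch in val_loader]
    return sum(losses) / max(len(losses), 1)

@dataclass
class TrainSpec:
    name: str
    # None means "use the shared default"; a model-specific callable overrides it.
    build_validator_fn: Optional[Callable] = None

def get_validator(spec: TrainSpec) -> Callable:
    return spec.build_validator_fn or default_validator

# Usage: Flux registers a custom validator, llama3 falls back to the default.
flux_spec = TrainSpec(name="flux", build_validator_fn=lambda m, v: 0.0)
llama_spec = TrainSpec(name="llama3")
```

Keeping the override optional would let the in-train.py path land without forcing every existing TrainSpec to change.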
cc @fegin @wwwjn