[Evaluation] Minimal support for downstream tasks
Hello and thanks for the great work! For now, torchtitan only evaluates the training loss. Do you have in mind providing minimal support for a downstream task, for example a general-knowledge score on MMLU? The aim would be to provide the minimum necessary to accomplish a downstream task, a bit like the minimal example with a HuggingFace dataset (c4 in this case), while keeping the native PyTorch spirit as much as possible. If so, can I participate by initiating a PR?
Hey @K-H-Ismail Thanks for proposing!
Yes, basic eval capability is on our roadmap, but we haven't had time to work on it. It would be great if you could initiate a PR on it. I'd be happy to work with you.
Do you think it makes sense to start with a (brief) overall design, so that we can get aligned on what it would look like?
Hello @tianyu-l,
Thank you very much! I'd be honoured to work with you on this task!
Yes, discussing the design will be helpful and could prevent extra work.
I was thinking about creating a torchtitan/tasks directory. Tasks usually rely on datasets, so as a first step we might restrict ourselves to HF datasets. A good feature would be for tasks to be evaluated during training (periodically, like checkpoints) and also in a standalone fashion when providing a pretrained model checkpoint. Evaluating multiple tasks at the same time could be a further enhancement.
Of course, these are just suggestions, and we will follow whatever layout you and the main maintainers consider best.
Sincerely, Ismail.
Hey, I've already implemented something WIP here: https://github.com/janEbert/torchtitan/commit/72c7b4e5521de7c336b51dca22fcd75f50aa8f25
The main part is the use of the _ScheduleForwardOnly pipeline schedule for evaluation; the rest is just using the CheckpointManager as a base for implementing an EvalManager, which is neither tested nor fully implemented yet.
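For readers skimming the thread, here is a rough sketch of what such an EvalManager could look like; the class and method names are hypothetical and this is not the code from the linked commit. It only shows the non-pipeline path (plain forward passes under torch.no_grad with an averaged loss); in a pipeline-parallel setup the forward pass would instead be dispatched through a forward-only schedule.

```python
# Minimal sketch of an EvalManager, assuming a standard (non-pipeline) setup;
# class/method names are hypothetical and differ from the linked commit.
import torch


class EvalManager:
    def __init__(self, model, loss_fn, eval_dataloader, eval_freq: int):
        self.model = model
        self.loss_fn = loss_fn
        self.eval_dataloader = eval_dataloader
        self.eval_freq = eval_freq

    @torch.no_grad()
    def evaluate(self) -> float:
        """Run forward-only passes over the eval set and return the mean loss."""
        self.model.eval()
        total_loss, num_batches = 0.0, 0
        for input_ids, labels in self.eval_dataloader:
            logits = self.model(input_ids)
            total_loss += self.loss_fn(logits, labels).item()
            num_batches += 1
        self.model.train()
        return total_loss / max(num_batches, 1)

    def maybe_evaluate(self, step: int) -> float | None:
        """Periodic trigger, analogous to saving checkpoints every N steps."""
        if self.eval_freq > 0 and step % self.eval_freq == 0:
            return self.evaluate()
        return None
```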
@K-H-Ismail Sorry for the delayed response!
I think your proposal makes sense in general. My main high-level question is:
Compared with existing solutions, e.g. lm_eval, what's the trade-off of writing evals ourselves in torchtitan? I saw that both torchtune and gpt-fast integrate with lm_eval. From my perspective, it's good not to depend on third-party libraries if we only need basic functionality, and it keeps everything clean.
Some suggestions on file structure (tentative):
> tasks could be evaluated during training (periodically like for checkpoints)
Maybe we can put an eval.py in components to start with, and have a minimal interface call in train.py (with an eval_freq config). We can use the existing metric_logger to log eval results to TensorBoard / WandB.
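To make that interface concrete, here is one possible shape for the hook; the EvalConfig fields, the run_eval callable, and the metric_logger.log signature are assumptions for illustration, not torchtitan's actual API.

```python
# Hypothetical sketch of an eval config section and the per-step hook in
# train.py; field names and the metric_logger.log signature are assumptions.
from dataclasses import dataclass


@dataclass
class EvalConfig:
    enable_eval: bool = False
    eval_freq: int = 1000                 # run eval every N training steps
    eval_dataset: str = "c4_validation"   # HF dataset/split used for eval


def maybe_log_eval(step: int, cfg: EvalConfig, run_eval, metric_logger) -> None:
    """Call once per training step; run_eval() should return the eval loss."""
    if cfg.enable_eval and step % cfg.eval_freq == 0:
        eval_loss = run_eval()
        # Reuse the existing metric logger so results land in TensorBoard / WandB.
        metric_logger.log({"eval/loss": eval_loss}, step=step)
```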
> a standalone fashion when providing a pretrained model checkpoint.
We can have a file / folder similar to https://github.com/pytorch/torchtitan/tree/main/scripts/generate
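A standalone entry point could then be little more than checkpoint loading plus a call into the same eval code. A minimal sketch, assuming the checkpoint was written with torch.distributed.checkpoint (DCP); load_and_eval and eval_fn are hypothetical names:

```python
# Hypothetical sketch of a standalone eval script, analogous to scripts/generate;
# assumes the checkpoint was saved with torch.distributed.checkpoint (DCP).
import torch
import torch.distributed.checkpoint as dcp


def load_and_eval(model, checkpoint_dir: str, eval_fn) -> float:
    """Load a DCP checkpoint into `model`, then run `eval_fn(model)`."""
    state_dict = {"model": model.state_dict()}
    dcp.load(state_dict, checkpoint_id=checkpoint_dir)  # loads in place
    model.load_state_dict(state_dict["model"])
    with torch.no_grad():
        return eval_fn(model)
```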
> Hey, I've already implemented something WIP here: janEbert@72c7b4e
> The main part is the use of the _ScheduleForwardOnly pipeline schedule for evaluation; the rest is just using the CheckpointManager as a base for implementing an EvalManager, which is neither tested nor fully implemented yet.
Hi @janEbert, do you have any plans to continue working on this eval part? We've gotten a lot of asks regarding enabling eval, and I think your implementation would help a lot.
Hi @K-H-Ismail, would you still want to work on this feature?
Hey @wwwjn, sadly, I don't plan to continue on a pure perplexity-based evaluation based on that commit.
Instead, I've implemented a (currently very messy and WIP) client-server architecture that could also be queried from a HuggingFace PretrainedModel wrapper, enabling not only perplexity but also other evaluations through the various evaluation harnesses like LM-Eval or LightEval. The server is a completely independent (and distributed) process, separate from the training one, so there's native asynchronicity. The implementation follows TorchTitan's principles and has no dependencies (except if you also wanted an optional PretrainedModel wrapper based on it). An additional benefit is that, due to re-using TorchTitan components, most upstream changes to the model should not require maintenance of the server code.
Sadly, there are still some open questions about how to handle certain things without modifying the model code too much. I started with a start_pos: int-based KV cache (like in the official Llama model repo), but this is not flexible enough, so now I wonder whether to simply start supporting arbitrary position_id inputs. Then there are smaller modifications, like saving the training config to a known location when training starts so it can be loaded later, supporting pad_id tokens more explicitly (unless NestedTensor works well and performantly with TorchTitan? Now that I think of it, I haven't tried that), etc.
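To illustrate the difference between the two interfaces (purely schematic, not the actual implementation): a start_pos-based cache assumes the new tokens are contiguous, whereas accepting position_ids lets tokens land at arbitrary positions, e.g. for padded or batched requests.

```python
# Schematic sketch contrasting the two KV-cache interfaces mentioned above;
# names and shapes are illustrative only.
import torch


class KVCache:
    def __init__(self, max_seq_len: int, n_heads: int, head_dim: int, batch: int = 1):
        self.k = torch.zeros(batch, max_seq_len, n_heads, head_dim)
        self.v = torch.zeros(batch, max_seq_len, n_heads, head_dim)

    def update_with_start_pos(self, k, v, start_pos: int):
        # Llama-repo style: the new tokens are assumed contiguous from start_pos.
        seq_len = k.size(1)
        self.k[:, start_pos:start_pos + seq_len] = k
        self.v[:, start_pos:start_pos + seq_len] = v

    def update_with_position_ids(self, k, v, position_ids: torch.Tensor):
        # More flexible: tokens can land at arbitrary (e.g. padded) positions.
        # position_ids has shape (batch, seq_len).
        batch_idx = torch.arange(k.size(0)).unsqueeze(-1)
        self.k[batch_idx, position_ids] = k
        self.v[batch_idx, position_ids] = v
```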
If such general changes were allowed, I could contribute the implementation. However, I'm currently extremely busy due to being constrained by the compute periods of shared supercomputers, so it would take some time (mid-July at the earliest) until I can clean it up into a nice and desirable PR.
Dear all,
In-training validation and evaluation is an important feature for many users, including myself.
So far I have been exploring/trying torchtune and its built-in Eleuther eval harness evaluation, as suggested earlier in this issue. This is a good solution for off-training eval, since torchtitan checkpoints could interface with torchtune.
However, in-training eval (at least for validation loss) is still very important for many researchers/practitioners.
I'm starting to grasp the library's components and the FSDP2 used here, but I'm not quite up to date on the other types of parallelism yet.
I think a working version that covers models up to 8B with FSDP and/or HSDP would already make a lot of users happy, including me.
I'll be somewhat freer after May 15 and could participate, depending on my current development skills.
@K-H-Ismail For in-training validation, would an implementation like the one in torchtune be enough for you? One could customize a dataset and run forward on it, with the same loss used in training. https://github.com/pytorch/torchtune/blob/0991f97ef13735ea6d458db22137a2796f8fbf92/recipes/full_finetune_distributed.py#L1034-L1039
This is in contrast to the general eval datasets/capabilities available in lm_eval.
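For reference, a minimal sketch of such an in-training validation pass, assuming an FSDP/HSDP-wrapped model and an initialized process group; it averages the loss across data-parallel ranks so every rank logs the same value (names are illustrative, not torchtune's or torchtitan's actual code).

```python
# Minimal sketch of a torchtune-style validation pass under data parallelism;
# assumes torch.distributed is initialized and `model` is FSDP/HSDP-wrapped.
import torch
import torch.distributed as dist


@torch.no_grad()
def validation_loss(model, val_dataloader, loss_fn, device) -> float:
    model.eval()
    total = torch.zeros(2, device=device)  # [sum of losses, number of batches]
    for input_ids, labels in val_dataloader:
        input_ids, labels = input_ids.to(device), labels.to(device)
        logits = model(input_ids)
        total[0] += loss_fn(logits, labels)
        total[1] += 1
    # Sum over all data-parallel ranks so every rank reports the same average.
    dist.all_reduce(total, op=dist.ReduceOp.SUM)
    model.train()
    return (total[0] / total[1]).item()
```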
@janEbert
> I've implemented a (currently very messy and WIP) client-server architecture that could also be queried from a HuggingFace PretrainedModel wrapper
For the client-server architecture, I'm assuming you are doing some sort of RL workload so that this has to happen during training, not offline using a script to convert checkpoints? If so, have you considered using a single-controller like Ray for this task?
> An additional benefit is that, due to re-using TorchTitan components, most upstream changes to the model should not require maintenance of the server code.
I'm curious what the queries look like. It sounds like you are changing the sharding from training using torchtitan to inference/eval using HF. Wouldn't it need some weight-updating mechanism that needs to know the details of the trainer, like in https://github.com/volcengine/verl/blob/main/verl/workers/sharding_manager/fsdp_vllm.py#L198?
@tianyu-l Absolutely!
@tianyu-l
> > I've implemented a (currently very messy and WIP) client-server architecture that could also be queried from a HuggingFace PretrainedModel wrapper
>
> For the client-server architecture, I'm assuming you are doing some sort of RL workload so that this has to happen during training, not offline using a script to convert checkpoints? If so, have you considered using a single-controller like Ray for this task?
It's actually simply a "lazy" solution for evaluation during an experimental phase. The TorchTitan Llama model, with all its various settings implemented not only inside the model but also outside of it, is quite a lot of work to adapt to the HuggingFace APIs. If additional features were implemented in TorchTitan, the HF conversion would have to be updated as well, increasing the future maintenance workload. I wanted something that automatically adapts to changes in the TorchTitan model, so I chose this solution while still actively developing the model itself. Additionally, TorchTitan has features which HF itself does not support in its API, and if I wanted to support those features properly, I'd have to work against the HF APIs.
In the end, when actually releasing, a proper HF conversion is still desirable. At that point I'd know what the final model looks like and wouldn't have to worry so much about supporting every single TorchTitan API.
I have not considered an external library because I really like the minimal-dependencies approach of TorchTitan. :)
> > An additional benefit is that, due to re-using TorchTitan components, most upstream changes to the model should not require maintenance of the server code.
>
> I'm curious what the queries look like. It sounds like you are changing the sharding from training using torchtitan to inference/eval using HF. Wouldn't it need some weight-updating mechanism that needs to know the details of the trainer, like in https://github.com/volcengine/verl/blob/main/verl/workers/sharding_manager/fsdp_vllm.py#L198?
The queries would simply come from an evaluation or chat framework, if that's what you mean by "queries". I deliberately do not change the sharding, so that I can simply re-use the training settings for evaluation; even an extremely large model should "just work" without additional configuration. As for the HF side, the HF-wrapped model runs in a single process, queries the distributed server, and collects the outputs. So to the external framework, the model is actually not distributed; that part is deliberately decoupled/hidden to avoid working with the HF APIs. :p
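To make this concrete, here is a purely illustrative sketch of what the single-process client side of such a setup could look like; the endpoint, payload format, and class name are all hypothetical and not the actual implementation. An eval harness or HF wrapper would sit on top of something like this, while the distributed TorchTitan server performs the real forward pass.

```python
# Purely illustrative client sketch; the URL, payload format, and class name
# are hypothetical assumptions, not the implementation discussed above.
import requests
import torch


class RemoteTitanClient:
    """Single-process client that forwards token IDs to a distributed
    evaluation server and returns the logits it sends back."""

    def __init__(self, server_url: str = "http://localhost:8000/forward"):
        self.server_url = server_url

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        payload = {"input_ids": input_ids.tolist()}
        response = requests.post(self.server_url, json=payload, timeout=600)
        response.raise_for_status()
        # The (hypothetical) server replies with logits as nested lists.
        return torch.tensor(response.json()["logits"])
```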
I also currently don't care about gradients in this server, but it should be simple to add them. I can't yet say how easy it would be to add proper training functionality to the server, but it shouldn't be too crazy once gradients are available.
@K-H-Ismail validation added in https://github.com/pytorch/torchtitan/pull/1362
@tianyu-l Thanks!