[WIP] Implement llama4 HF format to DCP converter

fegin opened this issue 8 months ago • 3 comments

Why do we need this? There have been many requests to make HF checkpoints work with TorchTitan. Workarounds for this already exist; however, the converted DCP checkpoints generally result in slow loading when used for subsequent training.

This PR addresses the problem by resharding the full weights into the exact sharding that will be used for training. In other words, we perform the resharding offline, up front, to avoid the long load time of online resharding later.
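To make the idea concrete, here is a minimal sketch of what offline resharding with DCP could look like. It is illustrative only: `fqn_map`, `convert`, the 1-D mesh, and the `Shard(0)` placement are assumptions, not the converter's actual API or layout.

```python
# Minimal offline-resharding sketch (assumes torch.distributed is initialized).
# Each tensor is scattered into the DTensor placement that training will use,
# then saved once with DCP, so the later training load is shard-aligned.
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

def convert(full_hf_state_dict: dict[str, torch.Tensor], fqn_map: dict[str, str]) -> None:
    mesh = init_device_mesh("cuda", (dist.get_world_size(),))
    dcp_state_dict = {}
    for hf_fqn, full_tensor in full_hf_state_dict.items():
        titan_fqn = fqn_map[hf_fqn]  # HF name -> torchtitan name (hypothetical mapping)
        dcp_state_dict[titan_fqn] = distribute_tensor(
            full_tensor, mesh, placements=[Shard(0)]  # must match the training sharding
        )
    dcp.save(dcp_state_dict, checkpoint_id="converted_dcp_ckpt")
```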

The converter also performs concurrent file loads using multiple trainers.
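A plausible shape for those concurrent loads (again a sketch; the directory layout and round-robin file assignment are assumptions, not necessarily what the PR implements):

```python
# Spread the HF safetensors files across ranks so each trainer
# reads a disjoint subset of files in parallel.
import glob
import torch.distributed as dist
from safetensors.torch import load_file

shard_files = sorted(glob.glob("llama4_hf_ckpt/*.safetensors"))
my_files = shard_files[dist.get_rank() :: dist.get_world_size()]  # round-robin split
partial_state_dict = {}
for path in my_files:
    partial_state_dict.update(load_file(path))  # this rank's slice of the full weights
```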

While this PR should perform reasonably well, the converter requires exactly the same machines/GPUs and sharding as the later training run. The main blocker for doing the conversion on CPU machines is that we are unable to run torchtitan on CPU-only machines.

An alternative is to use fewer machines for the conversion than for training. This works, but an additional resharding then happens during the actual training load, which may not perform well depending on the resharding pattern.
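For reference, that fallback is just DCP's load-time resharding seen from the training side (sketch; assumes `model` is the already-parallelized training model). DCP reshards automatically when the saved and requested layouts differ, at the cost of reads that are no longer shard-aligned:

```python
# Online resharding at load time: the checkpoint was saved with one sharding,
# but we load into the training-time DTensor layout; DCP reconciles the two.
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_model_state_dict

state_dict = get_model_state_dict(model)  # training-time (DTensor) layout
dcp.load(state_dict, checkpoint_id="converted_dcp_ckpt")
```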

Future extensions

  1. Directly read from Hugging Face without downloading the checkpoint first (will come in the next PR).
  2. While this converter is written for llama4, the logic can be generalized to other models with a few customized functions (e.g., FQN mapping); see the sketch after this list.
  3. Explore the possibility of performing the conversion on CPU machines and still getting the correct sharding scheme.
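As a sketch of the customization point in item 2, the per-model piece could be as small as a single FQN-mapping function. The renames below are hypothetical llama-style examples, not the actual llama4 mapping:

```python
# Hypothetical per-model hook: the converter core stays generic and each
# model only supplies its HF -> torchtitan FQN translation.
def llama4_fqn_map(hf_fqn: str) -> str:
    # Illustrative renames only; the real llama4 mapping is model-specific.
    return (
        hf_fqn
        .replace("model.embed_tokens.", "tok_embeddings.")
        .replace("model.layers.", "layers.")
        .replace("self_attn.q_proj.", "attention.wq.")
    )
```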

fegin · Apr 15 '25

lol, okay, do we want to keep the one in experiments or actually have the ones in the main scripts?

fegin · Apr 16 '25

I think we should move them to a local folder, as we are adding more models. I even think we should move the current "main scripts" into the llama3 folder, lol

tianyu-l · Apr 16 '25

Okay, since you already merged them, I'll repurpose this PR to fix the issues. But I'll keep the PR description, since I'd like to track the converter development in case we generalize it to other models.

fegin · Apr 16 '25