[qwen3] Loss is extremely high when initialized with seed checkpoint

Open wwwjn opened this issue 2 months ago • 3 comments

Bug description

The seed checkpoint was initialized on CPU with the following command, and the high loss is reproducible with a seed checkpoint created this way:

NGPU=1 CONFIG_FILE="./torchtitan/models/qwen3/train_configs/qwen3_0.6b.toml" ./run_train.sh --checkpoint.enable --checkpoint.create_seed_checkpoint --parallelism.data_parallel_replicate_degree 1 --parallelism.data_parallel_shard_degree 1 --parallelism.tensor_parallel_degree 1 --parallelism.pipeline_parallel_degree 1 --parallelism.context_parallel_degree 1 --parallelism.expert_parallel_degree 1

With this initialized checkpoint, running a job with FSDP=2 and TP=2 gives an extremely high loss.

Image

Versions

torchtitan main branch 6bccdb6

wwwjn avatar Oct 15 '25 05:10 wwwjn

Hmm, I just noticed there is no model.safetensors.index.json file in the 0.6b model's repo, so torchtitan will not read the HF checkpoint at all.

But the problem also appears for the 1.7b model.

rakkit avatar Oct 22 '25 16:10 rakkit

@rakkit

so torchtitan will not read the HF checkpoint at all.

Are you sure? @ankitageorge told me if there's no .index.json, everything (the only file) will be read.

tianyu-l avatar Oct 23 '25 07:10 tianyu-l

Yes, at least from the DCP side, we can read checkpoints without an index.json file without any issues.

ankitageorge avatar Oct 23 '25 21:10 ankitageorge