torchtitan [qwen3] Loss is extremely high when initialized with seed checkpoint

Bug description

The seed checkpoint was initialized on CPU with the following command, and the high loss is reproducible with the seed checkpoint it creates:

NGPU=1 CONFIG_FILE="./torchtitan/models/qwen3/train_configs/qwen3_0.6b.toml" ./run_train.sh --checkpoint.enable --checkpoint.create_seed_checkpoint --parallelism.data_parallel_replicate_degree 1 --parallelism.data_parallel_shard_degree 1 --parallelism.tensor_parallel_degree 1 --parallelism.pipeline_parallel_degree 1 --parallelism.context_parallel_degree 1 --parallelism.expert_parallel_degree 1

With that seed checkpoint, running a job with FSDP=2 and TP=2 gives an extremely high loss.
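
As a rough reference point for what "extremely high" means: a model that predicts a uniform distribution over its vocabulary has a cross-entropy loss of ln(V), where V is the vocabulary size. The sketch below computes that baseline. The vocabulary size of 151936 is an assumption taken from the published Qwen3 config, not from this report, and `uniform_baseline_loss` is a hypothetical helper name.

```python
import math


def uniform_baseline_loss(vocab_size: int) -> float:
    """Cross-entropy (in nats) of a uniform prediction over vocab_size tokens."""
    if vocab_size <= 0:
        raise ValueError("vocab_size must be positive")
    return math.log(vocab_size)


# Assumed vocabulary size for Qwen3; check the config of the model you train.
ASSUMED_QWEN3_VOCAB_SIZE = 151_936

if __name__ == "__main__":
    baseline = uniform_baseline_loss(ASSUMED_QWEN3_VOCAB_SIZE)
    print(f"uniform baseline loss: {baseline:.2f}")
```

A first-step loss well above this value is one hint that the problem lies in the weights or in how they are loaded, rather than in ordinary random initialization.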

Versions

torchtitan main branch 6bccdb6

Oct 15 '25 05:10 wwwjn

Hmm, I just noticed there is no model.safetensors.index.json file in the 0.6B repo, so torchtitan will not read the HF checkpoint at all.

But the problem also appears for the 1.7B model.

Oct 22 '25 16:10 rakkit

@rakkit

so torchtitan will not read the HF checkpoint at all.

Are you sure? @ankitageorge told me if there's no .index.json, everything (the only file) will be read.

Oct 23 '25 07:10 tianyu-l

Yes, at least from the DCP side, we can read checkpoints without an index.json file without any issues.

Oct 23 '25 21:10 ankitageorge
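
The index question discussed above can be checked directly on a local copy of the HF checkpoint. This is a minimal sketch using only the Python standard library. It does not call torchtitan or DCP, and `describe_hf_checkpoint` is a hypothetical helper name. It reports whether model.safetensors.index.json is present and which .safetensors files are on disk, which is what a reader would have to fall back to when no index exists.

```python
import json
from pathlib import Path

INDEX_NAME = "model.safetensors.index.json"


def describe_hf_checkpoint(ckpt_dir: str) -> dict:
    """Summarize the safetensors layout of a local HF checkpoint directory."""
    root = Path(ckpt_dir)
    # Every weight file present on disk, whether or not an index mentions it.
    shards_on_disk = sorted(p.name for p in root.glob("*.safetensors"))

    index_path = root / INDEX_NAME
    has_index = index_path.is_file()
    shards_in_index = []
    if has_index:
        # The HF index maps each parameter name to the shard file holding it.
        weight_map = json.loads(index_path.read_text()).get("weight_map", {})
        shards_in_index = sorted(set(weight_map.values()))

    return {
        "has_index": has_index,
        "shards_on_disk": shards_on_disk,
        "shards_in_index": shards_in_index,
    }
```

If `has_index` is false and `shards_on_disk` contains a single file, the checkpoint matches the single-file case described in this thread.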