[qwen3] Loss is extremely high when initialized with seed checkpoint
Bug description
The seed checkpoint was initialized on CPU with the following command, and the high loss is reproducible with the resulting checkpoint:
NGPU=1 CONFIG_FILE="./torchtitan/models/qwen3/train_configs/qwen3_0.6b.toml" ./run_train.sh --checkpoint.enable --checkpoint.create_seed_checkpoint --parallelism.data_parallel_replicate_degree 1 --parallelism.data_parallel_shard_degree 1 --parallelism.tensor_parallel_degree 1 --parallelism.pipeline_parallel_degree 1 --parallelism.context_parallel_degree 1 --parallelism.expert_parallel_degree 1
With the initialized checkpoint, running a job with FSDP=2 and TP=2 gives an extremely high loss.
Versions
torchtitan main branch 6bccdb6
Hmm, I just noticed there is no model.safetensors.index.json file in the 0.6b repo, so torchtitan will not read the HF checkpoint at all.
But the problem also appears for the 1.7b model.
@rakkit
so torchtitan will not read the HF checkpoint at all.
Are you sure? @ankitageorge told me if there's no .index.json, everything (the only file) will be read.
Yes, at least from the DCP side, we can read checkpoints without an index.json file without any issues.
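For anyone who wants to check locally which files a given HF checkpoint directory would contribute, here is a minimal sketch of the behavior described above. It is not torchtitan or DCP code: `list_safetensors_shards` is a hypothetical helper, and the fallback to "every .safetensors file in the directory" is an assumption based on the comments in this thread. The only format detail relied on is that a Hugging Face `model.safetensors.index.json` contains a `weight_map` from tensor name to shard file name.

```python
import json
from pathlib import Path


def list_safetensors_shards(checkpoint_dir):
    """Return the .safetensors files that hold the checkpoint's weights.

    Hypothetical helper for illustration only; not part of torchtitan or DCP.
    """
    root = Path(checkpoint_dir)
    index_path = root / "model.safetensors.index.json"
    if index_path.is_file():
        # Sharded checkpoint: "weight_map" maps each tensor name to its shard file.
        with index_path.open() as f:
            weight_map = json.load(f)["weight_map"]
        return sorted({root / name for name in weight_map.values()})
    # No index (assumed behavior): use every .safetensors file in the directory,
    # which for a small model is typically the single model.safetensors.
    return sorted(root.glob("*.safetensors"))
```

If this returns an empty list for the 0.6b directory, the weights are not where the loader expects them; if it returns the single file, the missing index alone should not explain the high loss.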