spacegoing
> > > For fine-tuning, we offer the following suggestions: > > > > > > 1. Reduce the learning rate. We recommend a learning rate of 1e-5 to 1e-6...
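> > > For fine-tuning, we offer the following suggestions:
> > >
> > > 1. Reduce the learning rate. We recommend a learning rate of 1e-5 to 1e-6 for fine-tuning.
> > > 2. If you add other modules, load the pre-trained weights and run inference with zero initialization. This verifies whether the initialization and the code are correct.
> > > 3. Keep an eye on the loss curve at all times. If a loss spike appears, the model has very likely collapsed, so resume training from the most recent checkpoint. If training collapses frequently, consider increasing the batch size or lowering the learning rate further.
> > >...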
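The third suggestion in the quote (watch for loss spikes, then resume from the latest checkpoint) can be sketched as a small check in the training loop. This is only a rough sketch, not code from Open-Sora: the name `make_spike_detector` and the `window` and `factor` defaults are hypothetical and would need tuning per run.

```python
from collections import deque


def make_spike_detector(window=50, factor=3.0):
    """Return a function that flags a loss value as a spike.

    A loss counts as a spike when it exceeds `factor` times the mean of the
    last `window` non-spike losses. Both defaults are illustrative guesses.
    """
    history = deque(maxlen=window)

    def is_spike(loss):
        # Only judge once the window is full, so early noisy steps are ignored.
        if len(history) == window and loss > factor * (sum(history) / window):
            return True  # the spike is not recorded, so the baseline stays clean
        history.append(loss)
        return False

    return is_spike


if __name__ == "__main__":
    is_spike = make_spike_detector(window=5, factor=3.0)
    for step, loss in enumerate([1.0, 0.9, 0.8, 0.8, 0.7, 5.0, 0.7]):
        if is_spike(loss):
            print(f"step {step}: loss spike ({loss}), consider resuming from the last checkpoint")
```

In a real run, `is_spike(loss.item())` would be called once per step, and a hit would trigger a reload of the most recent checkpoint, optionally with a lower learning rate or a larger batch size as the quote suggests.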
> For training from scratch, this is normal. The point of confusion for me is that if you guys are zero-init from pre-training weights, then the results should be seen...
> > > For training from scratch, this is normal. The point of confusion for me is that if you guys are zero-init from pre-training weights, then the results should...
> > > > > For training from scratch, this is normal. The point of confusion for me is that if you guys are zero-init from pre-training weights, then the...
> > spacegoing > > Hi bro, I would like to ask what your successful experience is, and what hyperparameters did you use to train 480p? When I adjusted the...
@1KE-JI That's exactly what I meant. In this case we have different problems. Maybe reopen this issue / create a new one with more details. I'll help u debug.
@JThh @zhengzangw Thanks for your reply. To reproduce, u can use my docker image: spacegoing/opensora. It's basically the official Dockerfile with the versions of conflicting deps sorted out. the prompt I use...
> I do not encounter this problem. Try delete this line:
>
> https://github.com/hpcaitech/Open-Sora/blob/476b6dc79720e5d9ddfb3cd589680b2308871926/scripts/inference.py#L263

I'll try and let u know
I reproduced this timeout error with qwen3moe 30B, on 2 nodes with ep=2

```
set -x
# Paths
HF_MODEL_PATH=/root/myCodeLab/host/downloads/models/Qwen3-30B-A3B
DIST_CKPT_PATH=/root/myCodeLab/host/downloads/models/Qwen3-30B-A3B_DIST
TRAIN_FILE=/root/myCodeLab/host/downloads/datasets/dapo_data/dapo-math-17k.parquet
aime24_test_path=/root/myCodeLab/host/downloads/datasets/dapo_data/aime-2024.parquet
TEST_FILE="['$aime24_test_path']"
RUNTIME_ENV=${RUNTIME_ENV:-"${HOME}/myCodeLab/host/verl/my_scripts/my_runtime_env.yaml"}
python scripts/converter_hf_to_mcore.py --hf_model_path $HF_MODEL_PATH...
```