spacegoing comments

Results: 37 comments by spacegoing

After finetune the model, inference still get noise.

> > > For fine-tuning, we offer the following suggestions:
> > >
> > > 1. Reduce the learning rate. We recommend a learning rate of 1e-5 to 1e-6...

After finetune the model, inference still get noise.

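> > > For fine-tuning, we offer the following suggestions:
> > >
> > > 1. Reduce the learning rate. We recommend a learning rate of 1e-5 to 1e-6 for fine-tuning.
> > > 2. If you have added other modules, load the pretrained weights and run inference with zero initialization. This verifies whether the initialization or the code is correct.
> > > 3. Keep an eye on the loss curve at all times. If a loss spike appears, the model has very likely collapsed; resume training from the most recent checkpoint. If training collapses frequently, consider increasing the batch size or lowering the learning rate further.
> > >...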
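
The advice in the quoted comment to watch the loss curve for spikes can be automated in a simple way. The following is a minimal sketch, not code from Open-Sora: the function name `find_loss_spikes`, the trailing-window size, and the threshold are all assumptions chosen for illustration. It flags a step when its loss rises well above the mean of the preceding window, which is one possible trigger for rolling back to the most recent checkpoint.

```python
from collections import deque
from statistics import mean, stdev


def find_loss_spikes(losses, window=20, threshold=4.0):
    """Return the step indices whose loss exceeds the trailing mean
    by more than `threshold` standard deviations.

    `window` and `threshold` are illustrative defaults, not tuned values.
    """
    history = deque(maxlen=window)
    spikes = []
    for step, loss in enumerate(losses):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Guard against a zero standard deviation on a flat loss curve.
            if loss > mu + threshold * max(sigma, 1e-8):
                spikes.append(step)
                continue  # keep the spike out of the baseline window
        history.append(loss)
    return spikes


# Synthetic curve: a slow decrease, one spike at step 30, then recovery.
losses = [1.0 - 0.01 * i for i in range(30)] + [5.0] + [0.7] * 5
spike_steps = find_loss_spikes(losses)
```

In a real training loop the same check could run on a logged loss history; what to do on a spike (resume from a checkpoint, lower the learning rate, increase the batch size) follows the suggestions in the comment.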

After finetune the model, inference still get noise.

> For training from scratch, this is normal. The point of confusion for me is that if you guys are zero-init from pre-training weights, then the results should be seen...
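
The zero-init point in this thread can be illustrated with a toy example. This is a sketch under stated assumptions, not the Open-Sora model: the helper names `linear` and `block_with_adapter` and all weight values are made up. It shows that when a newly added residual branch is zero-initialized, the combined block returns exactly the output of the pretrained layer, so a fine-tuning run that starts this way should not begin from noise.

```python
def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def block_with_adapter(x, base_w, base_b, adapter_w, adapter_b):
    """Pretrained layer plus a newly added residual branch."""
    base_out = linear(base_w, base_b, x)
    extra = linear(adapter_w, adapter_b, x)
    return [b + e for b, e in zip(base_out, extra)]


# Stand-in "pretrained" weights (made-up numbers for illustration).
base_w = [[0.5, -1.0], [2.0, 0.25]]
base_b = [0.1, -0.2]

# Newly added module, zero-initialized.
adapter_w = [[0.0, 0.0], [0.0, 0.0]]
adapter_b = [0.0, 0.0]

x = [1.0, 2.0]
pretrained_out = linear(base_w, base_b, x)
combined_out = block_with_adapter(x, base_w, base_b, adapter_w, adapter_b)
```

If the two outputs differ before any training step, that points to a problem in weight loading or in the new module's initialization rather than in the hyperparameters.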

After finetune the model, inference still get noise.

> > > For training from scratch, this is normal. The point of confusion for me is that if you guys are zero-init from pre-training weights, then the results should...

After finetune the model, inference still get noise.

> > > > > For training from scratch, this is normal. The point of confusion for me is that if you guys are zero-init from pre-training weights, then the...

After finetune the model, inference still get noise.

> > spacegoing
> >
> > Hi bro, I would like to ask what your successful experience is, and what hyperparameters did you use to train 480p? When I adjusted the...

After finetune the model, inference still get noise.

@1KE-JI That's exactly what I meant. In this case we have different problems. Maybe reopen this issue / create a new one with more details. I'll help u debug.

OpenSora v1.2 only works `seed=42` on my local machine

@JThh @zhengzangw Thanks for your reply. To reproduce, you can use my Docker image: spacegoing/opensora. It's basically the official Dockerfile with the versions of conflicting deps sorted out. The prompt I use...

OpenSora v1.2 only works `seed=42` on my local machine

> I do not encounter this problem. Try delete this line:
>
> https://github.com/hpcaitech/Open-Sora/blob/476b6dc79720e5d9ddfb3cd589680b2308871926/scripts/inference.py#L263

I'll try and let u know

[Bug] dist_checkpointing stuck on communication with MoE models in distributed environment

I reproduced this timeout error with qwen3moe 30B, on 2 nodes with ep=2

![Image](https://github.com/user-attachments/assets/6c9f54d2-49cd-4ec5-9d00-ef773d715d2b)
![Image](https://github.com/user-attachments/assets/d895304a-fcbd-404f-b505-821dbb9c20cf)

```
set -x

# Paths
HF_MODEL_PATH=/root/myCodeLab/host/downloads/models/Qwen3-30B-A3B
DIST_CKPT_PATH=/root/myCodeLab/host/downloads/models/Qwen3-30B-A3B_DIST
TRAIN_FILE=/root/myCodeLab/host/downloads/datasets/dapo_data/dapo-math-17k.parquet
aime24_test_path=/root/myCodeLab/host/downloads/datasets/dapo_data/aime-2024.parquet
TEST_FILE="['$aime24_test_path']"
RUNTIME_ENV=${RUNTIME_ENV:-"${HOME}/myCodeLab/host/verl/my_scripts/my_runtime_env.yaml"}

python scripts/converter_hf_to_mcore.py --hf_model_path $HF_MODEL_PATH...
```