STAM
Could you please share training hyper-parameters?
Hello,
This work is really inspiring, and thanks for sharing the code. Could you please also share the training hyper-parameters (e.g., learning rate, optimizer, warmup learning rate, warmup epochs)? I would really like to train the model myself to get a deeper understanding of it.
Thanks, Steve
Hi,
thanks for taking an interest in this work.
The training hyper-parameters for stam_16 are:
- batch size: 64
- optimizer: AdamW with weight decay 1e-3
- schedule: 100 epochs with cosine annealing and learning-rate warm-up over the first 10 epochs
- base learning rate: 1e-5
- model EMA enabled

For stam_64, same as above, except batch size 16 and learning rate 2.5e-6.
The models were trained on a single 8xV100 machine.
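For reference, here is a minimal PyTorch sketch of how that stam_16 recipe could be wired up. This is not the repo's actual training code; the `Linear` placeholder model, the EMA decay of 0.999, and the exact warm-up shape are assumptions made only for illustration.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

EPOCHS = 100
WARMUP_EPOCHS = 10
BASE_LR = 1e-5          # stam_16; use 2.5e-6 for stam_64
WEIGHT_DECAY = 1e-3
BATCH_SIZE = 64         # stam_16; 16 for stam_64 (used when building the DataLoader, not shown)
EMA_DECAY = 0.999       # assumed value; the reply does not state the EMA decay

model = torch.nn.Linear(8, 8)  # placeholder standing in for the stam_16 model
optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=WEIGHT_DECAY)

def lr_lambda(epoch: int) -> float:
    # Linear warm-up over the first 10 epochs, then cosine annealing toward zero.
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)

# Exponential moving average of the weights ("model EMA").
ema_model = torch.optim.swa_utils.AveragedModel(
    model,
    avg_fn=lambda ema_p, cur_p, num: EMA_DECAY * ema_p + (1.0 - EMA_DECAY) * cur_p,
)

for epoch in range(EPOCHS):
    # train_one_epoch(model, optimizer, loader)  # your training loop goes here
    ema_model.update_parameters(model)
    scheduler.step()
```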
Hope you find this useful.
Could you please share the training code? Thanks!