
Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

12 Adan issues

Is there a TensorFlow/Keras implementation of Adan? If there is no official version, do you know of any third-party implementation? Alternatively, roughly how many lines would you expect an implementation to have?...

Hi, very interesting work! The only problem I see is that your optimizer is slower than SGD/AdamW, which may discourage some people from using it. Do you plan adding an...

Hi, thank you very much for your brilliant work on Adan! Your paper says Adan should reach a lower loss (both train and test) than AdamW according...

Hello! I think I found a bug in the Adan optimizer that affects embedding tables. I implemented the Adan optimizer in TensorFlow 2; you can find the implementation [here](https://github.com/DenisVorotyntsev/Adan). I wanted...

Hello, have you investigated using Adan to train diffusion models? How should its learning rate be set? Can it be the same as the learning rate used with AdamW?

Could you please release the pre-trained ViT-S based on MAE?

Hi, Adan is an excellent optimizer; thank you for your work. However, when I recently tried instruction fine-tuning with Adan, I found that the loss curve looks great, yet downstream task performance (GSM-8K) falls short of expectations. With the same data processing and evaluation, AdamW scores about 9.63 while Adan reaches only about 5.08. AdamW hyperparameters: weight_decay 0.01, lr 2e-5. Adan hyperparameters: weight_decay 0.02; following the repo's suggestion I tried lr 2e-4 and 1e-4, and GSM-8K stays low in both cases. Both runs use an lr scheduler that warms up over the first 3% of steps to the peak and then decays to 0. AdamW training loss curve: (image) Adan training loss curve: (image) Code used: ```python from adan import Adan optimizer = Adan(model.parameters(), lr=args.lr, weight_decay=0.02, foreach=True, fused=True) ```...
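For reference, the hyperparameter comparison reported in this issue can be summarized as a plain config sketch (values are taken from the issue text above; this is illustrative only, not official tuning guidance from the Adan authors):

```python
# Hyperparameters as reported in this GSM-8K fine-tuning issue
# (illustrative sketch, not an official recommendation).
adamw_cfg = {"optimizer": "AdamW", "weight_decay": 0.01, "lr": 2e-5}
adan_cfg = {"optimizer": "Adan", "weight_decay": 0.02, "lr": 2e-4}  # lr=1e-4 was also tried

# Per the repo's suggestion, the Adan peak lr tried here is about 10x AdamW's.
ratio = adan_cfg["lr"] / adamw_cfg["lr"]
print(f"Adan/AdamW peak-lr ratio: {ratio:.1f}")
```

Both configurations pair with the same warmup-then-decay schedule (3% warmup to peak, then decay to 0), so the lr and weight_decay values are the only optimizer-side differences between the two runs.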

Dear authors: According to the `README.md` of this amazing project, the `weight_decay` param should be `0.02`, while in the configuration file attached in https://github.com/sail-sg/Adan/issues/32, the `WD` seems to be `0.05`...

> The following steps are modified from [Fairseq-Roberta](https://github.com/facebookresearch/fairseq/blob/main/examples/roberta/README.pretraining.md). For completeness, we list some key steps here.

I would like to ask why you modified the dataset settings. In the original...