
7 adapter-bert issues

Dear authors, after reading the code, I find that the adapter parameters (`w1`, `w2`) are initialized with a small standard deviation by default. Does this guarantee the projection is...
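The small-std initialization is what makes each adapter start as a near-identity function: with tiny `w1` and `w2`, the residual branch contributes almost nothing, so the pretrained network's behavior is preserved at step 0. Below is a minimal NumPy sketch of that idea; the function and variable names are illustrative, not the repo's actual API, and a ReLU stands in for the paper's GELU.

```python
import numpy as np

def init_adapter(d_model, bottleneck, std=1e-3, seed=0):
    # Hypothetical helper mirroring the small-std init of the adapter
    # projections: w1 (down-project) and w2 (up-project).
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0.0, std, size=(d_model, bottleneck))
    w2 = rng.normal(0.0, std, size=(bottleneck, d_model))
    return w1, w2

def adapter(x, w1, w2):
    # Residual bottleneck: with tiny w1, w2 the output stays close to x,
    # i.e. the module starts out as a near-identity function.
    h = np.maximum(x @ w1, 0.0)  # ReLU here; the paper uses a GELU
    return x + h @ w2

x = np.ones((2, 768))
w1, w2 = init_adapter(768, 64)
print(np.max(np.abs(adapter(x, w1, w2) - x)))  # small (~1e-4): near-identity
```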

Hi, I am trying adapters on BERT-base and evaluating on GLUE. On smaller datasets like MRPC, RTE, and CoLA I see good results, but on larger GLUE datasets like...

Hi, I have a model in which normalization happens first and then there is an add operation. In the paper you discussed the post-norm case; could you tell me how...
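For context, the difference between the two residual orderings can be written out directly. The post-norm form below matches the setting discussed in the paper; the pre-norm form is one plausible adaptation (an assumption on my part, not something the paper prescribes), keeping the adapter on the sublayer output inside the residual branch:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, sublayer, adapter):
    # The paper's setting: sublayer -> adapter -> residual add -> LayerNorm.
    return layer_norm(x + adapter(sublayer(x)))

def pre_norm_block(x, sublayer, adapter):
    # Hypothetical pre-norm placement: normalize first, keep the adapter
    # on the sublayer output, and leave the skip connection untouched.
    return x + adapter(sublayer(layer_norm(x)))
```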

Hi, could you confirm, in the implementation of adapters, whether the layer_norm of the original model should be unfrozen, or only the layer_norm inside the adapter? How about the...
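One common reading of the paper is that, per task, the adapters, the layer-norm parameters, and the task head are trained while the rest of BERT stays frozen. A sketch of that selection is below; the parameter names and the name filter are assumptions, and whether the original model's layer norms belong in the trained set is exactly the open question above:

```python
def trainable_params(named_params):
    # Assumed convention: keep adapter weights, layer-norm parameters,
    # and the classifier head trainable; freeze everything else.
    # Names like "encoder/layer_3/adapter/w1" are hypothetical.
    keep = ("adapter", "layer_norm", "classifier")
    return {name: p for name, p in named_params.items()
            if any(k in name for k in keep)}
```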

Hi, thanks for your great work! I am unable to reproduce the reported results on the GLUE datasets. Could you provide the hyperparameters (training epochs, learning rate, ...) for the 9...

Congratulations on the great paper! One question: do you have additional processor classes? At the moment the code reads `processors = { "cola": ColaProcessor, "mnli": MnliProcessor, "mrpc": MrpcProcessor, }` and...
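Since the repo is built on BERT's `run_classifier.py`, a new task can be supported by writing a processor in the same pattern and registering it in that dict. Here is a hypothetical SST-2 processor as an illustration; it assumes the usual `DataProcessor`, `InputExample`, and `_read_tsv` names from `run_classifier.py` and the standard GLUE `train.tsv`/`dev.tsv` layout:

```python
import os

class SstProcessor(DataProcessor):
    """Hypothetical processor for SST-2, following the built-in pattern."""

    def get_train_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        examples = []
        for i, line in enumerate(lines):
            if i == 0:  # skip the TSV header row
                continue
            guid = "%s-%s" % (set_type, i)
            examples.append(InputExample(
                guid=guid, text_a=line[0], text_b=None, label=line[1]))
        return examples

processors["sst-2"] = SstProcessor  # register alongside the built-ins
```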

Thanks for the great work here. I have a question: when I read through the paper, I understand that training fewer parameters should bring a speed benefit; please correct...
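Worth separating two things here: adapters shrink the number of *trained* parameters (less optimizer state, less gradient memory, tiny per-task checkpoints), but the full network still runs in every forward pass, so per-step compute is roughly unchanged and even slightly higher. A back-of-the-envelope count, under assumed BERT-base shapes and two adapters per layer:

```python
# Assumed shapes: BERT-base hidden size, 12 layers, bottleneck of 64.
d_model, layers, bottleneck = 768, 12, 64

bert_base_params = 110e6  # ~110M, all updated in full fine-tuning

# Per adapter: down/up projection weights plus their biases;
# two adapters per transformer layer (after attention and after the FFN).
per_adapter = 2 * d_model * bottleneck + d_model + bottleneck
adapter_params = layers * 2 * per_adapter

print(f"adapter params: {adapter_params / 1e6:.2f}M "
      f"({adapter_params / bert_base_params:.1%} of BERT-base)")
# ~2.38M trained parameters, i.e. roughly 2% of the full model
```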