
Difference in reproduced results on electra_small_owt

Open zheyuye opened this issue 4 years ago • 5 comments

I used the same hyper-parameters as the paper, but with a 1:1 generator size (hidden size 256) as you described in #39, to pre-train an ELECTRA-small model on the OpenWebText dataset. I then fine-tuned this pre-trained model with the EXACT same hyper-parameters as the paper, resulting in the following outcomes:

|                   | CoLA  | SST   | MRPC  | STS      | QQP   | MNLI  | QNLI  | RTE   | SQuAD 1.1   | SQuAD 2.0   |
| ----------------- | ----- | ----- | ----- | -------- | ----- | ----- | ----- | ----- | ----------- | ----------- |
| Metrics           | MCC   | Acc   | Acc   | Spearman | Acc   | Acc   | Acc   | Acc   | EM/F1       | EM/F1       |
| ELECTRA-Small     | 57.0  | 91.2  | 88.0  | 87.5     | 89.0  | 81.3  | 88.4  | 66.7  | 75.8/--     | 70.1/--     |
| ELECTRA-Small-OWT | 56.8  | 88.3  | 87.4  | 86.8     | 88.3  | 78.9  | 87.9  | 68.5  | --          | --          |
| My reproduction   | 51.04 | 85.21 | 83.58 | 84.79    | 87.16 | 75.01 | 84.79 | 66.06 | 60.97/70.13 | 59.83/62.68 |

There is still a huge gap between my reproduction and ELECTRA-Small-OWT on everything except RTE, and I am wondering whether you could share the SQuAD results for ELECTRA-Small-OWT to facilitate comparison.

In addition, I tried a 1:4 generator size (hidden size 64) and got competitive results. I am wondering why you chose to upload the 1:1 generator size as the officially released ELECTRA-small model, which conflicts with both the paper and the experimental performance.
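For concreteness, here is a minimal sketch of how the two generator sizes can be expressed, assuming the `generator_hidden_size` hparam from `configure_pretraining.py` (a fraction of the discriminator hidden size) is what controls the ratio; the hparam names and command line below are illustrative:

```python
import json

# Assumed hparams from configure_pretraining.py: generator_hidden_size is a
# fraction of the discriminator hidden size, so with hidden_size=256 a value
# of 0.25 gives a size-64 generator (1:4) and 1.0 a size-256 generator (1:1).
hparams_1_to_4 = {"model_size": "small", "generator_hidden_size": 0.25}
hparams_1_to_1 = {"model_size": "small", "generator_hidden_size": 1.0}

# The dict would be passed as a JSON string, e.g.:
#   python3 run_pretraining.py --data-dir $DATA_DIR \
#       --model-name electra_small_owt --hparams '<json below>'
print(json.dumps(hparams_1_to_4))
```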

Another issue is about the max sequence length. I quote this from your arXiv paper:

we shortened the sequence length (from 512 to 128)

which is supported by the code at https://github.com/google-research/electra/blob/79111328070e491b287c307906701ebc61091eb2/configure_pretraining.py#L79 but conflicts with the (512, 128) shape of electra/embeddings/position_embeddings in the released electra_small model.
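For reference, here is a minimal sketch of how that shape can be checked, using the standard `tf.train.list_variables` utility (the checkpoint path is illustrative):

```python
import tensorflow as tf

# Illustrative path to the unpacked released electra_small checkpoint.
ckpt = "electra_small/electra_small"

# Print the shape of every position-embedding variable in the checkpoint;
# for the released model this shows [512, 128] even though pre-training
# reportedly used 128-token sequences.
for name, shape in tf.train.list_variables(ckpt):
    if "position_embeddings" in name:
        print(name, shape)
```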

Does this mean that the open-sourced electra_small and the ELECTRA-Small-OWT in the QuickStart example differ not only in the pre-training corpus, but also in generator size and max sequence length?

zheyuye avatar Jun 13 '20 17:06 zheyuye

Hi! Using a smaller generator should work better; we used a larger generator for ELECTRA-Small++ (the released ELECTRA-Small model) by accident. This may have hurt its performance a bit, but I doubt by much, because the smaller generator mainly helps with efficiency and we trained ELECTRA-Small++ to convergence. What do you mean by "competitive results" when using a size-64 generator? It is not possible to run ELECTRA-Small-OWT on SQuAD because its max_seq_length of 128 is too small.

The different max sequence length shouldn't be an issue because the position-embedding tensor is always [512, embedding_size] regardless of config.max_sequence_length; its size is instead defined by max_position_embeddings in the BertConfig (which I agree is a bit confusing).
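A minimal sketch of the distinction, assuming a BERT-style config JSON sits alongside the checkpoint (the file name and exact keys are illustrative):

```python
import json

# The pre-training flag (max_seq_length = 128) only bounds the input length;
# the position-embedding table is sized by max_position_embeddings instead.
with open("electra_small/config.json") as f:  # illustrative path
    config = json.load(f)

# Expected to print 512: the embedding table keeps 512 rows regardless of
# the 128-token sequences used during pre-training.
print(config["max_position_embeddings"])
```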

Yes, the quickstart ELECTRA-Small-OWT model mimics ELECTRA-Small from the paper (but with a different dataset), while the released ELECTRA-Small++ model has a longer sequence length and a larger generator. We released ELECTRA-Small++ rather than ELECTRA-Small because it is better on downstream tasks, but we plan to release the original ELECTRA-Small model in the future.

clarkkev avatar Jun 23 '20 00:06 clarkkev

Thanks for answering. From what I understand, a smaller generator is always better by design, but using and uploading a mis-sized model was an accident?

zheyuye avatar Jun 24 '20 03:06 zheyuye

That's right. See Figure 3 in our paper for some results with different generator sizes.

clarkkev avatar Jun 24 '20 18:06 clarkkev

@ZheyuYe Did you get the ELECTRA-Small-OWT results by pre-training from scratch yourself? What's the difference between ELECTRA-Small-OWT and your reproduction? Thanks :)

amy-hyunji avatar Jul 26 '20 02:07 amy-hyunji

@amy-hyunji I re-pretrained the ELECTRA-small model from scratch with the same training settings as ELECTRA-Small-OWT and fine-tuned it on the GLUE benchmark; only QQP and QNLI showed similar results, with the other seven datasets showing gaps of 0.4-1.5% compared with the published results.

zheyuye avatar Aug 04 '20 07:08 zheyuye