Differences in reproduced results on electra_small_owt
I used the same hyper-parameters as the paper, but with a 1:1 generator size (hidden size 256) as you stated in #39, to pretrain an ELECTRA-Small model on the OpenWebText dataset. I then fine-tuned this pretrained model with EXACTLY the same hyper-parameters as the paper, resulting in the following outcomes:
| | CoLA | SST | MRPC | STS | QQP | MNLI | QNLI | RTE | SQuAD 1.1 | SQuAD 2.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| Metrics | MCC | Acc | Acc | Spearman | Acc | Acc | Acc | Acc | EM/F1 | EM/F1 |
| ELECTRA-Small | 57.0 | 91.2 | 88.0 | 87.5 | 89.0 | 81.3 | 88.4 | 66.7 | 75.8/-- | 70.1/-- |
| ELECTRA-Small-OWT | 56.8 | 88.3 | 87.4 | 86.8 | 88.3 | 78.9 | 87.9 | 68.5 | -- | -- |
| My reproduction | 51.04 | 85.21 | 83.58 | 84.79 | 87.16 | 75.01 | 84.79 | 66.06 | 60.97/70.13 | 59.83/62.68 |
There is still a large gap between my reproduction and ELECTRA-Small-OWT on everything except RTE, and I am wondering whether you could share the SQuAD results for ELECTRA-Small-OWT to make the comparison easier.
In addition, I tried a 1:4 generator size (hidden size 64) and got competitive results. I am wondering why you chose to upload the 1:1 generator size as the officially released ELECTRA-Small model, which conflicts with both the paper and the experimental performance.
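For concreteness, this is roughly how I set up the two generator configurations. The field names follow my reading of configure_pretraining.py (where generator_hidden_size appears to be a fraction of the discriminator hidden size) and the --hparams flag of run_pretraining.py, so please treat them as assumptions rather than exact commands:

```python
# Sketch of the two generator configurations I compared. As I understand
# configure_pretraining.py, generator_hidden_size is a fraction of the
# discriminator hidden size (256 for the "small" model size).

hparams_1_to_1 = {
    "model_size": "small",
    "generator_hidden_size": 1.0,   # 1.0 * 256 = 256 (matches the released model)
}

hparams_1_to_4 = {
    "model_size": "small",
    "generator_hidden_size": 0.25,  # 0.25 * 256 = 64 (matches the paper)
}

# Passed to pretraining roughly as:
#   python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_small_owt \
#       --hparams '{"model_size": "small", "generator_hidden_size": 0.25}'
```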
Another issue is about the max sequence length. I quote this from your arXiv paper:
> we shortened the sequence length (from 512 to 128)
which is supported by the code at https://github.com/google-research/electra/blob/79111328070e491b287c307906701ebc61091eb2/configure_pretraining.py#L79 but conflicts with the shape (512, 128) of `electra/embeddings/position_embeddings` in the released electra_small model.
Does this mean that the open-source electra_small and the ELECTRA-Small-OWT in the QuickStart example differ not only in the pre-training corpus, but also in generator size and max sequence length?
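(For reference, this is roughly how I inspected the released checkpoint; the checkpoint prefix is a placeholder for wherever electra_small was unpacked.)

```python
# Sketch: list the position-embedding variable in the released checkpoint.
# "electra_small/electra_small" is a placeholder checkpoint prefix.
import tensorflow as tf

for name, shape in tf.train.list_variables("electra_small/electra_small"):
    if "position_embeddings" in name:
        print(name, shape)  # e.g. electra/embeddings/position_embeddings [512, 128]
```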
Hi! Using a smaller generator should work better; we used a larger generator for ELECTRA-Small++ (the released ELECTRA-Small model) by accident. This may have hurt its performance a bit, but I doubt by much, because the smaller generator mainly helps with efficiency and we trained ELECTRA-Small++ to convergence. What do you mean by "competitive results" when using a size-64 generator? It is not possible to run ELECTRA-Small-OWT on SQuAD because its max_seq_length is too small.
The different max sequence length shouldn't be an issue because the position embedding tensor is always [512, embedding_size] regardless of config.max_seq_length; its size is instead defined by max_position_embeddings in the BertConfig (which I agree is a bit confusing).
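Roughly, the shapes work out like this (a minimal numpy sketch of the slicing behaviour, not the actual modeling code):

```python
# Why the checkpoint variable is [512, embedding_size] even with shorter sequences:
# the table is allocated with max_position_embeddings from the BertConfig, and only
# the first seq_length rows are read at run time.
import numpy as np

max_position_embeddings = 512   # from BertConfig; fixes the variable's shape
embedding_size = 128            # ELECTRA-Small embedding size
seq_length = 128                # max_seq_length used during pretraining

# Variable stored in the checkpoint: [max_position_embeddings, embedding_size]
position_table = np.zeros((max_position_embeddings, embedding_size), dtype=np.float32)

# Only the first seq_length positions are sliced out and added to the inputs.
position_embeddings = position_table[:seq_length]
print(position_table.shape, position_embeddings.shape)  # (512, 128) (128, 128)
```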
Yes, the quickstart ELECTRA-Small-OWT model mimics ELECTRA-Small in the paper (but with a different dataset), whereas the released ELECTRA-Small++ model has a longer sequence length and a larger generator. We released ELECTRA-Small++ rather than ELECTRA-Small because it is better on downstream tasks, but we plan to release the original ELECTRA-Small model in the future.
Thanks for answering. From what I understand, the smaller generator is always better by design, but using and uploading a mis-sized model was an accident?
That's right. See Figure 3 in our paper for some results with different generator sizes.
@ZheyuYe Did you get the ELECTRA-Small-OWT result by pretraining from scratch yourself? What's the difference between ELECTRA-Small-OWT and your reproduction? Thanks :)
@amy-hyunji I re-pretrained the ELECTRA-Small model from scratch with the same training settings as ELECTRA-Small-OWT and fine-tuned it on the GLUE benchmark; only QQP and QNLI showed similar results, while the other seven datasets had gaps of 0.4-1.5% compared with the published results.