Differences in reproduced results on electra_small_owt
I used the same hyper-parameters as the paper, but with a 1:1 generator size (hidden size 256) as you stated in #39, to pretrain an ELECTRA-Small model on the OpenWebText dataset. I then fine-tuned this pretrained model with EXACTLY the same hyper-parameters as the paper, resulting in the following outcomes:
| | CoLA | SST | MRPC | STS | QQP | MNLI | QNLI | RTE | SQuAD 1.1 | SQuAD 2.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| Metrics | MCC | Acc | Acc | Spearman | Acc | Acc | Acc | Acc | EM/F1 | EM/F1 |
| ELECTRA-Small | 57.0 | 91.2 | 88.0 | 87.5 | 89.0 | 81.3 | 88.4 | 66.7 | 75.8/-- | 70.1/-- |
| ELECTRA-Small-OWT | 56.8 | 88.3 | 87.4 | 86.8 | 88.3 | 78.9 | 87.9 | 68.5 | -- | -- |
| My reproduction | 51.04 | 85.21 | 83.58 | 84.79 | 87.16 | 75.01 | 84.79 | 66.06 | 60.97/70.13 | 59.83/62.68 |
There is still a large gap between my reproduction and ELECTRA-Small-OWT on everything except RTE, and I am wondering whether you could share the SQuAD results for ELECTRA-Small-OWT to make the comparison easier.
In addition, I tried a 1:4 generator size (hidden size 64) and got competitive results. I am wondering why you chose to upload the 1:1 generator size as the officially released ELECTRA-Small model, which conflicts with both the paper and the experimental performance.
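For concreteness, this is roughly how I set up the two generator configurations. The field names follow my reading of configure_pretraining.py (where generator_hidden_size appears to be a fraction of the discriminator hidden size) and the --hparams flag of run_pretraining.py, so please treat them as assumptions rather than exact commands:

```python
# Sketch of the two generator configurations I compared. As I understand
# configure_pretraining.py, generator_hidden_size is a fraction of the
# discriminator hidden size (256 for the "small" model size).

hparams_1_to_1 = {
    "model_size": "small",
    "generator_hidden_size": 1.0,   # 1.0 * 256 = 256 (matches the released model)
}

hparams_1_to_4 = {
    "model_size": "small",
    "generator_hidden_size": 0.25,  # 0.25 * 256 = 64 (matches the paper)
}

# Passed to pretraining roughly as:
#   python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_small_owt \
#       --hparams '{"model_size": "small", "generator_hidden_size": 0.25}'
```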
Another issue is about the max sequence length. I quote this from your arXiv paper:
> we shortened the sequence length (from 512 to 128)
which is supported by the code at https://github.com/google-research/electra/blob/79111328070e491b287c307906701ebc61091eb2/configure_pretraining.py#L79 but conflicts with the shape (512, 128) of `electra/embeddings/position_embeddings` in the released electra_small model.
Does this mean that the open-source electra_small and the ELECTRA-Small-OWT in the QuickStart example differ not only in the pre-training corpus, but also in generator size and max sequence length?
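(For reference, this is roughly how I inspected the released checkpoint; the checkpoint prefix is a placeholder for wherever electra_small was unpacked.)

```python
# Sketch: list the position-embedding variable in the released checkpoint.
# "electra_small/electra_small" is a placeholder checkpoint prefix.
import tensorflow as tf

for name, shape in tf.train.list_variables("electra_small/electra_small"):
    if "position_embeddings" in name:
        print(name, shape)  # e.g. electra/embeddings/position_embeddings [512, 128]
```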
Hi! Using a smaller generator should work better; we used a larger generator for ELECTRA-Small++ (the released ELECTRA-Small model) by accident. This may have hurt its performance a bit, but I doubt by much, because the smaller generator mainly helps with efficiency and we trained ELECTRA-Small++ to convergence. What do you mean by "competitive results" when using a size-64 generator? It is not possible to run ELECTRA-Small-OWT on SQuAD because its max_seq_length is too small.
The different max sequence length shouldn't be an issue because the position embedding tensor is always [512, embedding_size] regardless of config.max_seq_length; its size is instead defined by max_position_embeddings in the BertConfig (which I agree is a bit confusing).
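Roughly, the shapes work out like this (a minimal numpy sketch of the slicing behaviour, not the actual modeling code):

```python
# Why the checkpoint variable is [512, embedding_size] even with shorter sequences:
# the table is allocated with max_position_embeddings from the BertConfig, and only
# the first seq_length rows are read at run time.
import numpy as np

max_position_embeddings = 512   # from BertConfig; fixes the variable's shape
embedding_size = 128            # ELECTRA-Small embedding size
seq_length = 128                # max_seq_length used during pretraining

# Variable stored in the checkpoint: [max_position_embeddings, embedding_size]
position_table = np.zeros((max_position_embeddings, embedding_size), dtype=np.float32)

# Only the first seq_length positions are sliced out and added to the inputs.
position_embeddings = position_table[:seq_length]
print(position_table.shape, position_embeddings.shape)  # (512, 128) (128, 128)
```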
Yes, the quickstart ELECTRA-Small-OWT model mimics ELECTRA-Small in the paper (but with a different dataset), whereas the released ELECTRA-Small++ model has a longer sequence length and a larger generator. We released ELECTRA-Small++ rather than ELECTRA-Small because it is better on downstream tasks, but we plan to release the original ELECTRA-Small model in the future.
Thanks for answering. From what I understand, the smaller generator is always better by design, but using and uploading a mis-sized model was an accident?
That's right. See Figure 3 in our paper for some results with different generator sizes.
@ZheyuYe Did you get the ELECTRA-Small-OWT result by pretraining from scratch yourself? What's the difference between ELECTRA-Small-OWT and your reproduction? Thanks :)
@amy-hyunji I re-pretrained the ELECTRA-Small model from scratch with the same training settings as ELECTRA-Small-OWT and fine-tuned it on the GLUE benchmark; only QQP and QNLI showed similar results, while the other seven datasets had gaps of 0.4-1.5% compared with the published results.