
Is the model structure exactly the same as GPT-2?

Open northfoxz opened this issue 4 years ago • 6 comments

Hi there, great work! I'm trying to port the Grover model into the huggingface/transformers repo. Is the model structure exactly the same as GPT-2's? Thanks for your reply!

northfoxz avatar Mar 04 '20 19:03 northfoxz

After reading the code, I found some structural differences between your implementation and OpenAI's, specifically in the normalization process:

OpenAI's implementation:

```python
def block(x, scope, *, past, hparams):
    with tf.variable_scope(scope):
        nx = x.shape[-1].value
        a, present = attn(norm(x, 'ln_1'), 'attn', nx, past=past, hparams=hparams)
        x = x + a
        m = mlp(norm(x, 'ln_2'), 'mlp', nx*4, hparams=hparams)
        x = x + m
        return x, present
```

- ln_1 of each block: norm applied to the input before attention
- ln_2 of each block: norm applied to the input before the fully-connected layer
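This pre-norm ordering can be summarized in a small framework-agnostic sketch (NumPy stand-ins for the real attention and MLP sublayers; this is an illustration, not OpenAI's code, and `layer_norm` here omits the learned gain and bias):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each vector over its last (hidden) dimension;
    # the learned gain/bias parameters are omitted for brevity
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_norm_block(x, attn, mlp):
    # GPT-2 ordering: norm the input, apply the sublayer, add the residual
    x = x + attn(layer_norm(x))  # ln_1 -> attention -> residual
    x = x + mlp(layer_norm(x))   # ln_2 -> MLP -> residual
    return x
```

Note that the residual stream itself is never normalized inside the block; the norms only feed the sublayers.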

Grover's implementation:

```python
def residual_mlp_layer(x_flat, intermediate_size, initializer_range=0.02, hidden_dropout_prob=0.1):
    batch_size_seq_length, hidden_size = get_shape_list(x_flat, expected_rank=2)
    x_norm = layer_norm(x_flat, name='mlp_ln0')

    intermediate_output = tf.layers.dense(
        x_norm,
        intermediate_size,
        activation=gelu,
        kernel_initializer=create_initializer(initializer_range),
        name='intermediate',
    )

    output_for_residual = tf.layers.dense(
        intermediate_output,
        hidden_size,
        name='output',
        kernel_initializer=create_initializer(initializer_range))
    output_for_residual = dropout(output_for_residual, hidden_dropout_prob)

    layer_output = layer_norm(x_flat + output_for_residual, name='mlp_ln1')
    return layer_output
```

Grover applies two layer normalizations in the fully-connected layer: mlp_ln0 before the dense layers and mlp_ln1 after the residual addition.

That makes the structure different from OpenAI's implementation, so I'm unable to transfer this model to Hugging Face's repo.
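The two orderings can be contrasted in a small NumPy sketch (hypothetical stand-ins, not the real code; `layer_norm` omits the learned gain and bias). The extra post-residual norm changes the residual stream itself, which suggests the difference is more than naming:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize over the last (hidden) dimension; learned gain/bias omitted
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def openai_mlp_sublayer(x, mlp):
    # pre-norm only: the residual stream is left unnormalized
    return x + mlp(layer_norm(x))       # ln_2 -> MLP -> residual

def grover_mlp_sublayer(x, mlp):
    h = mlp(layer_norm(x))              # mlp_ln0 -> MLP
    return layer_norm(x + h)            # residual, then the extra mlp_ln1
```

With the same weights, the two sublayers produce different outputs, since Grover's version re-normalizes the residual sum before passing it on.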

northfoxz avatar Mar 05 '20 15:03 northfoxz

Sorry for taking a while to get to this one! I believe it's actually the same, since IIRC there's an extra layer normalization somewhere else in the OpenAI code. That said, the layer normalizations might not match up in terms of naming...

rowanz avatar Mar 30 '20 16:03 rowanz
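For context, the extra normalization mentioned above is most likely GPT-2's final ln_f, which is applied once after the whole stack of blocks rather than inside any block. A minimal sketch (illustrative NumPy only; `layer_norm` omits the learned gain and bias):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize over the last (hidden) dimension; learned gain/bias omitted
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gpt2_stack(x, blocks):
    for block in blocks:
        x = block(x)        # each block is pre-norm internally
    return layer_norm(x)    # ln_f: one final norm after the whole stack
```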

Hi @NorthFoxz. Were you able to determine whether the difference is only in naming, or whether it is structural? If the difference is only in the names, maybe a Grover model can be converted to make it compatible with Hugging Face.

EibrielInv avatar Aug 08 '20 16:08 EibrielInv

@EibrielInv Well, it is structural, with slight differences; you will have to modify the GPT-2 model code a bit to make it work.

northfoxz avatar Aug 09 '20 05:08 northfoxz

@NorthFoxz ~ Did you ever attempt to port it across into the huggingface/transformers repo by adjusting the GPT-2 code?

RinaldoG avatar Jul 03 '21 05:07 RinaldoG

Have you made any progress on this one, @NorthFoxz? The only thing I have found is this: https://huggingface.co/gagan3012/distilbert-fakenews-model-grover, but nothing else seems to be out there.

dsvilarkovic avatar Apr 06 '22 17:04 dsvilarkovic