
Adding `embed_dim` argument to `TransformerEncoder` and `TransformerDecoder` layers

Open DavidLandup0 opened this issue 2 years ago • 4 comments

Is your feature request related to a problem? Please describe.

Currently, both TransformerEncoder and TransformerDecoder accept an intermediate_dim argument, allowing users to specify the dimensionality of the intermediate (feedforward) projection, but they inherit the output/embedding dimensionality from the input. For non-standard applications (where you don't first pass the input through a TokenAndPositionEmbedding), this is limiting - for instance, image captioning, where the image feature dimensionality may be higher than useful (say, 1280 bottleneck features). This can be avoided by adding a Dense layer to reduce the dimensionality before feeding into the encoder, but that's an extra layer which, IMO, shouldn't be necessary.

Describe the solution you'd like

Can I create a pull request with a change to these layers? Both would keep inheriting the dimensionality from the input by default (backward-compatible, even though it's still a new library), and gain an optional embed_dim argument that users can set for extra flexibility when they need it, or leave unset to keep the current behavior.

Describe alternatives you've considered

Additional context

DavidLandup0 avatar Jul 12 '22 18:07 DavidLandup0

@DavidLandup0 Thanks for filing the feature request!

I would like to understand your concern a bit better - are you saying the input might not have a feature dimension? In your example, you want a dense layer before calling the TransformerEncoder, but wouldn't the input still have a feature dimension? If possible, could you share a colab of your proposed workflow? Thanks!

chenmoneygithub avatar Jul 12 '22 23:07 chenmoneygithub

Sorry if I misrepresented it - the situation I was describing is a bit more specific. For image captioning, the input to the encoder is, in my case, the output of a ConvNet, which has a shape of (7, 7, 1280). It's reshaped into a "sequence" of shape (49, 1280) for the encoder to encode (a sketch of that reshape follows the snippet below). Naturally, the encoder projects it into a 1280-dim latent representation, because it inherits the dimensionality of the input data:

import tensorflow as tf
from tensorflow import keras
import keras_nlp

input_tensor = keras.Input(shape=[49, 1280])
# The encoder inherits the 1280-dim feature size from the input.
output = keras_nlp.layers.TransformerEncoder(intermediate_dim=256, num_heads=8)(input_tensor)
model = keras.Model(inputs=input_tensor, outputs=output)

input_data = tf.random.uniform(shape=[1, 49, 1280])
output = model(input_data)
print(output.shape)  # (1, 49, 1280)
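
For reference, here's a minimal sketch of the reshape mentioned above. The EfficientNetB0 backbone is just an illustrative assumption - any ConvNet with a (7, 7, 1280) feature map works the same way:

from tensorflow import keras

backbone = keras.applications.EfficientNetB0(include_top=False, input_shape=(224, 224, 3))
features = backbone.output  # (None, 7, 7, 1280) feature map
sequence = keras.layers.Reshape((49, 1280))(features)  # (None, 49, 1280) "sequence"
feature_extractor = keras.Model(inputs=backbone.input, outputs=sequence)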

Now, this can be circumvented via a Dense layer:

input_tensor = keras.Input(shape=[49, 1280])
dense = keras.layers.Dense(128)(input_tensor)
output = keras_nlp.layers.TransformerEncoder(intermediate_dim=256, num_heads=8)(dense)
model = keras.Model(inputs=input_tensor, outputs=output)

input_data = tf.random.uniform(shape=[1, 49, 1280])
output = model(input_data)
print(output.shape) # (1, 49, 128)

But this feels like a workaround. Something like this would be more flexible, IMO:

input_tensor = keras.Input(shape=[49, 1280])
# Feature request
output = keras_nlp.layers.TransformerEncoder(intermediate_dim=256, embed_dim=128, num_heads=8)(input_tensor)
model = keras.Model(inputs=input_tensor, outputs=output)

input_data = tf.random.uniform(shape=[1, 49, 1280])
output = model(input_data)
print(output.shape) # (1, 49, 128)

I'd love to create a pull request with this change if that's okay :)

DavidLandup0 avatar Jul 13 '22 08:07 DavidLandup0

@DavidLandup0 Sorry for the late reply! I got caught up with some other work.

Thanks for putting up the code example, it's very clean! We actually discussed this when we designed the TransformerEncoder interface, and found that people commonly keep the input and output sizes the same, and that the workaround is pretty straightforward. One design philosophy we follow is to avoid long argument lists, which can be confusing to users. So in this case, we probably won't add embed_dim, and will instead rely on an additional dense layer to handle the shape.

chenmoneygithub avatar Jul 18 '22 05:07 chenmoneygithub

I think the issue here is the residual connections. The inputs are summed with the MHA output, and that sum is in turn summed with the feedforward output. That effectively enforces the constraint that input size == MHA output size == feedforward output size.
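
To make that concrete, here's a minimal sketch of the residual structure in a standard encoder block (an illustrative pseudo-implementation, not the actual KerasNLP code):

import tensorflow as tf
from tensorflow import keras

def encoder_block(inputs, intermediate_dim=256, num_heads=8):
    feature_dim = inputs.shape[-1]
    # Self-attention sublayer; its output keeps the input's feature dim.
    attention_output = keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=feature_dim // num_heads
    )(inputs, inputs)
    # First residual add: shapes must match, so attention output dim == input dim.
    x = keras.layers.LayerNormalization()(inputs + attention_output)
    # Feedforward sublayer; the second Dense has to project back to feature_dim
    # so the second residual add works too.
    ff = keras.layers.Dense(intermediate_dim, activation="relu")(x)
    ff = keras.layers.Dense(feature_dim)(ff)
    return keras.layers.LayerNormalization()(x + ff)

inputs = keras.Input(shape=[49, 1280])
print(encoder_block(inputs).shape)  # (None, 49, 1280) - the 1280 can't change inside the block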

You could add a dense layer before or after an encoder/decoder block to change the dimensionality, but that would not be part of the standard transformer block as defined in "Attention Is All You Need". So, readability-wise, it's actually better to have the explicit Dense there in the code, making it clear that you are changing your feature dimension.

As a high-level note, we don't expect TransformerEncoder/TransformerDecoder to cover every possible attention variant, residual/norm variant, or architecture variant of the transformer - there are so many out there that this would not be feasible with a single layer. We will continue to add separate layers for important variants of the transformer blocks. And it is always 100% OK to simply fork the TransformerEncoder/TransformerDecoder layers with the precise changes you need.
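
For the forking/wrapping route, here's a minimal sketch of what that could look like. The class name is hypothetical and not part of KerasNLP - it simply projects to the desired dim before applying the stock encoder:

import tensorflow as tf
from tensorflow import keras
import keras_nlp

# Hypothetical wrapper, not a KerasNLP API: project to `embed_dim`, then run
# the stock TransformerEncoder on the projected sequence.
class ProjectedTransformerEncoder(keras.layers.Layer):
    def __init__(self, intermediate_dim, num_heads, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.projection = keras.layers.Dense(embed_dim)
        self.encoder = keras_nlp.layers.TransformerEncoder(
            intermediate_dim=intermediate_dim, num_heads=num_heads
        )

    def call(self, inputs):
        return self.encoder(self.projection(inputs))

layer = ProjectedTransformerEncoder(intermediate_dim=256, num_heads=8, embed_dim=128)
print(layer(tf.random.uniform(shape=[1, 49, 1280])).shape)  # (1, 49, 128)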

mattdangerw avatar Jul 21 '22 22:07 mattdangerw