frugally-deep
Feature Suggestion: Support Transformer Models
First off, I would like to say that this is a really great piece of work! I have been using it with LSTMs for time-series data and have found frugally-deep to be invaluable. I am starting to investigate Transformers to see how they stack up against LSTMs, and it would be wonderful if support for Transformer models could be added. I am in the early stages of working with Transformers, but the specific layers that I currently do not see supported are: MultiHeadAttention and LayerNormalization.
Hi, and thanks for the nice feedback! :blush: I'm happy the lib is of help to you.
The LayerNormalization layer looks doable.
Regarding the MultiHeadAttention layer: so far, I have no idea how it works, i.e., I'd have to learn about it first.
Or would you be interested in giving the implementation a try?
Unfortunately, my C++ programming skills are not developed enough right now, so I would be coming up to speed with Transformers and improving my C++ at the same time.
Ok, I see. :+1:
What is the priority/importance of this for you? Is it that you'd try it out as a fun experiment, or is some big company project at stake? :wink:
I would like to use it in a work project provided the performance of Transformers is at least as good as LSTMs for the time-series data that I am working with. I am just getting started with my learning and investigation into Transformers, but based on the rave reviews I expect them to perform very well. I will likely be spending the next couple of months part-time developing and testing models in Tensorflow before I would really start thinking about converting a .h5 model to a .json model.
Hi, actually I'm dealing with the same issue. I'm trying to use Transformers for my Ph.D., and converting the Keras model to a C++ one would be a critical step.
It would be great for our research team if transformer layers could be included in the library. Transformers have also become very popular in recent months.
Yeah, it would be cool to have frugally-deep support such models. :+1:
The following layer types would suffice, right?
- tf.keras.layers.LayerNormalization
- tf.keras.layers.UnitNormalization
- tf.keras.layers.AdditiveAttention
- tf.keras.layers.Attention
- tf.keras.layers.MultiHeadAttention
I would like to help if you need anything. :-)
Yeah, that's what we really need, plus the Embedding layer, but that one is already in place, so thank you!!
@ahmed-masud That sounds great! :heart: Which layer types would you be interested in implementing? I'd be happy about a pull request for any of the not-yet-supported layer types. If I can be of any help to get you started, please let me know. :slightly_smiling_face:
I think I will probably take a crack at tf.keras.layers.Attention, as it seems to be the simplest. :-)
Hi @Dobiasd, I would be happy to contribute since this will be very helpful for my work. I'll take a look into the Normalization layers and will be in touch if I have any luck. :)
Ok, I've implemented support for the Attention layer (only with use_scale=False and score_mode='dot'), and will see if I can do AdditiveAttention and MultiHeadAttention too. :mechanic:
But if anyone wants to take over, I'd be happy. Just let me know when you do, so we avoid duplicate work. :slightly_smiling_face:
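For context, the dot-product scoring mode (use_scale=False, score_mode='dot') boils down to scores = Q * K^T, a row-wise softmax, and a weighted sum of the value vectors. Here is a minimal Eigen-based sketch of that computation; it is not frugally-deep's actual code, and the function name and layout conventions are assumptions for illustration only:

```cpp
#include <Eigen/Dense>

// Minimal sketch of dot-product attention (use_scale=False, score_mode='dot').
// Rows of query/key/value are timesteps, columns are features.
Eigen::MatrixXf dot_attention(const Eigen::MatrixXf& query,  // [tq, dim]
                              const Eigen::MatrixXf& key,    // [tv, dim]
                              const Eigen::MatrixXf& value)  // [tv, dim_v]
{
    // Attention scores: one row per query timestep, one column per key timestep.
    Eigen::MatrixXf scores = query * key.transpose();         // [tq, tv]

    // Row-wise softmax (subtracting the row max for numerical stability).
    for (Eigen::Index i = 0; i < scores.rows(); ++i) {
        Eigen::RowVectorXf row =
            (scores.row(i).array() - scores.row(i).maxCoeff()).exp().matrix();
        scores.row(i) = row / row.sum();
    }

    // Each output row is a weighted sum of the value rows.
    return scores * value;                                     // [tq, dim_v]
}
```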
Just to keep you updated: I just added support for use_scale=True and score_mode='concat' to our Attention layer implementation. :heavy_check_mark:
Update:
- I added support for AdditiveAttention some time ago. :white_check_mark:
- Currently, I'm trying to understand/implement MultiHeadAttention, which looks quite complex to me, not only because of the call-convention quirk. :sweat_smile:
FYI, I'm just a curious bystander, but I'd love to see your LayerNorm implementation, or hear any ideas you have for how you would implement it.
Hi @sevagh: If you'd like to give the implementation of LayerNormalization a go, I suggest first having a look at "How to use custom layers?" in the FAQ. Implementing a new layer (an existing TF layer) is very similar to implementing a custom layer. If you have any specific questions, please let me know. Also feel free to push your unfinished implementation as a draft Pull Request, so we can look at it together. :slightly_smiling_face:
My project is slightly different (I'm implementing a PyTorch neural network using Eigen/C++). As such none of my code is directly applicable to this repo, but I occasionally look at your code to learn Eigen syntax and tricks.
In my particular case, the issue in my naive layer_norm was the unstable variance calculation for the input, leading to different output compared to the PyTorch LayerNorm.
Using Welford's algorithm helped improve my results.
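For readers unfamiliar with it: Welford's algorithm updates a running mean and a running sum of squared deviations in a single pass, which avoids the catastrophic cancellation of the naive sum-of-squares formula. A minimal sketch in plain C++ (not code from either project; names are illustrative):

```cpp
#include <cstddef>
#include <vector>

struct Moments
{
    float mean;
    float variance;  // population variance, i.e. divided by n
};

// Welford's online algorithm for numerically stable mean/variance.
Moments welford_moments(const std::vector<float>& xs)
{
    float mean = 0.0f;
    float m2 = 0.0f;  // running sum of squared deviations from the mean
    std::size_t n = 0;
    for (float x : xs) {
        ++n;
        const float delta = x - mean;
        mean += delta / static_cast<float>(n);
        m2 += delta * (x - mean);  // uses the updated mean
    }
    return {mean, n > 0 ? m2 / static_cast<float>(n) : 0.0f};
}
```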
💡 Ah, thanks. 👍 I'll try to implement this and let you know when done.
BTW, I will be happy to share my code as soon as it's ready. Although, like I said, the goals are different from frugally-deep's, I have implemented (out of necessity for my application) a lot of transformer operations, e.g., multi-head attention.
Thanks. Your MultiHeadAttention might be interesting, since I've not yet managed to understand the TensorFlow implementation well enough to implement it in frugally-deep too.
Yesterday and today I worked on implementing LayerNormalization. But I first need to implement the needed prerequisites, like multi-axis moments (mean and variance), and extend my BatchNormalization implementation so that I can forward to it from LayerNormalization.
I'll ping you when it's finished.
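For reference, once the moments are available, the computation itself is simple: each sample is normalized with its own mean and variance over the chosen axes, then scaled and shifted by the learned gamma and beta. A rough Eigen sketch for the last-axis case follows; it is not the eventual frugally-deep code, the names are illustrative, and the epsilon default mirrors Keras's 1e-3:

```cpp
#include <Eigen/Dense>
#include <cmath>

// Rough sketch of layer normalization over the last axis for one feature
// vector; gamma and beta are the learned scale and offset.
Eigen::VectorXf layer_norm(const Eigen::VectorXf& x,
                           const Eigen::VectorXf& gamma,
                           const Eigen::VectorXf& beta,
                           float epsilon = 1e-3f)  // Keras's default epsilon
{
    const float mean = x.mean();
    const float variance = (x.array() - mean).square().mean();
    const Eigen::ArrayXf normalized =
        (x.array() - mean) / std::sqrt(variance + epsilon);
    return (normalized * gamma.array() + beta.array()).matrix();
}
```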
FYI: LayerNormalization is now available in frugally-deep. :tada:
FYI: UnitNormalization is now also available in frugally-deep. :tada: (Only MultiHeadAttention is still missing. :mechanic:)
Here's the copy-pasted subsection of multi-head attention from my code: https://gist.github.com/sevagh/b71d253a347a9b59c026580625452fc5
It's messy, but I hope part of it helps. Let me know!
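For readers following along, the multi-head attention computation itself (independent of any particular framework or of the gist above) can be summarized as: project the input with learned matrices, split the projections into heads, run scaled dot-product attention per head, concatenate the head outputs, and apply an output projection. Below is a simplified Eigen sketch for self-attention on a single sequence, without masking or bias terms; all names are illustrative and this is not the frugally-deep or demucs.cpp implementation:

```cpp
#include <Eigen/Dense>
#include <cmath>

// Simplified multi-head self-attention sketch: project, split into heads,
// run scaled dot-product attention per head, concatenate, project back.
Eigen::MatrixXf multi_head_attention(
    const Eigen::MatrixXf& x,    // [seq_len, d_model]
    const Eigen::MatrixXf& w_q,  // [d_model, d_model]
    const Eigen::MatrixXf& w_k,  // [d_model, d_model]
    const Eigen::MatrixXf& w_v,  // [d_model, d_model]
    const Eigen::MatrixXf& w_o,  // [d_model, d_model]
    int num_heads)
{
    const Eigen::Index d_model = x.cols();
    const Eigen::Index d_head = d_model / num_heads;
    const float scale = 1.0f / std::sqrt(static_cast<float>(d_head));

    const Eigen::MatrixXf q = x * w_q;
    const Eigen::MatrixXf k = x * w_k;
    const Eigen::MatrixXf v = x * w_v;

    Eigen::MatrixXf concat(x.rows(), d_model);
    for (int h = 0; h < num_heads; ++h) {
        // This head's slice of the projected queries, keys, and values.
        const Eigen::MatrixXf qh = q.middleCols(h * d_head, d_head);
        const Eigen::MatrixXf kh = k.middleCols(h * d_head, d_head);
        const Eigen::MatrixXf vh = v.middleCols(h * d_head, d_head);

        // Scaled dot-product attention with a row-wise softmax.
        Eigen::MatrixXf scores = (qh * kh.transpose()) * scale;
        for (Eigen::Index i = 0; i < scores.rows(); ++i) {
            Eigen::RowVectorXf row =
                (scores.row(i).array() - scores.row(i).maxCoeff()).exp().matrix();
            scores.row(i) = row / row.sum();
        }
        concat.middleCols(h * d_head, d_head) = scores * vh;
    }
    return concat * w_o;  // final output projection
}
```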
By the way, my project is now released, so I can link more of the crosstransformer:
- https://github.com/sevagh/demucs.cpp/blob/main/src/crosstransformer.cpp
- https://github.com/sevagh/demucs.cpp/blob/main/src/crosstransformer.hpp
- These rely on the common encoder layer (same as I linked above but optimized): https://github.com/sevagh/demucs.cpp/blob/main/src/layers.cpp#L334
These correspond to the following transformer classes in PyTorch in the Demucs model: https://github.com/facebookresearch/demucs/blob/main/demucs/transformer.py
- https://github.com/facebookresearch/demucs/blob/main/demucs/transformer.py#L271
- https://github.com/facebookresearch/demucs/blob/main/demucs/transformer.py#L380
- https://github.com/facebookresearch/demucs/blob/main/demucs/transformer.py#L526
These build on (but are not identical to) the base Transformer classes of PyTorch, but some of it might be helpful!
Support for MultiHeadAttention layers is now implemented in the latest version of frugally-deep. :tada: