frugally-deep
Feature Suggestion: Support Transformer Models
First off, I would like to say that this is a really great piece of work! I have been using it with LSTMs for time-series data and have found frugally-deep to be invaluable. I am starting to investigate Transformers to see how they stack up against LSTMs, and it would be wonderful if support for Transformer models could be added. I am in the early stages of working with Transformers, but the specific layers that I currently do not see supported are: MultiHeadAttention and LayerNormalization.
Hi, and thanks for the nice feedback! :blush: I'm happy the lib is of help to you.
The LayerNormalization layer looks doable.
Regarding the MultiHeadAttention layer: so far, I have no idea how it works, i.e., I'd have to learn about it first.
Or would you be interested in giving the implementation a try?
Unfortunately, my C++ programming skills are not developed enough right now, so I would be coming up to speed with Transformers and improving my C++ at the same time.
Ok, I see. :+1:
What is the priority/importance of this for you? Is it that you'd try it out as a fun experiment, or is some big company project at stake? :wink:
I would like to use it in a work project provided the performance of Transformers is at least as good as LSTMs for the time-series data that I am working with. I am just getting started with my learning and investigation into Transformers, but based on the rave reviews I expect them to perform very well. I will likely be spending the next couple of months part-time developing and testing models in Tensorflow before I would really start thinking about converting a .h5 model to a .json model.
Hi, actually I'm dealing with the same issue. I'm trying to use Transformers for my Ph.D., and converting the Keras model to a C++ one would be a critical step.
It would be great for our research team if transformer layers could be included in the library. Transformers have also become very popular in recent months.
Yeah, it would be cool to have frugally-deep support such models. :+1:
The following layer types would suffice, right?
- tf.keras.layers.LayerNormalization
- tf.keras.layers.UnitNormalization
- tf.keras.layers.AdditiveAttention
- tf.keras.layers.Attention
- tf.keras.layers.MultiHeadAttention
I would like to help if you need anything. :-)
Yeah, that's what we really need, plus the Embedding layer, but that one is already in place, so thank you!!
@ahmed-masud That sounds great! :heart: Which layer types would you be interested in implementing? I'd be happy about a pull request for any of the not-yet-supported layer types. If I can be of any help to get you started, please let me know. :slightly_smiling_face:
I think I will probably take a crack at tf.keras.layers.Attention, as it seems to be the simplest. :-)
Hi @Dobiasd, I would be happy to contribute since this will be very helpful for my work. I'll take a look into the Normalization layers and will be in touch if I have any luck. :)
Ok, I've implemented support for the Attention layer (only with use_scale=False and score_mode='dot'), and will see if I can do AdditiveAttention and MultiHeadAttention too. :mechanic:
But if anyone wants to take over, I'd be happy. Just let me know when you do, so we avoid duplicate work. :slightly_smiling_face:
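For context, the dot-product scoring mode (use_scale=False, score_mode='dot') boils down to scores = Q * K^T, a row-wise softmax, and a weighted sum of the value vectors. Here is a minimal Eigen-based sketch of that computation; it is not frugally-deep's actual code, and the function name and layout conventions are assumptions for illustration only:

```cpp
#include <Eigen/Dense>

// Minimal sketch of dot-product attention (use_scale=False, score_mode='dot').
// Rows of query/key/value are timesteps, columns are features.
Eigen::MatrixXf dot_attention(const Eigen::MatrixXf& query,  // [tq, dim]
                              const Eigen::MatrixXf& key,    // [tv, dim]
                              const Eigen::MatrixXf& value)  // [tv, dim_v]
{
    // Attention scores: one row per query timestep, one column per key timestep.
    Eigen::MatrixXf scores = query * key.transpose();         // [tq, tv]

    // Row-wise softmax (subtracting the row max for numerical stability).
    for (Eigen::Index i = 0; i < scores.rows(); ++i) {
        Eigen::RowVectorXf row =
            (scores.row(i).array() - scores.row(i).maxCoeff()).exp().matrix();
        scores.row(i) = row / row.sum();
    }

    // Each output row is a weighted sum of the value rows.
    return scores * value;                                     // [tq, dim_v]
}
```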
Just to keep you updated: I just added support for use_scale=True and score_mode='concat' to our Attention layer implementation. :heavy_check_mark:
Update:
- I added support for AdditiveAttention some time ago. :white_check_mark:
- Currently, I'm trying to understand/implement MultiHeadAttention, which looks quite complex to me, not only because of the call-convention quirk. :sweat_smile:
FYI, I'm just a curious bystander, but I'd love to see your LayerNorm implementation, or hear any ideas you have for how you would implement it.
Hi @sevagh: If you'd like to give the implementation of LayerNormalization a go, I suggest first having a look at "How to use custom layers?" in the FAQ. Implementing a new layer (an existing TF layer) is very similar to implementing a custom layer. If you have any specific questions, please let me know. Also feel free to push your unfinished implementation as a draft Pull Request, so we can look at it together. :slightly_smiling_face:
My project is slightly different (I'm implementing a PyTorch neural network using Eigen/C++). As such none of my code is directly applicable to this repo, but I occasionally look at your code to learn Eigen syntax and tricks.
In my particular case, the issue in my naive layer_norm was the unstable variance calculation for the input, leading to different output compared to the PyTorch LayerNorm.
Using Welford's algorithm helped improve my results.
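For readers unfamiliar with it: Welford's algorithm updates a running mean and a running sum of squared deviations in a single pass, which avoids the catastrophic cancellation of the naive sum-of-squares formula. A minimal sketch in plain C++ (not code from either project; names are illustrative):

```cpp
#include <cstddef>
#include <vector>

struct Moments
{
    float mean;
    float variance;  // population variance, i.e. divided by n
};

// Welford's online algorithm for numerically stable mean/variance.
Moments welford_moments(const std::vector<float>& xs)
{
    float mean = 0.0f;
    float m2 = 0.0f;  // running sum of squared deviations from the mean
    std::size_t n = 0;
    for (float x : xs) {
        ++n;
        const float delta = x - mean;
        mean += delta / static_cast<float>(n);
        m2 += delta * (x - mean);  // uses the updated mean
    }
    return {mean, n > 0 ? m2 / static_cast<float>(n) : 0.0f};
}
```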
💡 Ah, thanks. 👍 I'll try to implement this and let you know when done.
BTW, I will be happy to share my code as soon as it's ready. Although, like I said, the goals are different from frugally-deep's, I have implemented (out of necessity for my application) a lot of transformer operations, e.g., multi-head attention.
Thanks. Your MultiHeadAttention might be interesting, since I've not yet managed to understand the TensorFlow implementation well enough to implement it in frugally-deep too.
Yesterday and today I worked on implementing LayerNormalization. But I first need to implement the needed prerequisites, like multi-axis moments (mean and variance), and extend my BatchNormalization implementation so that I can forward to it from LayerNormalization.
I'll ping you when it's finished.
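For reference, once the moments are available, the computation itself is simple: each sample is normalized with its own mean and variance over the chosen axes, then scaled and shifted by the learned gamma and beta. A rough Eigen sketch for the last-axis case follows; it is not the eventual frugally-deep code, the names are illustrative, and the epsilon default mirrors Keras's 1e-3:

```cpp
#include <Eigen/Dense>
#include <cmath>

// Rough sketch of layer normalization over the last axis for one feature
// vector; gamma and beta are the learned scale and offset.
Eigen::VectorXf layer_norm(const Eigen::VectorXf& x,
                           const Eigen::VectorXf& gamma,
                           const Eigen::VectorXf& beta,
                           float epsilon = 1e-3f)  // Keras's default epsilon
{
    const float mean = x.mean();
    const float variance = (x.array() - mean).square().mean();
    const Eigen::ArrayXf normalized =
        (x.array() - mean) / std::sqrt(variance + epsilon);
    return (normalized * gamma.array() + beta.array()).matrix();
}
```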
FYI: LayerNormalization is now available in frugally-deep. :tada:
FYI: UnitNormalization is now also available in frugally-deep. :tada: (Only MultiHeadAttention is still missing. :mechanic:)
Here's the copy-pasted subsection of multi-head attention from my code: https://gist.github.com/sevagh/b71d253a347a9b59c026580625452fc5
It's messy, but I hope part of it helps. Let me know!
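For readers following along, the multi-head attention computation itself (independent of any particular framework or of the gist above) can be summarized as: project the input with learned matrices, split the projections into heads, run scaled dot-product attention per head, concatenate the head outputs, and apply an output projection. Below is a simplified Eigen sketch for self-attention on a single sequence, without masking or bias terms; all names are illustrative and this is not the frugally-deep or demucs.cpp implementation:

```cpp
#include <Eigen/Dense>
#include <cmath>

// Simplified multi-head self-attention sketch: project, split into heads,
// run scaled dot-product attention per head, concatenate, project back.
Eigen::MatrixXf multi_head_attention(
    const Eigen::MatrixXf& x,    // [seq_len, d_model]
    const Eigen::MatrixXf& w_q,  // [d_model, d_model]
    const Eigen::MatrixXf& w_k,  // [d_model, d_model]
    const Eigen::MatrixXf& w_v,  // [d_model, d_model]
    const Eigen::MatrixXf& w_o,  // [d_model, d_model]
    int num_heads)
{
    const Eigen::Index d_model = x.cols();
    const Eigen::Index d_head = d_model / num_heads;
    const float scale = 1.0f / std::sqrt(static_cast<float>(d_head));

    const Eigen::MatrixXf q = x * w_q;
    const Eigen::MatrixXf k = x * w_k;
    const Eigen::MatrixXf v = x * w_v;

    Eigen::MatrixXf concat(x.rows(), d_model);
    for (int h = 0; h < num_heads; ++h) {
        // This head's slice of the projected queries, keys, and values.
        const Eigen::MatrixXf qh = q.middleCols(h * d_head, d_head);
        const Eigen::MatrixXf kh = k.middleCols(h * d_head, d_head);
        const Eigen::MatrixXf vh = v.middleCols(h * d_head, d_head);

        // Scaled dot-product attention with a row-wise softmax.
        Eigen::MatrixXf scores = (qh * kh.transpose()) * scale;
        for (Eigen::Index i = 0; i < scores.rows(); ++i) {
            Eigen::RowVectorXf row =
                (scores.row(i).array() - scores.row(i).maxCoeff()).exp().matrix();
            scores.row(i) = row / row.sum();
        }
        concat.middleCols(h * d_head, d_head) = scores * vh;
    }
    return concat * w_o;  // final output projection
}
```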
By the way, my project is now released, so I can link more of the crosstransformer:
- https://github.com/sevagh/demucs.cpp/blob/main/src/crosstransformer.cpp
- https://github.com/sevagh/demucs.cpp/blob/main/src/crosstransformer.hpp
- These rely on the common encoder layer (same as I linked above but optimized): https://github.com/sevagh/demucs.cpp/blob/main/src/layers.cpp#L334
These correspond to the following transformer classes in PyTorch in the Demucs model: https://github.com/facebookresearch/demucs/blob/main/demucs/transformer.py
- https://github.com/facebookresearch/demucs/blob/main/demucs/transformer.py#L271
- https://github.com/facebookresearch/demucs/blob/main/demucs/transformer.py#L380
- https://github.com/facebookresearch/demucs/blob/main/demucs/transformer.py#L526
These build on (but are not identical to) the base Transformer classes of PyTorch, but some of it might be helpful!
Support for MultiHeadAttention layers is now implemented in the latest version of frugally-deep. :tada: