neural-fortran
Fc2d layer
Fully-Connected Layer for 2D Shapes
Also known as MLP, FeedForward, etc. A common component of neural networks, including transformers. The idea is very simple: first linear transformation => activation => second linear transformation.
This is the last piece of the transformer architecture. Once #203, #205, and this one are merged, we can start adding transformer encoders and decoders.
Python reference: https://github.com/OneAdder/neural-fortran-references/blob/main/fc2d_layer.py
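For readers who want a self-contained illustration without following the link, here is a minimal NumPy sketch of the idea; the function and variable names are illustrative only, not the actual neural-fortran API:

```python
import numpy as np

def fc2d_forward(x, w1, b1, w2, b2, activation=np.tanh):
    """First linear transformation => activation => second linear transformation."""
    # x: (sequence_length, model_dimension) -- a 2D input such as a sequence of embeddings
    hidden = activation(x @ w1 + b1)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                       # 4 positions, model dimension 8
w1 = rng.normal(size=(8, 16)); b1 = np.zeros(16)  # hidden dimension 16
w2 = rng.normal(size=(16, 8)); b2 = np.zeros(8)
print(fc2d_forward(x, w1, b1, w2, b2).shape)      # (4, 8)
```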
Problem
The softmax derivative here is incorrect. This implementation is actually the derivative of the logistic function, which is not equivalent to the softmax derivative. The derivative of softmax w.r.t. each element of the input requires computing the Jacobian matrix:
$jacobian_{i, j} = \begin{pmatrix} \frac{dsoftmax_1}{dx_1} & \dots & \frac{dsoftmax_1}{dx_j} \\ \vdots & \ddots & \vdots \\ \frac{dsoftmax_i}{dx_1} & \dots & \frac{dsoftmax_i}{dx_j} \end{pmatrix}$

$\frac{dsoftmax}{dx} = gradient \times jacobian$
Where:
- $\frac{dsoftmax_i}{dx_j} = softmax(x_j) \cdot (\alpha - softmax(x_i))$, where $\alpha$ is $1$ for $i = j$ and $0$ otherwise
- $x$ is the input sequence
Similar to my implementation for MultiHead Attention here.
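For illustration, a minimal NumPy sketch of the backward pass described above; the `softmax` and `softmax_backward` names exist only for this example, and the actual fix belongs in the Fortran activation code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_backward(x, gradient):
    # Jacobian of softmax w.r.t. the input: J[i, j] = softmax(x_i) * (delta_ij - softmax(x_j))
    s = softmax(x)
    jacobian = np.diag(s) - np.outer(s, s)
    # dsoftmax/dx = gradient x jacobian (the vector-Jacobian product)
    return gradient @ jacobian

x = np.array([1.0, 2.0, 3.0])
upstream = np.array([0.1, -0.2, 0.3])
print(softmax_backward(x, upstream))
```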
Possible Solutions
It is not easy to resolve because `activation_function` doesn't accept the input, so:
- Do nothing; I added a crutch that throws an error when `softmax` is passed as the activation
- Make softmax a layer without parameters rather than an activation function; this will work (see the sketch after this list)
- Make a wrapper `activation_layer` that extends `base_layer` and accepts an activation function
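To illustrate the second option, a parameterless softmax layer could look roughly like the following NumPy sketch; `SoftmaxLayer` is hypothetical, not the actual neural-fortran type. The point is that a layer keeps its own output around, so its backward pass can form the vector-Jacobian product, which the current `activation_function` interface cannot do:

```python
import numpy as np

class SoftmaxLayer:
    """Hypothetical parameterless layer: softmax as a layer rather than an activation function."""

    def forward(self, x):
        # x: (sequence_length, model_dimension); softmax over the last axis
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        self.output = e / e.sum(axis=-1, keepdims=True)
        return self.output

    def backward(self, gradient):
        # Per-row vector-Jacobian product: dL/dx_j = sum_i g_i * s_i * (delta_ij - s_j)
        s = self.output
        dot = (gradient * s).sum(axis=-1, keepdims=True)
        return s * (gradient - dot)

layer = SoftmaxLayer()
x = np.arange(6, dtype=float).reshape(2, 3)
y = layer.forward(x)
grad_in = layer.backward(np.ones_like(y))
print(y.shape, grad_in.shape)  # (2, 3) (2, 3)
```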
@OneAdder Please forgive my ignorance here. Could you please clarify the distinction between the fc2d layer and the dense layer?
@jvdp1 The terms are not particularly well defined here in practice. This is also sometimes called dense. The mathematical distinction is that dense in neural-fortran is linear transformation => activation while my fc2d is linear transformation => activation => linear transformation. Theoretically the same as dense(some_activation) => dense(linear_activation).
The key difference is from a software development perspective: fc2d works with 2D shapes, while dense can't handle those.
Thanks @OneAdder for starting this. From your explanation I understand what this does.
Rather than introducing a composition of multiple operations as a single layer, I suggest that we build a basic building block first, and then if needed, we can add a "shallow-wrapper" layer around those elementary layers.
Specifically, rather than introducing here a new layer that does "first linear transformation => activation => second linear transformation", I suggest we simply introduce a dense2d layer which is the same as dense but that works on 2-d inputs.
Then, the operation proposed here would be: dense2d(activation) => dense2d(linear). We already have a linear activation function which allows using existing dense layers as linear layers. What do you think?
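To make the proposed equivalence concrete, here is a rough NumPy sketch; the `dense2d` name and signature are hypothetical and do not reflect the final API:

```python
import numpy as np

def dense2d(x, w, b, activation=lambda z: z):
    # Same as dense, but x has shape (sequence_length, input_dim) instead of (input_dim,);
    # the default activation is linear (identity)
    return activation(x @ w + b)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w1 = rng.normal(size=(8, 16)); b1 = np.zeros(16)
w2 = rng.normal(size=(16, 8)); b2 = np.zeros(8)

# fc2d(x) as the composition dense2d(activation) => dense2d(linear)
hidden = dense2d(x, w1, b1, activation=np.tanh)
out = dense2d(hidden, w2, b2)
print(out.shape)  # (4, 8)
```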
And thanks for pointing out the incorrect softmax derivative. I don't even recall how and why I did that.
@milancurcic It makes sense. I can do it. Should we merge this and then refactor it or the other way around?
BTW, I think we should actually make a consistent API for combined layers. Something along the lines of the following: base_layer is inherited by combined_layer, which implements the get and set methods for params and gradients, pointing to the params of the layers that make up the combined layer. Combined layers then extend the combined_layer class.
Thanks, @OneAdder. If you agree, I suggest that here we simply provide a 2-d version of an existing dense layer (I suggest dense2d) which accepts an activation function as well as a linear activation as a special case.
Good ideas for combined_layer but let's discuss it in a separate issue. I opened #217.
Actually, since we already have linear2d, should we just refactor it to accept an activation function and thus call it dense2d? Then, creating dense2d(..., activation="linear") would give us linear2d.
@milancurcic on it