Implementing Leaky Relu, parametric and other forms of Relu
I am working on an implementation of LeakyRelu and would like some input on how to go about it. There are 2 options:
- A separate layer named LeakyRelu, ParamRelu etc. for each of the Relu variations.
- One single Relu layer that takes optional params and implements them. (This would greatly reduce duplicated code, but it also reduces the visibility of the variants to end users unless they spend some time in the documentation.)
Keras and Pytorch seem to have separate layers for each of the Relu variations, but I am inclined more towards a single Relu with the right parameters. What would you guys suggest?
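For concreteness, here is a minimal sketch of the second option: a single generalized Relu whose leak defaults to 0. The parameter name `alpha` and the numpy-based forward/backward functions are illustrative only, not Thinc's actual API:

```python
import numpy as np

def relu(X, alpha=0.0):
    # alpha=0.0 gives the standard ReLU; alpha > 0 gives a Leaky ReLU.
    return np.where(X > 0, X, alpha * X)

def relu_grad(dY, X, alpha=0.0):
    # The derivative is 1 where X > 0 and alpha elsewhere.
    return np.where(X > 0, dY, alpha * dY)
```

With this shape, `relu(X)` and `relu(X, alpha=0.1)` share one code path, which is the deduplication argument for the single-layer design.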
Thanks for the question, I think it's definitely something to think about.
Currently in Thinc we use a single layer definition for the weights and the activation. This helps to set the initialization defaults a little bit smarter, because the choice of activation typically impacts the best initialization strategy. This does make it awkward to keep accumulating these activation variants though.
- How many variants would we want?
- Are they all important to have, or are some strictly inferior?
- Do people mostly think of them as the same activation (relu), or do they think of it as a different thing?
Another awkward problem with putting it all in the Relu layer is defaults. Presumably if people do use LeakyRelu, they mostly use the same leak parameter, right? We can't have a helpful default for that if we instead default that parameter to 0. And I don't want to have both a flag and a separate parameter for the leak.
@honnibal The regular Relu is only a special case of the Leaky Relu where the alpha parameter is 0. So what I have done as of now is keep the default at 0. When users do need a LeakyRelu, they write:
`Relu(alphaLeaky=0.1)`
But again, this might bloat up or conflict if in the future someone would like to implement the other Relu variants, or whatever future variations come along: https://keras.io/layers/advanced-activations/
Yeah, keeping both the flag and the param is a very bad idea. In some cases we can do away with an explicit flag and infer the behaviour from the params, but again, things might conflict in the future.
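One of the variants on that Keras page, PReLU, learns the leak rather than fixing it, which is part of why a single scalar parameter may not cover every case. A rough sketch of what learning the leak involves (class name, numpy usage, and the scalar-alpha simplification are all illustrative, not Thinc's API):

```python
import numpy as np

class PRelu:
    """Minimal parametric ReLU where the leak is a learned scalar."""

    def __init__(self, alpha=0.25):
        self.alpha = alpha
        self.d_alpha = 0.0

    def forward(self, X):
        self.X = X
        return np.where(X > 0, X, self.alpha * X)

    def backward(self, dY):
        # Gradient w.r.t. alpha accumulates only from negative inputs,
        # since the output there is alpha * x.
        self.d_alpha = float(np.sum(dY * np.where(self.X > 0, 0.0, self.X)))
        return np.where(self.X > 0, dY, self.alpha * dY)
```

A fixed-leak LeakyRelu only needs the forward/backward pair; the learned variant additionally needs somewhere to store and update `alpha`, which is a design difference a single `Relu(alphaLeaky=...)` signature doesn't capture.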
Hi @honnibal any update on this? Would love to complete this with all the extra time the lockdowns are giving us.
Hey @naveenjafer,
We have not implemented parametric ReLU functions, but have added a bunch of activations since:
- Swish
- Gelu
- Dish (this is our custom, more efficient Swish using `sqrt` instead of `exp`)
- HardSwish
- HardSwishMobilenet
- HardSigmoid
- HardTanh
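To illustrate the `sqrt`-vs-`exp` point: Swish is `x * sigmoid(x)`, which needs an `exp` call per element, while a `sqrt`-based sibling can trace a similar S-shaped curve more cheaply. The `dish_like` formula below is one plausible form of such an approximation, not necessarily Thinc's exact Dish definition; check the Thinc source for the real one:

```python
import math

def swish(x):
    # Swish: x * sigmoid(x), requiring an exp() call.
    return x / (1.0 + math.exp(-x))

def dish_like(x):
    # A sqrt-based approximation of the same shape
    # (hypothetical form; see Thinc's source for the actual Dish).
    return 0.5 * x * (x / math.sqrt(1.0 + x * x) + 1.0)
```

Both functions pass through 0 at the origin, approach the identity for large positive inputs, and flatten towards 0 for large negative inputs.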