rational_activations
                        Comparison to Mish activation
Mish is currently the most popular activation function, so it would be good if you could also compare against it.
Yes, we want to compare against it too; I'll upload updated graphs soon.
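In case it helps anyone setting up such a comparison, here is a minimal sketch of a side-by-side setup where only the activation module differs. It assumes PyTorch >= 1.9 (for `nn.Mish`) and the `Rational` module from `rational.torch`; the MLP architecture and sizes are purely illustrative.

```python
# Sketch of a drop-in comparison between a fixed Mish activation and a
# learnable Rational activation; the network and sizes are illustrative only.
import torch
import torch.nn as nn
from rational.torch import Rational

def make_mlp(activation_factory, in_dim=64, hidden=128, out_dim=1):
    """Build the same MLP, differing only in the activation module used."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), activation_factory(),
        nn.Linear(hidden, hidden), activation_factory(),
        nn.Linear(hidden, out_dim),
    )

mish_net = make_mlp(nn.Mish)        # fixed-activation baseline
rational_net = make_mlp(Rational)   # learnable rational activations (defaults assumed)

# The Rational modules contribute their own parameters, so they are trained
# together with the rest of the network by the same optimizer.
optimizer = torch.optim.Adam(rational_net.parameters(), lr=1e-3)
```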
@k4ntz It performed better than Mish in my case (also an RL-like setting).
Any tips on how to initialize? I used kaiming_uniform_ for the Linear layers.
Hi @kayuksel, thanks for this info! Could you share some graphs or a link to results (even a draft) showing that? We are also working on a comparison against GeLU in transformers. For the initialisation, you can use xavier as in our imagenet classification task (https://github.com/ml-research/rational_sl/blob/main/imagenet/train_imagenet.py) and cifar (https://github.com/ml-research/rational_sl/blob/main/cifar/train.py). If I remember correctly, it empirically works better than kaiming (for big nets). Please share any other results you might have!
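For reference, a hedged sketch of applying the suggested Xavier initialisation to the `Linear` layers only, while the `Rational` modules keep their own default initialisation. The helper name is illustrative and not taken from the linked scripts.

```python
# Sketch only: Xavier initialisation for the Linear layers, as suggested above.
# The Rational activation parameters are left with their default initialisation.
import torch.nn as nn
from rational.torch import Rational

net = nn.Sequential(
    nn.Linear(64, 128), Rational(),
    nn.Linear(128, 10),
)

def init_linear_xavier(module):
    # `init_linear_xavier` is an illustrative helper name, not from the repo.
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

net.apply(init_linear_xavier)
```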
@k4ntz I can provide the learning curves using both, but I am unable to provide a draft publication yet as it is patent-pending. Thanks for suggesting the initialization function; I will also try that.
@k4ntz FYI, my case is an adversarial setting, so it is important that the model re-adapts itself continuously. Thus, having adaptable activation functions as well (besides the network weights themselves) may be good in that sense.
Also, I prefer the network to overfit, so generalization is not a major concern (in case using an adaptable activation function increases the chance of over-fitting for certain tasks).
Yes, using such activation functions provides the network with more modelling capacity; the transformation of the manifold through the layers can be more accurate if needed. I don't think Rational AFs overfit per se, but if they are used in a network cherry-picked to perform well on a task, this additional modelling power can lead to overfitting. We are also working on pruning rational nets. Whenever results are available, they will be shared. :)