Improve activation implementations by making better use of Eigen
Tanh uses a lot of compute (hence "fast tanh"). But this e.g.:
https://github.com/sdatkinson/NeuralAmpModelerCore/blob/846968710a670d662b15e449edba852d747d748e/NAM/activations.h#L75-L81
should be able to be implemented more idiomatically with Eigen.
It would also be nice for any change related to this Issue to have some profiling results attached, since this is a performance Issue.
It looks like Eigen can do this via what it calls "packet operations" and support AVX512 and other enhanced instruction sets on the hardware you are running on automatically (I'm guessing based on a quick perusal of their code). This would be a good way to get some parallel implementation on machines that support it.
I wasn't sure from your code whether anything ever calls directly into the version of apply that takes the C-style pointers, or if everything goes through the overloads with Eigen parameters. If callers always use the Eigen overloads, you might consider making this C-style function private or protected, so you don't need to use it in the places where you have direct Eigen implementations of the activations.
I revisited this a couple of months ago.
Eigen's built-in tanh uses an approximation (at least on some platforms) that is more accurate than the existing NAM fast tanh.
My recollection is that when I benchmarked it, it performed somewhat worse than the NAM fast tanh on my Windows PC and much worse on Arm (Raspberry Pi).
This stuff is highly dependent on architecture and compiler.
I decided that (for me at least) the accuracy improvement didn't seem to be worth the performance cost. But I'm operating in environments where CPU resources are limited. It might make sense in other situations (although then you might want to just use a straight-up tanh...)
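For anyone who wants to repeat this kind of comparison on their own hardware, here is a minimal sketch of a harness that reports both accuracy and speed. The names `approx_tanh`, `max_abs_error` and `ns_per_element` are hypothetical, and the rational approximation is only a stand-in: it is not the NAM fast tanh and not Eigen's tanh.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a cheap tanh (NOT the NAM fast tanh):
// x * (27 + x^2) / (27 + 9 x^2), with the input clamped to [-3, 3],
// where the rational function reaches exactly +/-1.
inline float approx_tanh(float x)
{
  const float c = std::fmin(3.0f, std::fmax(-3.0f, x));
  const float c2 = c * c;
  return c * (27.0f + c2) / (27.0f + 9.0f * c2);
}

// Largest absolute deviation of f from std::tanh on an evenly spaced grid over [lo, hi].
template <typename F>
float max_abs_error(F f, float lo, float hi, int steps)
{
  float worst = 0.0f;
  for (int i = 0; i <= steps; ++i)
  {
    const float x = lo + (hi - lo) * static_cast<float>(i) / static_cast<float>(steps);
    worst = std::fmax(worst, std::fabs(f(x) - std::tanh(x)));
  }
  return worst;
}

// Average nanoseconds per element for out[i] = f(in[i]), repeated `reps` times.
// `out` must be at least as long as `in`.
template <typename F>
double ns_per_element(F f, const std::vector<float>& in, std::vector<float>& out, int reps)
{
  const auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; ++r)
    for (std::size_t i = 0; i < in.size(); ++i)
      out[i] = f(in[i]);
  const auto stop = std::chrono::steady_clock::now();
  const double ns = std::chrono::duration<double, std::nano>(stop - start).count();
  return ns / (static_cast<double>(in.size()) * static_cast<double>(reps));
}
```

Calling `ns_per_element` once with `approx_tanh` and once with a lambda wrapping `std::tanh`, on the same input buffer and with the same compiler flags as the real build, gives a rough per-platform comparison. A micro-benchmark like this is only indicative, since the optimizer may treat the loop differently than it treats the real model code.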
Interesting... I suspect that to get the same perf as the NAM fast tanh, it should itself be written as a vectorized calc (it is a pretty simple function that looks to be easily vectorizable). I'd expect an Eigen version to have the same perf as the code that currently exists, except that it would use AVX/etc. instructions when available and might perform better.