JulioHC00
@CloseChoice Thanks for the paper! I've had a quick look, and it looks like it may be useful. I'm getting mixed results with the `nonlinear_1d` solution, and it just completely...
@CloseChoice I'll have a try at doing that, though I'm still trying to figure out what needs implementing exactly. There's the Taylor expansion of the LayerNorm operation, but what does...
I gave it a try with this

```python
def compute_partial_derivative_matrix_torch(x, alpha, beta, epsilon):
    import torch

    B, N = x.shape
    # Compute the mean and variance along the feature dimension (N)...
```
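For reference, here's a self-contained sketch of what I think the full function looks like — the closed-form per-sample Jacobian of LayerNorm. I've kept the function name and signature from the snippet above, but the body is my own reconstruction, not the actual code from the PR:

```python
import torch

def compute_partial_derivative_matrix_torch(x, alpha, beta, epsilon):
    """Jacobian of y = alpha * (x - mean) / sqrt(var + eps) + beta, per sample.

    x: (B, N), alpha/beta: (N,). Returns (B, N, N) with J[b, i, j] = dy_i / dx_j.
    """
    B, N = x.shape
    mu = x.mean(dim=1, keepdim=True)                  # (B, 1)
    var = x.var(dim=1, unbiased=False, keepdim=True)  # biased variance, as LayerNorm uses
    xc = x - mu                                       # centered inputs, (B, N)
    inv_std = (var + epsilon).rsqrt()                 # 1 / sqrt(var + eps), (B, 1)
    eye = torch.eye(N, dtype=x.dtype, device=x.device)
    # d xhat_i / d x_j = (delta_ij - 1/N - xc_i * xc_j / (N * (var + eps))) / std
    jac = (eye - 1.0 / N
           - xc.unsqueeze(2) * xc.unsqueeze(1)
             * (inv_std ** 2 / N).unsqueeze(2)) * inv_std.unsqueeze(2)
    # y_i = alpha_i * xhat_i + beta_i, so scale row i by alpha_i (beta drops out)
    return alpha.view(1, N, 1) * jac
```

A quick sanity check is that each row of the Jacobian should sum to zero (shifting all inputs by a constant leaves LayerNorm's output unchanged), and it should match `torch.autograd.functional.jacobian` applied to `F.layer_norm`.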
@CloseChoice Will do! If I have time I'll try it later today, if not tomorrow morning and I'll let you know how it goes. Thanks!
@CloseChoice Quick update. It does seem to solve the problem and it works with my example. I'll add it as a test and open a PR. The model at #3881...
I'll try to check these. They definitely don't appear as children, as I checked earlier today. Is the issue then that PyTorch is somehow not managing their gradients in the...
@CloseChoice Sounds good! I've made a PR #3890 for the LayerNorm part (also added Identity as passthrough). Let me know if it looks ok.
I've been doing some tests and it does seem that something like `transpose` will cause issues. I guess `transpose` is just a less flexible `permute`, so it all makes sense.
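For what it's worth, a quick sanity check (my own toy example, not from the repo) showing that `transpose` is just a two-dimension `permute`:

```python
import torch

x = torch.randn(2, 3, 4)
a = x.transpose(1, 2)   # swap dims 1 and 2
b = x.permute(0, 2, 1)  # the same swap written as a full permutation of all dims
assert torch.equal(a, b)
print(a.shape)  # torch.Size([2, 4, 3])
```

So any handling that covers `permute` should cover `transpose` as the special case that swaps exactly two dims.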