Peter
Is that argument `x` needed? It seems we could just define `within_gradient() = false`, since we don't (or can't?) detect whether the function is differentiated w.r.t. `x`.
I would say `is_deriving(x)`/`is_differentiating(x)` is kind of weird for non-tracker AD. It sounds like you are checking whether the pullback gets a `NoTangent`, treating that as non-differentiable. Actually, that means this...
In fact it looks like Yota is smart enough to do that:

```
julia> Yota.grad(x -> within_gradient(x) ? x^2 : x, 2.0)
(2.0, (ZeroTangent(), 1))
```

I am a little...
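For reference, the argument-free version only needs a single `ChainRulesCore.rrule`, so any rule-based AD (Zygote, Yota, Diffractor) sees `true` inside a gradient call and `false` outside. A minimal sketch of that idea (roughly the pattern, not NNlib's exact code):

```
using ChainRulesCore

# Argument-free check: outside of AD it is just a constant `false`.
within_gradient() = false

# Rule-based ADs pick up this rrule instead of the plain method, so the
# primal result becomes `true` whenever the call is being differentiated.
function ChainRulesCore.rrule(::typeof(within_gradient))
    within_gradient_pullback(_) = (NoTangent(),)
    return true, within_gradient_pullback
end
```

With that, `Zygote.gradient(x -> within_gradient() ? x^2 : x, 2.0)` takes the `x^2` branch while a plain call takes the other one. A tracker-style AD would still need the `x` argument to inspect the tracked value, which is presumably why the argument exists.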
Sorry for barging in, but I'm quite curious about the idea of serving datasets with the package server system. Wouldn't that be too much for the package server to cache? I...
I have [one](https://github.com/chengchingwen/Transformers.jl/tree/master/example/AttentionIsAllYouNeed) in [Transformers.jl](https://github.com/chengchingwen/Transformers.jl)
Sure. I'm also thinking about opening a model zoo for Transformers.jl itself, since there are other models like GPT or BERT.
@ToucheSir Could you try running the layer norm gradient on the GPU? I have tried that manual broadcast fusion before, but `CUDA.@time` said it actually allocated more GPU memory.
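For anyone who wants to reproduce the measurement, a rough sketch of the kind of comparison I mean; the layer norm below is a stand-in written from scratch, not the actual kernel under discussion, and the shapes are arbitrary:

```
using CUDA, Zygote

# Plain layer norm over dims = 1; stand-in for the real implementation.
function layernorm(x; ϵ = 1f-5)
    μ  = sum(x; dims = 1) ./ size(x, 1)
    σ² = sum((x .- μ) .^ 2; dims = 1) ./ size(x, 1)
    return (x .- μ) ./ sqrt.(σ² .+ ϵ)
end

x = CUDA.randn(Float32, 512, 128)

# Warm up once, then let CUDA.@time report GPU time and GPU allocations,
# which is the number to compare between the fused and unfused variants.
Zygote.gradient(x -> sum(layernorm(x)), x)
CUDA.@time Zygote.gradient(x -> sum(layernorm(x)), x)
```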
I think something that needs to be mentioned together with Embedding is the one-hot encoding implementation. The problem for Embedding/OneHotEncoding is maintaining semantics and composability without hurting the performance on...
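To make the trade-off concrete, here is a minimal sketch of the lazy one-hot idea; `OneHotCols` is a made-up name for illustration, not the actual Flux/OneHotArrays type, but the principle is the same: keep the `AbstractMatrix` semantics so generic code composes, and specialize only the operations that matter for speed.

```
# Lazy one-hot columns: semantically a Bool matrix, stored as indices.
struct OneHotCols{T<:Integer} <: AbstractMatrix{Bool}
    indices::Vector{T}
    nlabels::Int
end

Base.size(o::OneHotCols) = (o.nlabels, length(o.indices))
Base.getindex(o::OneHotCols, i::Integer, j::Integer) = o.indices[j] == i

# Performance hook: an embedding lookup becomes a column gather instead of
# a dense matmul against a mostly-zero matrix.
Base.:*(W::AbstractMatrix, o::OneHotCols) = W[:, o.indices]

W = randn(Float32, 8, 100)            # embedding table: 8-dim vectors, 100 labels
x = OneHotCols([3, 7, 7, 42], 100)
W * x                                  # 8×4 gather, no dense one-hot multiply
```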
@CarloLucibello I would like to add Einstein summation and tensor products to the discussion list. They are quite useful in some novel model designs.
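As a concrete example of where they show up, a hedged sketch with OMEinsum.jl (TensorOperations.jl or Tullio.jl would work just as well); the shapes and the einsum spec here are only illustrative:

```
using OMEinsum

Q = randn(Float32, 64, 10, 8)   # (head_dim, query_len, batch)
K = randn(Float32, 64, 12, 8)   # (head_dim, key_len, batch)

# Batched attention scores QᵀK without any permutedims/reshape gymnastics.
scores = ein"dqb,dkb->qkb"(Q, K)
size(scores)                     # (10, 12, 8)
```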
Sounds good to have! HF handles it in the forward method of the hf-models (equiv. `Layers.Transformer`). I'm not sure `Checkpointed` as an `AbstractTransformerBlock` is the best place to add the checkpoint functionality....
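For reference, one way the wrapper could look; `Checkpointed` and `AbstractTransformerBlock` here mirror the names in the discussion, but the code is only a sketch, not Transformers.jl's actual implementation, and it leans on `Zygote.checkpointed`, which recomputes the wrapped call in the pullback instead of storing its intermediates:

```
using Zygote

abstract type AbstractTransformerBlock end

# Wrapper that marks a block for gradient checkpointing.
struct Checkpointed{B} <: AbstractTransformerBlock
    block::B
end

# Plain call outside AD; under Zygote the intermediates of `c.block(x)` are
# thrown away and recomputed during the backward pass.
(c::Checkpointed)(x) = Zygote.checkpointed(c.block, x)

block = Checkpointed(x -> tanh.(x) .+ x)   # stand-in for a transformer block
Zygote.gradient(x -> sum(block(x)), randn(Float32, 4, 3))
```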