jumerckx

Results: 9 comments by jumerckx

More seems to be going on when using Zygote. In this example, Tullio on the GPU beats the CPU version in the forward pass, but the gradient calculation is much slower...
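A minimal sketch of the kind of comparison meant here (the matmul-style kernel and the sizes are my own assumptions, not the example from the thread); it assumes Tullio's GPU path is enabled by having CUDA and KernelAbstractions loaded:

```julia
using Tullio, Zygote, CUDA, KernelAbstractions, BenchmarkTools

# A matmul-style Tullio kernel; any index expression behaves the same way.
mul(A, B) = @tullio C[i, j] := A[i, k] * B[k, j]
loss(A, B) = sum(mul(A, B))

A, B = rand(Float32, 512, 512), rand(Float32, 512, 512)
dA, dB = cu(A), cu(B)

@btime mul($A, $B)                                  # CPU forward
@btime CUDA.@sync mul($dA, $dB)                     # GPU forward: faster
@btime Zygote.gradient($loss, $A, $B)               # CPU gradient
@btime CUDA.@sync Zygote.gradient($loss, $dA, $dB)  # GPU gradient: the slow case
```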

What actually causes the 4x difference between the forward and backward passes? Is it 4x more computation, or is there a large overhead? I don't really know what the variables in Tullio's...
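For what it's worth, a rough accounting of the backward work for a contraction like `C[i,j] := A[i,k] * B[k,j]` (my own reasoning, not from the thread): the reverse pass must fill in gradients for both inputs, each of which is itself a full contraction of the same size as the forward one:

```julia
# Reverse rule for C = A * B: given the upstream gradient dC,
#   dA[i,k] = Σⱼ dC[i,j] * B[k,j]   i.e.  dA = dC * B'
#   dB[k,j] = Σᵢ A[i,k] * dC[i,j]   i.e.  dB = A' * dC
# Two contractions of the same cost as the forward one, so the backward
# pass alone is ~2x the forward FLOPs, before any scheduling overhead.
matmul_pullback(dC, A, B) = (dC * B', A' * dC)
```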

You're completely right, thanks for the explanation.

I've tried to implement [PyTorch's seq2seq translation tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html) (for the model itself I've also looked at [Fast.ai's NMT lesson](https://course.fast.ai/lessons/lesson11.html)). I've written my process down in [this notebook](https://nbviewer.jupyter.org/github/merckxiaan/flux-seq2seq/blob/master/seq2seq%20in%20flux.ipynb#The-Model). The model's performance...
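For readers unfamiliar with the setup, the model in that tutorial is roughly an embedding + GRU encoder feeding a GRU decoder. A bare-bones sketch in Flux (layer sizes hypothetical, attention and teacher forcing omitted, and using the `in => out` constructors of recent Flux versions):

```julia
using Flux

vocab_src, vocab_tgt, hidden = 10_000, 10_000, 256  # hypothetical sizes

encoder = Chain(Flux.Embedding(vocab_src => hidden), GRU(hidden => hidden))
decoder = Chain(Flux.Embedding(vocab_tgt => hidden), GRU(hidden => hidden),
                Dense(hidden => vocab_tgt))

# Both are applied one token at a time; the GRU layers carry their hidden
# state between calls, which the decoder uses to continue the translation.
```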

Thanks for the feedback, I've removed the notebook version. As for the translation quality, I'm afraid I'm unable to fix this, although I'm pretty certain I'm making a silly mistake...

> We could try forcing use of the CUDNN kernel directly as a debugging step, sidestepping the case of contiguous dimensions.

Indeed, just using `_∇softmax!` directly and...

I can't run NVIDIA Nsight profiling on my machine, but I timed all the lines, which confirms it is the kernel evaluation that causes the slowdown.

```julia
# ...
```
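The line-level timings aren't reproduced here, but the forward-vs-backward gap can be shown with a sketch like the following (sizes hypothetical; on the stack discussed in this thread, `softmax` on a `CuArray` goes through the cuDNN wrappers, and older stacks may also need `using NNlibCUDA` for the `CuArray` methods):

```julia
using CUDA, NNlib, Zygote, BenchmarkTools

x = CUDA.rand(Float32, 1000, 1000)

@btime CUDA.@sync softmax($x)                                # forward only
@btime CUDA.@sync Zygote.gradient(x -> sum(softmax(x)), $x)  # forward + backward
```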

By default it's `CUDNN_SOFTMAX_ACCURATE`, but using `CUDA.math_mode!(CUDA.FAST_MATH)` to switch to `CUDNN_SOFTMAX_FAST` doesn't lead to a discernible difference in timing.
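For reference, the switch in question is a global CUDA.jl setting (the constants below are CUDA.jl's; the effect on the softmax algorithm is as described above):

```julia
using CUDA

CUDA.math_mode!(CUDA.FAST_MATH)     # lets the wrappers pick faster algorithms, e.g. CUDNN_SOFTMAX_FAST
# ... rerun the benchmark here ...
CUDA.math_mode!(CUDA.DEFAULT_MATH)  # back to the default, CUDNN_SOFTMAX_ACCURATE
```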

> it's so weird CUDNN is so slow. Anyways, yes, if @jumerckx can compare performance on a few more sizes and results are the same we can remove...
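A sketch of the size sweep being asked for (sizes hypothetical, timing the GPU softmax forward as a stand-in for the kernel under discussion):

```julia
using CUDA, NNlib, BenchmarkTools

for n in (64, 256, 1024, 4096)
    x = CUDA.rand(Float32, n, n)
    t = @belapsed CUDA.@sync softmax($x)   # seconds per call
    println("n = $n: ", round(t * 1e6; digits = 1), " μs")
end
```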