jumerckx

Results: 9 comments by jumerckx

More seems to be going on when using Zygote. In this example, Tullio on the GPU beats the CPU version in the forward pass, but the gradient calculation is much slower...
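A minimal sketch of the kind of comparison meant here (the matmul-style kernel and the sizes are my own assumptions, not the example from the thread); it assumes Tullio's GPU path is enabled by having CUDA and KernelAbstractions loaded:

```julia
using Tullio, Zygote, CUDA, KernelAbstractions, BenchmarkTools

# A matmul-style Tullio kernel; any index expression behaves the same way.
mul(A, B) = @tullio C[i, j] := A[i, k] * B[k, j]
loss(A, B) = sum(mul(A, B))

A, B = rand(Float32, 512, 512), rand(Float32, 512, 512)
dA, dB = cu(A), cu(B)

@btime mul($A, $B)                                  # CPU forward
@btime CUDA.@sync mul($dA, $dB)                     # GPU forward: faster
@btime Zygote.gradient($loss, $A, $B)               # CPU gradient
@btime CUDA.@sync Zygote.gradient($loss, $dA, $dB)  # GPU gradient: the slow case
```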

What actually causes the 4x difference between the forward and backward passes? Is it 4x more computation, or is there a large overhead? I don't really know what the variables in Tullio's...
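For what it's worth, a rough accounting of the backward work for a contraction like `C[i,j] := A[i,k] * B[k,j]` (my own reasoning, not from the thread): the reverse pass must fill in gradients for both inputs, each of which is itself a full contraction of the same size as the forward one:

```julia
# Reverse rule for C = A * B: given the upstream gradient dC,
#   dA[i,k] = Σⱼ dC[i,j] * B[k,j]   i.e.  dA = dC * B'
#   dB[k,j] = Σᵢ A[i,k] * dC[i,j]   i.e.  dB = A' * dC
# Two contractions of the same cost as the forward one, so the backward
# pass alone is ~2x the forward FLOPs, before any scheduling overhead.
matmul_pullback(dC, A, B) = (dC * B', A' * dC)
```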

You're completely right, thanks for the explanation.

I've tried to implement [PyTorch's seq2seq translation tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html) (for the model itself I've also looked at [Fast.ai's NMT lesson](https://course.fast.ai/lessons/lesson11.html)). I've written my process down in [this notebook](https://nbviewer.jupyter.org/github/merckxiaan/flux-seq2seq/blob/master/seq2seq%20in%20flux.ipynb#The-Model). The model's performance...
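For readers unfamiliar with the setup, the model in that tutorial is roughly an embedding + GRU encoder feeding a GRU decoder. A bare-bones sketch in Flux (layer sizes hypothetical, attention and teacher forcing omitted, and using the `in => out` constructors of recent Flux versions):

```julia
using Flux

vocab_src, vocab_tgt, hidden = 10_000, 10_000, 256  # hypothetical sizes

encoder = Chain(Flux.Embedding(vocab_src => hidden), GRU(hidden => hidden))
decoder = Chain(Flux.Embedding(vocab_tgt => hidden), GRU(hidden => hidden),
                Dense(hidden => vocab_tgt))

# Both are applied one token at a time; the GRU layers carry their hidden
# state between calls, which the decoder uses to continue the translation.
```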

Thanks for the feedback, I've removed the notebook version. As for the translation quality, I'm afraid I'm unable to fix this, although I'm pretty certain I'm making a silly mistake...

> We could try forcing use of the CUDNN kernel directly as a debugging step, sidestepping the case of contiguous dimensions.

Indeed, just using `_∇softmax!` directly and...

I can't run NVIDIA Nsight profiling on my machine, but I timed all the lines, which confirms it is the kernel evaluation that causes the slowdown.

```julia
# ...
```
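The line-level timings aren't reproduced here, but the forward-vs-backward gap can be shown with a sketch like the following (sizes hypothetical; on the stack discussed in this thread, `softmax` on a `CuArray` goes through the cuDNN wrappers, and older stacks may also need `using NNlibCUDA` for the `CuArray` methods):

```julia
using CUDA, NNlib, Zygote, BenchmarkTools

x = CUDA.rand(Float32, 1000, 1000)

@btime CUDA.@sync softmax($x)                                # forward only
@btime CUDA.@sync Zygote.gradient(x -> sum(softmax(x)), $x)  # forward + backward
```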

By default it's `CUDNN_SOFTMAX_ACCURATE`, but using `CUDA.math_mode!(CUDA.FAST_MATH)` to switch to `CUDNN_SOFTMAX_FAST` doesn't lead to a discernible difference in timing.
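For reference, the switch in question is a global CUDA.jl setting (the constants below are CUDA.jl's; the effect on the softmax algorithm is as described above):

```julia
using CUDA

CUDA.math_mode!(CUDA.FAST_MATH)     # lets the wrappers pick faster algorithms, e.g. CUDNN_SOFTMAX_FAST
# ... rerun the benchmark here ...
CUDA.math_mode!(CUDA.DEFAULT_MATH)  # back to the default, CUDNN_SOFTMAX_ACCURATE
```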

> it's so weird CUDNN is so slow. Anyways, yes, if @jumerckx can compare performance on a few more sizes and results are the same we can remove...
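A sketch of the size sweep being asked for (sizes hypothetical, timing the GPU softmax forward as a stand-in for the kernel under discussion):

```julia
using CUDA, NNlib, BenchmarkTools

for n in (64, 256, 1024, 4096)
    x = CUDA.rand(Float32, n, n)
    t = @belapsed CUDA.@sync softmax($x)   # seconds per call
    println("n = $n: ", round(t * 1e6; digits = 1), " μs")
end
```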