LovaszSoftmax
Very slow inference in tensorflow
Before I use your loss function: 2.5 sec/step. After I use your loss function: 32.0 sec/step.
I use TensorFlow 1.6.0.
That slowdown seems quite drastic; do you have, e.g., many categories or images to evaluate in one step? I suspect that a dedicated CUDA kernel would speed up the implementation a lot (better than the large succession of masking/selecting/sort operations). However, I don't plan to tackle this in the near future; contributions in this direction are welcome.
I just use it on the Cityscapes dataset, with distributed training; the model is DeepLab v3+.
I did some profiling. It seems that in TensorFlow the tf.cumsum operation is extremely slow on GPU, taking a huge amount of time (~99% of the total). In PyTorch, as expected, the sort operation is the one that takes the most time; cumsum is virtually instant on GPU. I will investigate a bit more; it might warrant an issue report against TensorFlow.
This Python notebook summarizes the problem: tensorflow/profile_ops.ipynb. The cumsum operation is ~4000x slower in TensorFlow than in PyTorch for a typical number of pixels per batch.
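For reference, a rough micro-benchmark in the spirit of that notebook (my own sketch, not the notebook itself; the pixel count and exact timings are assumptions that will vary with GPU and framework version):

```python
import time
import numpy as np
import tensorflow as tf
import torch

N = 2 * 513 * 513                      # roughly the pixels in a small Cityscapes batch
x_np = np.random.rand(N).astype(np.float32)

# TensorFlow (1.x graph mode)
x_tf = tf.constant(x_np)
y_tf = tf.cumsum(x_tf)
with tf.Session() as sess:
    sess.run(y_tf)                     # warm-up run
    t0 = time.time()
    sess.run(y_tf)
    print("tf.cumsum:    %.4f s" % (time.time() - t0))

# PyTorch
x_pt = torch.from_numpy(x_np).cuda()
torch.cumsum(x_pt, dim=0)              # warm-up run
torch.cuda.synchronize()
t0 = time.time()
torch.cumsum(x_pt, dim=0)
torch.cuda.synchronize()               # include kernel completion in the timing
print("torch.cumsum: %.4f s" % (time.time() - t0))
```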
https://github.com/tensorflow/tensorflow/issues/813 mentions that the current implementation of cumsum in Eigen (used by TensorFlow) is naïve. A solution would be to write a custom CUDA op for this operation. Pointers for implementing cumsum on GPU are given at https://stackoverflow.com/a/25251434/805502.
After looking more into it, it seems the easiest way is to create a custom TensorFlow op that uses CUB's exclusive sum instead of the native tf.cumsum operation. Note that there are already operators defined in TensorFlow using CUB, e.g. TopK, so it shouldn't be too difficult to implement. I will not implement this for now as I'm mainly using PyTorch; I might do it one day, but in the meantime I'll tag this as contributions welcome.
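For anyone picking this up: once such an op is compiled into a shared library (the GPU kernel could delegate to cub::DeviceScan::ExclusiveSum), calling it from Python is straightforward. A hypothetical sketch, where cumsum_cub.so and the CumsumCub op are names I made up for illustration:

```python
import tensorflow as tf

# Hypothetical custom-op library: the .so would be built with nvcc against
# the TF headers and register a "CumsumCub" op backed by CUB's device scan.
cumsum_module = tf.load_op_library('./cumsum_cub.so')

x = tf.random_uniform([2 * 513 * 513])
y = cumsum_module.cumsum_cub(x)        # drop-in replacement for tf.cumsum(x)

with tf.Session() as sess:
    print(sess.run(y)[:5])
```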
The speed of cumsum has been improved significantly; I'm going to close this. Feel free to re-open if you feel it still isn't fast enough.
@ekelsen Which version of TensorFlow has these improvements?
Currently just HEAD: https://github.com/tensorflow/tensorflow/commit/73e3215c3a2edadbf9111cca44ab3d5ca146c327
Thanks for the pointer @ekelsen. Closing this issue.
@ekelsen Hello, I'm using Keras (backend: TensorFlow 1.12) with CUDA 9.0, but training is still slow with this loss function. Can you give me any advice? My GPU is a GTX 1080 Ti.
@stillwaterman I expect the build of TensorFlow you are using was made before the changes to cumsum were implemented. Building TensorFlow from source might be a reasonable option to expedite training.
@jianlong-yuan Hi, I want to use the Lovász-Softmax loss in DeepLab v3+ but failed. Could you give me some references or a demo? Thanks.
@Z-Ianthe To solve the above problems, I put an implementation here: https://github.com/jianlong-yuan/LovaszSoftmax_tf/tree/master
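In case it helps others, here is a minimal sketch of wiring the loss into a TF 1.x segmentation graph. It assumes the lovasz_softmax signature from tensorflow/lovasz_losses_tf.py in this repo (softmax probabilities in BHWC order, integer labels, an optional ignore id); the shapes and optimizer are placeholders I chose for a DeepLab-style setup on Cityscapes, not part of the repo:

```python
import tensorflow as tf
from lovasz_losses_tf import lovasz_softmax  # from this repo's tensorflow/ folder

# Placeholder shapes for a DeepLab-style model on Cityscapes (19 classes);
# in practice `logits` comes from your segmentation head.
logits = tf.placeholder(tf.float32, [None, 513, 513, 19])
labels = tf.placeholder(tf.int32, [None, 513, 513])

probas = tf.nn.softmax(logits)                     # the loss expects probabilities
loss = lovasz_softmax(probas, labels, ignore=255)  # 255 = Cityscapes void label
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)
```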
I don't have time to investigate the TensorFlow issues for now, but I am at least reopening the issue.