
Very slow inference in TensorFlow

jianlong-yuan opened this issue 6 years ago · 14 comments

Before I use your loss function: 2.5 sec/step. After I use your loss function: 32.0 sec/step.

I use TensorFlow 1.6.0.

jianlong-yuan avatar May 17 '18 05:05 jianlong-yuan

That slowdown seems quite drastic; do you have, e.g., many categories or images to evaluate in one step? I suspect a dedicated CUDA kernel would speed up the implementation a lot (better than the long succession of masking/selecting/sort operations). However, I don't plan to tackle this in the near future; contributions in this direction are welcome.

bermanmaxim avatar May 17 '18 08:05 bermanmaxim
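For context on where the time goes: per class, the loss sorts the errors over all pixels and then runs two cumulative sums. Below is a simplified sketch paraphrasing the TensorFlow code in this repo (see lovasz_losses_tf.py for the original; details may differ):

```python
# Simplified sketch of the Lovasz gradient, paraphrasing the TensorFlow code
# in this repo (lovasz_losses_tf.py); details may differ from the original.
import tensorflow as tf

def lovasz_grad(gt_sorted):
    """Gradient of the Lovasz extension w.r.t. sorted errors (Alg. 1 in the paper)."""
    gts = tf.reduce_sum(gt_sorted)
    intersection = gts - tf.cumsum(gt_sorted)   # first cumsum over all pixels
    union = gts + tf.cumsum(1. - gt_sorted)     # second cumsum over all pixels
    jaccard = 1. - intersection / union
    return tf.concat((jaccard[0:1], jaccard[1:] - jaccard[:-1]), 0)

# Upstream of this, the per-class errors are sorted in descending order, e.g.
#   errors_sorted, perm = tf.nn.top_k(errors, k=tf.shape(errors)[0])
# so every training step pays for a large sort plus the two cumsums above.
```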

I just use it on the Cityscapes dataset, with distributed training. The model is DeepLab v3+.

jianlong-yuan avatar May 17 '18 13:05 jianlong-yuan

I did some profiling. It seems that in TensorFlow the tf.cumsum operation is extremely slow on GPU and accounts for a huge share of the time (~99% of the total).

In PyTorch, as expected, the sort operation is the one that takes the most time; cumsum is virtually instant on GPU.

I will investigate a bit more; it might warrant an issue report for TensorFlow.

bermanmaxim avatar May 19 '18 15:05 bermanmaxim
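A micro-benchmark along these lines can reproduce the comparison. The following is an illustrative sketch (TF 1.x graph mode; the tensor size is a placeholder and absolute numbers depend on hardware and library versions):

```python
# Illustrative micro-benchmark sketch (TF 1.x graph mode). The tensor size n
# roughly stands in for the number of pixels in a batch; absolute timings
# depend on the GPU and library versions.
import time

import tensorflow as tf
import torch

n = 2 ** 21

# TensorFlow: build the graph once, then time repeated session runs.
x_tf = tf.random_uniform([n])
y_tf = tf.cumsum(x_tf)
config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))
with tf.Session(config=config) as sess:
    sess.run(y_tf)  # warm-up
    t0 = time.time()
    for _ in range(50):
        sess.run(y_tf)
    print("tf.cumsum:    %.5f s/call" % ((time.time() - t0) / 50))

# PyTorch: synchronize around the timed region so GPU work is counted.
x_pt = torch.rand(n).cuda()
torch.cumsum(x_pt, dim=0)
torch.cuda.synchronize()  # warm-up
t0 = time.time()
for _ in range(50):
    torch.cumsum(x_pt, dim=0)
torch.cuda.synchronize()
print("torch.cumsum: %.5f s/call" % ((time.time() - t0) / 50))
```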

This Python notebook summarizes the problem: tensorflow/profile_ops.ipynb. The cumsum operation is ~4000x slower in TensorFlow vs. PyTorch for a typical number of pixels per batch. https://github.com/tensorflow/tensorflow/issues/813 mentions that the current implementation of cumsum in Eigen (used by TensorFlow) is naïve. A solution would be to write a custom CUDA op for this operation. Pointers for cumsum on GPU are given at https://stackoverflow.com/a/25251434/805502.

bermanmaxim avatar May 19 '18 18:05 bermanmaxim
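For reference, since the next comment proposes CUB's exclusive sum: tf.cumsum already exposes both scan semantics through a flag, so a custom kernel would only change the backend, not the interface:

```python
import tensorflow as tf

x = tf.constant([1., 2., 3., 4.])
inc = tf.cumsum(x)                  # inclusive scan: [1., 3., 6., 10.]
exc = tf.cumsum(x, exclusive=True)  # exclusive scan: [0., 1., 3., 6.]
```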

After looking into it more, it seems the easiest way is to create a custom TensorFlow op using CUB's exclusive sum instead of the native tf.cumsum operation. Note that there are already operators in TensorFlow built on CUB, e.g. TopK, so it shouldn't be too difficult to implement this.

I will not implement this for now since I'm mainly using PyTorch. I might do it one day, but in the meantime I'll tag this as contributions welcome.

bermanmaxim avatar May 24 '18 20:05 bermanmaxim
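If someone contributes such a kernel, the Python side could stay very small. A hypothetical sketch (the shared-library name and op name below are invented for illustration):

```python
# Hypothetical Python wrapper for a CUB-backed scan kernel. The shared-library
# name and op name are invented placeholders, not artifacts of this repo.
import tensorflow as tf

scan_module = tf.load_op_library('./cumsum_cub.so')  # compiled CUDA/CUB kernel

def fast_cumsum(x):
    # Drop-in replacement for tf.cumsum on 1-D float tensors (assumed op name).
    return scan_module.cumsum_cub(x)
```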

The speed of cumsum has been improved significantly; I'm going to close this. Feel free to re-open if you feel it still isn't fast enough.

ekelsen avatar Nov 17 '18 05:11 ekelsen

@ekelsen Which version of TensorFlow has these improvements?

ben2789 avatar Nov 18 '18 20:11 ben2789

Currently just HEAD: https://github.com/tensorflow/tensorflow/commit/73e3215c3a2edadbf9111cca44ab3d5ca146c327

ekelsen avatar Nov 18 '18 21:11 ekelsen

Thanks for the pointer @ekelsen. Closing this issue.

bermanmaxim avatar Nov 19 '18 12:11 bermanmaxim

@ekelsen Hello, I'm using Keras (TensorFlow 1.12 backend) and CUDA 9.0, but training is still slow with this loss function. Can you give me any advice? My GPU is a GTX 1080 Ti.

stillwaterman avatar Jan 15 '19 10:01 stillwaterman

@stillwaterman I expect the build of TensorFlow you are using was made before the changes to cumsum were implemented. Building TensorFlow from source might be a reasonable option to speed up training.

ben2789 avatar Jan 15 '19 19:01 ben2789
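Two practical checks in this situation (a sketch; the CPU-pinning workaround is an untested assumption, not something verified in this thread):

```python
import tensorflow as tf

# 1) Check which build is installed; the cumsum fix (Nov 2018) likely
#    postdates the 1.12 release, so a 1.12 wheel would not include it.
print(tf.__version__, tf.__git_version__)

# 2) Untested stopgap: pin the cumsum to the CPU to bypass the slow GPU
#    kernel; whether this wins overall depends on host-device transfer costs.
def cumsum_on_cpu(x, axis=0):
    with tf.device('/cpu:0'):
        return tf.cumsum(x, axis=axis)
```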

@jianlong-yuan Hi! I want to use the Lovász-Softmax loss in DeepLab v3+ but failed. Could you give me some reference or demos? Thanks.

Z-Ianthe avatar Mar 25 '19 02:03 Z-Ianthe

@Z-Ianthe To solve the above problems, I put my code here: https://github.com/jianlong-yuan/LovaszSoftmax_tf/tree/master

jianlong-yuan avatar Mar 25 '19 10:03 jianlong-yuan

I don't have time to investigate TensorFlow issues for now, but I am at least reopening the issue.

bermanmaxim avatar Apr 09 '19 13:04 bermanmaxim