Why is ClPy slower than CuPy, even on the same machine?
According to the performance report, ClPy is slower than CuPy on both TITAN V and GeForce 1060.
We need to know the reason (and improve if we can).
#163 reported clWaitForEvents was one of the problems.
This article (in Japanese) could help us: https://qiita.com/shu65/items/42914bd2ad01d1e323da
@vorj I hear you're trying this.
@LWisteria I ran Chainer's (3.3.0) example code train_mnist.py and profiled trainer.run() with cProfile. All I could tell from the result was that the function spends most of its time waiting for I/O completion. Is there any way to further track down the bottleneck?
https://github.com/chainer/chainer/blob/a01c33f515ec23d33eab6e2eff6aa5d4e99f414d/examples/mnist/train_mnist.py#L122
(1 epoch, 500 iterations)
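For reference, here is a minimal sketch of how such a profile can be collected; the exact invocation I used is not what matters here, and this assumes the snippet replaces the original trainer.run() call in train_mnist.py:

```python
import cProfile
import pstats

# "trainer" is the chainer.training.Trainer built earlier in train_mnist.py (see the link above).
# Dump the raw profile to a file, then print the 10 functions with the largest self time.
cProfile.run("trainer.run()", "train_mnist.pstats")
pstats.Stats("train_mnist.pstats").sort_stats("tottime").print_stats(10)
```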
ncalls tottime percall cumtime percall filename:lineno(function)
145 0.621 0.004 0.621 0.004 {method 'poll' of 'select.poll' objects}
199 0.238 0.001 0.238 0.001 {built-in method posix.read}
27 0.220 0.008 0.220 0.008 {built-in method _posixsubprocess.fork_exec}
4000 0.214 0.000 0.214 0.000 {method 'dot' of 'clpy.core.core.ndarray' objects}
3000 0.165 0.000 0.244 0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
2000 0.157 0.000 0.247 0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
3500 0.140 0.000 0.356 0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/core/fusion.py:714(__call__)
2500 0.117 0.000 0.263 0.000 {method 'sum' of 'clpy.core.core.ndarray' objects}
8500 0.090 0.000 1.962 0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
3 0.080 0.027 0.080 0.027 {method 'normal' of 'mtrand.RandomState' objects}
Environment:
- OS: Linux version 4.15.0-54-generic (buildd@lgw01-amd64-039) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10))
- GPU: NVIDIA Corporation GV100 [TITAN V]
- Python: 3.6.5
The result above shows the 10 functions with the longest total time (tottime) and does not cover all the functions called.
It seems we cannot use nvvp's OpenCL profiling functions in our NVIDIA GPU environments, at least. A post on the NVIDIA forum states they no longer support OpenCL profiling in versions >= 8. https://devtalk.nvidia.com/default/topic/1030699/visual-profiler-and-nvprof/opencl-profiling-using-nvvp-command-line-profiler/post/5245061/#5245061 (@vorj pointed this out. Thanks.)
@ybsh I found 2 profilers for AMD GPUs. I haven't tried them yet, but they seem to support OpenCL kernel profiling. If you don't need to profile on NVIDIA GPUs, you can try them out.
Radeon GPU Profiler
Radeon Compute Profiler
@vorj @ybsh We need an NVIDIA GPU because we need to know why ClPy is slower than CuPy.
If we have no way to profile on NVIDIA GPUs, you will have to use other profiling tools for Python and Cython themselves.
Thank you again for your help @vorj. I bumped into two "performance tools" linked from NVIDIA's website. Do these look like what we are looking for? Both seem to "support NVIDIA GPUs and OpenCL":
- TAU
- Vampir
Since I'm new both to graphics and GPU profilers, I'd really appreciate your opinions. @LWisteria @vorj
I'm not soooo good at OpenCL profiling itself. Why don't you just try it?
Before going into graphics/GPGPU tools, I profiled train_mnist.py for CuPy/ClPy (again on titanv, with cProfile), this time with the same number of epochs and iterations as the performance report.
CuPy
- total execution time
real 52.27
user 51.63
sys 1.81
- top 20 by tottime
ncalls tottime percall cumtime percall filename:lineno(function)
72000 2.432 0.000 2.931 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
80 2.189 0.027 2.189 0.027 {built-in method matplotlib._png.write_png}
218000 2.171 0.000 20.920 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
70006 2.156 0.000 2.298 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/cupy/core/fusion.py:693(__call__)
102000 2.138 0.000 2.138 0.000 {method 'dot' of 'cupy.core.core.ndarray' objects}
52080 2.047 0.000 2.232 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
50000 1.558 0.000 1.614 0.000 {method 'sum' of 'cupy.core.core.ndarray' objects}
42003 1.335 0.000 2.295 0.000 {built-in method cupy.core.core.array}
12000 1.305 0.000 11.798 0.001 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:891(_backward_main)
33340 1.226 0.000 1.226 0.000 {built-in method numpy.concatenate}
14000 1.086 0.000 4.311 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:111(forward_gpu)
42000 1.008 0.000 2.259 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/connection/linear.py:30(forward)
24000 0.832 0.000 0.833 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/activation/relu.py:76(forward_gpu)
372001 0.822 0.000 2.815 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:447(__init__)
14000 0.783 0.000 0.783 0.000 {built-in method cupy.cudnn.get_handle}
12000 0.637 0.000 0.683 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:206(backward_gpu)
253111 0.625 0.000 0.864 0.000 {built-in method numpy.array}
399999 0.596 0.000 1.090 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:160(__init__)
4550436/4550431 0.582 0.000 0.768 0.000 {built-in method builtins.isinstance}
1202000 0.581 0.000 1.295 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/datasets/tuple_dataset.py:41(__getitem__)
ClPy
- total execution time
real 77.34
user 75.85
sys 2.42
- top 20 by tottime
ncalls tottime percall cumtime percall filename:lineno(function)
102000 4.765 0.000 4.765 0.000 {method 'dot' of 'clpy.core.core.ndarray' objects}
72000 4.505 0.000 9.484 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
52080 3.873 0.000 4.493 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
98000 3.491 0.000 5.770 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/core/fusion.py:714(__call__)
64000 2.670 0.000 4.281 0.000 {method 'sum' of 'clpy.core.core.ndarray' objects}
752502 2.434 0.000 4.534 0.000 {built-in method numpy.array}
999976/983678 2.380 0.000 3.336 0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
80 2.179 0.027 2.179 0.027 {built-in method matplotlib._png.write_png}
218000 2.168 0.000 34.324 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
582836 2.100 0.000 2.100 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/numpy/core/_internal.py:262(__array_interface__)
42003 1.684 0.000 2.333 0.000 {built-in method clpy.core.core.array}
14000 1.488 0.000 9.758 0.001 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:111(forward_gpu)
42000 1.446 0.000 3.957 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/connection/linear.py:30(forward)
12000 1.390 0.000 17.243 0.001 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/variable.py:891(_backward_main)
582836 1.303 0.000 8.662 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/numpy/core/_internal.py:274(_get_void_ptr)
582836 1.218 0.000 1.218 0.000 (HOMEDIR)/.pyenv/versions/3.6.5/lib/python3.6/ctypes/__init__.py:484(cast)
582836 1.105 0.000 1.105 0.000 {method 'from_buffer' of '_ctypes.PyCArrayType' objects}
12000 1.083 0.000 2.061 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:206(backward_gpu)
24000 1.021 0.000 1.070 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/activation/relu.py:76(forward_gpu)
372001 0.881 0.000 2.972 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/variable.py:447(__init__)
2 observations:
- The main 'bottleneck' I reported in the previous result turns out to be less significant at larger workload sizes.
- The two profiles share many common codepaths, which are overall more performant in CuPy. We cannot ascribe the performance gap to just a few codepaths.
I think I should try samples with more conspicuous performance gaps, such as word2vec (CuPy: 28 s, ClPy: 192 s).
@LWisteria I ran Chainer's train_word2vec.py on titanv with the default epochs/iterations (i.e. the same as the performance report), but the execution times were much longer than the reported figures even without cProfile enabled. Do you think I have misconfigured/misinterpreted something?
(CuPy [s] / ClPy [s])
- Reported: 27.84 / 192.12
- My try: 94.62 / 503.71
@ybsh I don't know why, but I don't think you need to know the reason to solve this issue. Do you?
@LWisteria Maybe not right now, but maybe yes in the long run. I thought it would be good to find hidden configuration differences (if there are any) to ensure reproducibility. At least the benchmark itself is working, so I am currently working on tracking down the bottleneck.
@ybsh You have now confirmed the fact that "ClPy is slower than CuPy", even with configs/settings other than those of the last report. So please keep going with your configs/settings!
Profiled train_word2vec.py with cProfile on titanv, epoch = 1.
CuPy
Execution time: 5.791 s
- top 10 by tottime
ncalls tottime percall cumtime percall filename:lineno(function)
1004 4.133 0.004 4.156 0.004 {method 'get' of 'cupy.core.core.ndarray' objects}
1004 0.086 0.000 4.297 0.004 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/links/loss/hierarchical_softmax.py:166(forward_gpu)
1 0.085 0.085 0.085 0.085 {built-in method cupy.cuda.nvrtc.compileProgram}
6880 0.078 0.000 4.886 0.001 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
1860 0.078 0.000 0.096 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
2008 0.075 0.000 0.142 0.000 {built-in method cupy.core.core.array}
1004 0.068 0.000 0.167 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/connection/embed_id.py:27(forward)
1005 0.051 0.000 0.063 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
930 0.048 0.000 0.364 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:891(_backward_main)
2008 0.041 0.000 0.041 0.000 {method 'take' of 'numpy.ndarray' objects}
ClPy
Execution time: 25.878 s
- top 10 by tottime
ncalls tottime percall cumtime percall filename:lineno(function)
2008 21.344 0.011 21.388 0.011 {built-in method clpy.core.core.array}
1004 1.696 0.002 1.711 0.002 {method 'get' of 'clpy.core.core.ndarray' objects}
74 0.363 0.005 0.363 0.005 {method 'poll' of 'select.poll' objects}
3724 0.185 0.000 0.360 0.000 {method 'fill' of 'clpy.core.core.ndarray' objects}
1004 0.140 0.000 2.106 0.002 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/links/loss/hierarchical_softmax.py:166(forward_gpu)
1860 0.133 0.000 0.297 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
1005 0.100 0.000 0.230 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
102 0.097 0.001 0.097 0.001 {built-in method posix.read}
1004 0.096 0.000 0.191 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/connection/embed_id.py:27(forward)
25013 0.092 0.000 0.172 0.000 {built-in method numpy.array}
23005 0.080 0.000 0.080 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/numpy/core/_internal.py:262(__array_interface__)
This shows a great performance difference in {built-in method clpy.core.core.array}. Where is this called from?
- top 10 by cumtime (runtime including callees)
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.011 0.011 25.878 25.878 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/trainer.py:258(run)
930 0.008 0.000 25.020 0.027 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:213(update)
930 0.004 0.000 25.012 0.027 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:226(update_core)
1004 0.002 0.000 21.411 0.021 <string>:171(convert)
2008 0.006 0.000 21.409 0.011 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:235(to_gpu)
2008 0.006 0.000 21.397 0.011 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:278(_array_to_gpu)
2008 0.001 0.000 21.389 0.011 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/creation/from_data.py:44(asarray)
2008 21.344 0.011 21.388 0.011 {built-in method clpy.core.core.array}
930 0.016 0.000 3.837 0.004 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizer.py:519(update)
6880 0.076 0.000 2.917 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
The top 8 rows show the effective call stack.
> Where is this called from?

You can use graphviz to make a call tree from cProfile results.
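For example, a minimal sketch using the third-party gprof2dot tool (gprof2dot is just one common way to do this, not something used elsewhere in this thread):

```python
import cProfile

# Dump the cProfile result to a pstats file ("trainer" comes from train_word2vec.py):
cProfile.run("trainer.run()", "train_word2vec.pstats")

# Then render a call tree with graphviz (shell commands):
#   gprof2dot -f pstats train_word2vec.pstats | dot -Tpng -o callgraph.png
```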
Code location: https://github.com/fixstars/clpy/blob/3a97570b87ad57e85e837678c863b62cda6e4513/clpy/core/core.pyx#L2153 Now I'm moving on to Cython-level profiling.
Cython-level profiling showed that copy_from_host() is where train_word2vec.py spends most of its time in ClPy.
Again the profiler output, this time with Cython profiling enabled, sorted by cumulative execution time (cumtime):
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.009 0.009 10.067 10.067 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/trainer.py:258(run)
930 0.004 0.000 9.600 0.010 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:213(update)
930 0.004 0.000 9.587 0.010 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:226(update_core)
1004 0.001 0.000 6.895 0.007 <string>:171(convert)
2008 0.005 0.000 6.893 0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:235(to_gpu)
2008 0.006 0.000 6.882 0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:278(_array_to_gpu)
2008 0.001 0.000 6.875 0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/creation/from_data.py:44(asarray)
2008 0.001 0.000 6.874 0.003 {built-in method clpy.core.core.array}
2008 0.001 0.000 6.872 0.003 clpy/core/core.pyx:2154(array (wrapper))
2008 0.008 0.000 6.871 0.003 clpy/core/core.pyx:2154(array)
2008 0.013 0.000 6.840 0.003 clpy/core/core.pyx:1734(set)
2008 6.780 0.003 6.781 0.003 clpy/backend/memory.pyx:354(copy_from_host)
930 0.012 0.000 2.609 0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizer.py:519(update)
to_gpu() is called 2008 times, and in every call copy_from_host() is called once.
As you can see in the previous report, CuPy's call count of {built-in method cupy.core.core.array} is also 2008.
Side note: the total execution time is around 10 s, which is about half as long as in the previous report, and I have no idea why this is happening...
How I observed this
Add this line
# cython: profile=True
to the head of
- clpy/backend/memory.pyx (location of copy_from_host())
- clpy/core/core.pyx (location of array())
and rebuild ClPy with python setup.py install.
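As an aside, the same directive can also be enabled from setup.py via Cython's compiler_directives; this is only a sketch of that alternative, not what I actually did (ClPy's real setup.py is more involved):

```python
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="clpy",  # illustrative only
    ext_modules=cythonize(
        ["clpy/backend/memory.pyx", "clpy/core/core.pyx"],
        compiler_directives={"profile": True},  # same effect as "# cython: profile=True"
    ),
)
```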
copy_from_host() could be reporting processing time that includes other operations. As you know from OpenCL, clEnqueueWrite/ReadBuffer() waits for all tasks in the CommandQueue to finish. You need to separate such waiting time from the transfer time itself.
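To illustrate the idea only: the standalone sketch below uses pyopencl (which ClPy does not use) and just shows how draining the queue first separates the waiting time from the transfer time.

```python
import time
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
host = np.random.rand(1 << 20).astype(np.float32)
dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=host.nbytes)

t0 = time.perf_counter()
queue.finish()                     # wait for previously enqueued work (the "waiting" part)
t1 = time.perf_counter()
cl.enqueue_copy(queue, dev, host)  # the host-to-device transfer itself
queue.finish()
t2 = time.perf_counter()
print(f"wait for earlier work: {t1 - t0:.6f} s, transfer: {t2 - t1:.6f} s")
```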
Thank you. Perhaps it is time to turn to OpenCL tracing tools, because it seems very hard to find out what is happening inside only with Cython profiling tools...
You can insert clFinish() explicitly before enqueuing the read/write.
@LWisteria OK, I will try that and see if the transfers are taking long or not.
Added clFinish() before the clEnqueueWriteBuffer call in memory.pyx:354 (copy_from_host).
As @LWisteria suggested might be the case, clEnqueueWriteBuffer was almost idle, waiting for previous tasks 😇
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.009 0.009 9.941 9.941 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/trainer.py:258(run)
930 0.004 0.000 9.447 0.010 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:213(update)
930 0.004 0.000 9.435 0.010 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:226(update_core)
1004 0.002 0.000 6.740 0.007 <string>:171(convert)
2008 0.005 0.000 6.738 0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:235(to_gpu)
2008 0.005 0.000 6.727 0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:278(_array_to_gpu)
2008 0.001 0.000 6.720 0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/creation/from_data.py:44(asarray)
2008 0.001 0.000 6.719 0.003 {built-in method clpy.core.core.array}
2008 0.001 0.000 6.718 0.003 clpy/core/core.pyx:2154(array (wrapper))
2008 0.008 0.000 6.717 0.003 clpy/core/core.pyx:2154(array)
2008 0.012 0.000 6.685 0.003 clpy/core/core.pyx:1734(set)
2008 0.005 0.000 6.628 0.003 clpy/backend/memory.pyx:354(copy_from_host)
2008 6.598 0.003 6.598 0.003 clpy/backend/opencl/api.pyx:306(Finish)
930 0.012 0.000 2.610 0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizer.py:519(update)
Now that the data transfers proved not to be the culprit, I'll try OpenCL tracing tools.
Can't you get the function call order? The call order alone is enough to check which commands were enqueued and which consume time.
I'm trying Python's trace module.
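Something along these lines (the exact options are an assumption):

```python
import trace

# Log each call/line as it executes, without collecting counts.
tracer = trace.Trace(count=False, trace=True)
tracer.run("trainer.run()")  # "trainer" comes from train_word2vec.py

# Equivalent from the command line:
#   python -m trace --trace train_word2vec.py
```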
As I mentioned above, I tried Python's trace module with all the options configured except the file-I/O-related ones, but there seems to be no effective way to list the order of ClPy-level function calls (it does not even capture anything inside functions defined with Cython).
I'm looking for another tool that is handier than GUI tools...