Why is ClPy slower than CuPy, even on the same machine?
According to the performance report, ClPy is slower than CuPy on both TITAN V and GeForce 1060.
We need to know the reason (and improve if we can).
#163 reported clWaitForEvents was one of the problems.
This article (in Japanese) could help us: https://qiita.com/shu65/items/42914bd2ad01d1e323da
@vorj I hear you're trying this.
@LWisteria I ran Chainer's (3.3.0) example code train_mnist.py and profiled trainer.run() with cProfile. All I could tell from the result was that the function spends most of its time waiting for I/O completion. Is there any way to further track down the bottleneck?
https://github.com/chainer/chainer/blob/a01c33f515ec23d33eab6e2eff6aa5d4e99f414d/examples/mnist/train_mnist.py#L122
(1 epoch, 500 iterations)
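For reference, here is a minimal sketch of how such a profile can be collected; the exact invocation I used is not what matters here, and this assumes the snippet replaces the original trainer.run() call in train_mnist.py:

```python
import cProfile
import pstats

# "trainer" is the chainer.training.Trainer built earlier in train_mnist.py (see the link above).
# Dump the raw profile to a file, then print the 10 functions with the largest self time.
cProfile.run("trainer.run()", "train_mnist.pstats")
pstats.Stats("train_mnist.pstats").sort_stats("tottime").print_stats(10)
```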
ncalls tottime percall cumtime percall filename:lineno(function)
145 0.621 0.004 0.621 0.004 {method 'poll' of 'select.poll' objects}
199 0.238 0.001 0.238 0.001 {built-in method posix.read}
27 0.220 0.008 0.220 0.008 {built-in method _posixsubprocess.fork_exec}
4000 0.214 0.000 0.214 0.000 {method 'dot' of 'clpy.core.core.ndarray' objects}
3000 0.165 0.000 0.244 0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
2000 0.157 0.000 0.247 0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
3500 0.140 0.000 0.356 0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/core/fusion.py:714(__call__)
2500 0.117 0.000 0.263 0.000 {method 'sum' of 'clpy.core.core.ndarray' objects}
8500 0.090 0.000 1.962 0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
3 0.080 0.027 0.080 0.027 {method 'normal' of 'mtrand.RandomState' objects}
Environment:
- OS: Linux version 4.15.0-54-generic (buildd@lgw01-amd64-039) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10))
- GPU: NVIDIA Corporation GV100 [TITAN V]
- Python: 3.6.5
The result above shows the 10 functions with the longest total time (tottime) and does not cover all the functions called.
It seems we cannot use nvvp's OpenCL profiling functions in our NVIDIA GPU environments, at least. A post on the NVIDIA forum states they no longer support OpenCL profiling in versions >= 8. https://devtalk.nvidia.com/default/topic/1030699/visual-profiler-and-nvprof/opencl-profiling-using-nvvp-command-line-profiler/post/5245061/#5245061 (@vorj pointed this out. Thanks.)
@ybsh I found 2 profilers for AMD GPUs. I haven't tried them yet, but they seem to support OpenCL kernel profiling. If you don't need to profile on NVIDIA GPUs, you can try them out.
Radeon GPU Profiler
Radeon Compute Profiler
@vorj @ybsh We need an NVIDIA GPU because we need to know why ClPy is slower than CuPy.
If we have no way to profile on NVIDIA GPUs, you will have to use other profiling tools for Python and Cython themselves.
Thank you again for your help @vorj. I bumped into two "performance tools" linked from NVIDIA's website. Do these look like what we are looking for? Both seem to "support NVIDIA GPUs and OpenCL":
- TAU
- Vampir
Since I'm new both to graphics and GPU profilers, I'd really appreciate your opinions. @LWisteria @vorj
I'm not soooo good at OpenCL profiling itself. Why don't you just try it?
Before going into graphics/GPGPU tools, I profiled train_mnist.py for CuPy/ClPy (again on titanv, with cProfile), this time with the same number of epochs and iterations as the performance report.
CuPy
- total execution time
real 52.27
user 51.63
sys 1.81
- top 20 by tottime
ncalls tottime percall cumtime percall filename:lineno(function)
72000 2.432 0.000 2.931 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
80 2.189 0.027 2.189 0.027 {built-in method matplotlib._png.write_png}
218000 2.171 0.000 20.920 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
70006 2.156 0.000 2.298 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/cupy/core/fusion.py:693(__call__)
102000 2.138 0.000 2.138 0.000 {method 'dot' of 'cupy.core.core.ndarray' objects}
52080 2.047 0.000 2.232 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
50000 1.558 0.000 1.614 0.000 {method 'sum' of 'cupy.core.core.ndarray' objects}
42003 1.335 0.000 2.295 0.000 {built-in method cupy.core.core.array}
12000 1.305 0.000 11.798 0.001 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:891(_backward_main)
33340 1.226 0.000 1.226 0.000 {built-in method numpy.concatenate}
14000 1.086 0.000 4.311 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:111(forward_gpu)
42000 1.008 0.000 2.259 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/connection/linear.py:30(forward)
24000 0.832 0.000 0.833 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/activation/relu.py:76(forward_gpu)
372001 0.822 0.000 2.815 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:447(__init__)
14000 0.783 0.000 0.783 0.000 {built-in method cupy.cudnn.get_handle}
12000 0.637 0.000 0.683 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:206(backward_gpu)
253111 0.625 0.000 0.864 0.000 {built-in method numpy.array}
399999 0.596 0.000 1.090 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:160(__init__)
4550436/4550431 0.582 0.000 0.768 0.000 {built-in method builtins.isinstance}
1202000 0.581 0.000 1.295 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/datasets/tuple_dataset.py:41(__getitem__)
ClPy
- total execution time
real 77.34
user 75.85
sys 2.42
- top 20 by tottime
ncalls tottime percall cumtime percall filename:lineno(function)
102000 4.765 0.000 4.765 0.000 {method 'dot' of 'clpy.core.core.ndarray' objects}
72000 4.505 0.000 9.484 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
52080 3.873 0.000 4.493 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
98000 3.491 0.000 5.770 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/core/fusion.py:714(__call__)
64000 2.670 0.000 4.281 0.000 {method 'sum' of 'clpy.core.core.ndarray' objects}
752502 2.434 0.000 4.534 0.000 {built-in method numpy.array}
999976/983678 2.380 0.000 3.336 0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
80 2.179 0.027 2.179 0.027 {built-in method matplotlib._png.write_png}
218000 2.168 0.000 34.324 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
582836 2.100 0.000 2.100 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/numpy/core/_internal.py:262(__array_interface__)
42003 1.684 0.000 2.333 0.000 {built-in method clpy.core.core.array}
14000 1.488 0.000 9.758 0.001 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:111(forward_gpu)
42000 1.446 0.000 3.957 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/connection/linear.py:30(forward)
12000 1.390 0.000 17.243 0.001 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/variable.py:891(_backward_main)
582836 1.303 0.000 8.662 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/numpy/core/_internal.py:274(_get_void_ptr)
582836 1.218 0.000 1.218 0.000 (HOMEDIR)/.pyenv/versions/3.6.5/lib/python3.6/ctypes/__init__.py:484(cast)
582836 1.105 0.000 1.105 0.000 {method 'from_buffer' of '_ctypes.PyCArrayType' objects}
12000 1.083 0.000 2.061 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:206(backward_gpu)
24000 1.021 0.000 1.070 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/activation/relu.py:76(forward_gpu)
372001 0.881 0.000 2.972 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/variable.py:447(__init__)
2 observations:
- The main 'bottleneck' I reported in the previous result turns out to be less significant at larger workload sizes.
- The two profiles share many common codepaths, which are overall more performant in CuPy. We cannot ascribe the performance gap to just a few codepaths.
I think I should try samples with more conspicuous performance gaps, such as word2vec (CuPy: 28 s, ClPy: 192 s).
@LWisteria I ran Chainer's train_word2vec.py on titanv with the default epochs/iterations (i.e. the same as the performance report), but the execution times were much longer than the reported figures even without cProfile enabled. Do you think I have misconfigured/misinterpreted something?
(CuPy [s] / ClPy [s])
- Reported: 27.84 / 192.12
- My try: 94.62 / 503.71
@ybsh I don't know why, but I don't think you need to know the reason to solve this issue. Do you?
@LWisteria Maybe not right now, but maybe yes in the long run. I thought it would be good to find hidden configuration differences (if there are any) to ensure reproducibility. At least the benchmark itself is working, so I am currently working on tracking down the bottleneck.
@ybsh You have now confirmed the fact that "ClPy is slower than CuPy", even with configs/settings other than those of the last report. So please keep going with your configs/settings!
Profiled train_word2vec.py with cProfile on titanv, epoch = 1.
CuPy
Execution time: 5.791 s
- top 10 by tottime
ncalls tottime percall cumtime percall filename:lineno(function)
1004 4.133 0.004 4.156 0.004 {method 'get' of 'cupy.core.core.ndarray' objects}
1004 0.086 0.000 4.297 0.004 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/links/loss/hierarchical_softmax.py:166(forward_gpu)
1 0.085 0.085 0.085 0.085 {built-in method cupy.cuda.nvrtc.compileProgram}
6880 0.078 0.000 4.886 0.001 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
1860 0.078 0.000 0.096 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
2008 0.075 0.000 0.142 0.000 {built-in method cupy.core.core.array}
1004 0.068 0.000 0.167 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/connection/embed_id.py:27(forward)
1005 0.051 0.000 0.063 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
930 0.048 0.000 0.364 0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:891(_backward_main)
2008 0.041 0.000 0.041 0.000 {method 'take' of 'numpy.ndarray' objects}
ClPy
Execution time: 25.878 s
- top 10 by tottime
ncalls tottime percall cumtime percall filename:lineno(function)
2008 21.344 0.011 21.388 0.011 {built-in method clpy.core.core.array}
1004 1.696 0.002 1.711 0.002 {method 'get' of 'clpy.core.core.ndarray' objects}
74 0.363 0.005 0.363 0.005 {method 'poll' of 'select.poll' objects}
3724 0.185 0.000 0.360 0.000 {method 'fill' of 'clpy.core.core.ndarray' objects}
1004 0.140 0.000 2.106 0.002 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/links/loss/hierarchical_softmax.py:166(forward_gpu)
1860 0.133 0.000 0.297 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
1005 0.100 0.000 0.230 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
102 0.097 0.001 0.097 0.001 {built-in method posix.read}
1004 0.096 0.000 0.191 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/connection/embed_id.py:27(forward)
25013 0.092 0.000 0.172 0.000 {built-in method numpy.array}
23005 0.080 0.000 0.080 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/numpy/core/_internal.py:262(__array_interface__)
This shows a great performance difference in {built-in method clpy.core.core.array}. Where is this called from?
- top 10 by cumtime (runtime including callees)
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.011 0.011 25.878 25.878 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/trainer.py:258(run)
930 0.008 0.000 25.020 0.027 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:213(update)
930 0.004 0.000 25.012 0.027 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:226(update_core)
1004 0.002 0.000 21.411 0.021 <string>:171(convert)
2008 0.006 0.000 21.409 0.011 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:235(to_gpu)
2008 0.006 0.000 21.397 0.011 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:278(_array_to_gpu)
2008 0.001 0.000 21.389 0.011 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/creation/from_data.py:44(asarray)
2008 21.344 0.011 21.388 0.011 {built-in method clpy.core.core.array}
930 0.016 0.000 3.837 0.004 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizer.py:519(update)
6880 0.076 0.000 2.917 0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
The top 8 rows show the effective call stack.
> Where is this called from?

You can use graphviz to make a call tree from cProfile results.
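For example, a minimal sketch using the third-party gprof2dot tool (gprof2dot is just one common way to do this, not something used elsewhere in this thread):

```python
import cProfile

# Dump the cProfile result to a pstats file ("trainer" comes from train_word2vec.py):
cProfile.run("trainer.run()", "train_word2vec.pstats")

# Then render a call tree with graphviz (shell commands):
#   gprof2dot -f pstats train_word2vec.pstats | dot -Tpng -o callgraph.png
```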
Code location: https://github.com/fixstars/clpy/blob/3a97570b87ad57e85e837678c863b62cda6e4513/clpy/core/core.pyx#L2153 Now I'm moving on to Cython-level profiling.
Cython-level profiling showed that copy_from_host() is where train_word2vec.py spends most of its time in ClPy.
Again the profiler output, this time with Cython profiling enabled, sorted by cumulative execution time (cumtime):
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.009 0.009 10.067 10.067 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/trainer.py:258(run)
930 0.004 0.000 9.600 0.010 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:213(update)
930 0.004 0.000 9.587 0.010 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:226(update_core)
1004 0.001 0.000 6.895 0.007 <string>:171(convert)
2008 0.005 0.000 6.893 0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:235(to_gpu)
2008 0.006 0.000 6.882 0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:278(_array_to_gpu)
2008 0.001 0.000 6.875 0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/creation/from_data.py:44(asarray)
2008 0.001 0.000 6.874 0.003 {built-in method clpy.core.core.array}
2008 0.001 0.000 6.872 0.003 clpy/core/core.pyx:2154(array (wrapper))
2008 0.008 0.000 6.871 0.003 clpy/core/core.pyx:2154(array)
2008 0.013 0.000 6.840 0.003 clpy/core/core.pyx:1734(set)
2008 6.780 0.003 6.781 0.003 clpy/backend/memory.pyx:354(copy_from_host)
930 0.012 0.000 2.609 0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizer.py:519(update)
to_gpu() is called 2008 times, and in every call copy_from_host() is called once.
As you can see in the previous report, CuPy's call count of {built-in method cupy.core.core.array} is also 2008.
Side note: the total execution time is around 10 s, which is about half as long as in the previous report, and I have no idea why this is happening...
How I observed this
Add this line
# cython: profile=True
to the head of
- clpy/backend/memory.pyx (location of copy_from_host())
- clpy/core/core.pyx (location of array())
and rebuild ClPy with python setup.py install.
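As an aside, the same directive can also be enabled from setup.py via Cython's compiler_directives; this is only a sketch of that alternative, not what I actually did (ClPy's real setup.py is more involved):

```python
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="clpy",  # illustrative only
    ext_modules=cythonize(
        ["clpy/backend/memory.pyx", "clpy/core/core.pyx"],
        compiler_directives={"profile": True},  # same effect as "# cython: profile=True"
    ),
)
```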
copy_from_host() could be reporting processing time that includes other operations. As you know from OpenCL, clEnqueueWrite/ReadBuffer() waits for all tasks in the CommandQueue to finish. You need to separate such waiting time from the transfer time itself.
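To illustrate the idea only: the standalone sketch below uses pyopencl (which ClPy does not use) and just shows how draining the queue first separates the waiting time from the transfer time.

```python
import time
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
host = np.random.rand(1 << 20).astype(np.float32)
dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=host.nbytes)

t0 = time.perf_counter()
queue.finish()                     # wait for previously enqueued work (the "waiting" part)
t1 = time.perf_counter()
cl.enqueue_copy(queue, dev, host)  # the host-to-device transfer itself
queue.finish()
t2 = time.perf_counter()
print(f"wait for earlier work: {t1 - t0:.6f} s, transfer: {t2 - t1:.6f} s")
```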
Thank you. Perhaps it is time to turn to OpenCL tracing tools, because it seems very hard to find out what is happening inside only with Cython profiling tools...
You can insert clFinish() explicitly before enqueuing the read/write.
@LWisteria OK, I will try that and see if the transfers are taking long or not.
Added clFinish() before the clEnqueueWriteBuffer call in memory.pyx:354 (copy_from_host).
As @LWisteria suggested might be the case, clEnqueueWriteBuffer was almost idle, waiting for previous tasks 😇
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.009 0.009 9.941 9.941 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/trainer.py:258(run)
930 0.004 0.000 9.447 0.010 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:213(update)
930 0.004 0.000 9.435 0.010 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:226(update_core)
1004 0.002 0.000 6.740 0.007 <string>:171(convert)
2008 0.005 0.000 6.738 0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:235(to_gpu)
2008 0.005 0.000 6.727 0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:278(_array_to_gpu)
2008 0.001 0.000 6.720 0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/creation/from_data.py:44(asarray)
2008 0.001 0.000 6.719 0.003 {built-in method clpy.core.core.array}
2008 0.001 0.000 6.718 0.003 clpy/core/core.pyx:2154(array (wrapper))
2008 0.008 0.000 6.717 0.003 clpy/core/core.pyx:2154(array)
2008 0.012 0.000 6.685 0.003 clpy/core/core.pyx:1734(set)
2008 0.005 0.000 6.628 0.003 clpy/backend/memory.pyx:354(copy_from_host)
2008 6.598 0.003 6.598 0.003 clpy/backend/opencl/api.pyx:306(Finish)
930 0.012 0.000 2.610 0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizer.py:519(update)
Now that the data transfers proved not to be the culprit, I'll try OpenCL tracing tools.
Can't you get the function call order? The call order alone is enough to check which commands were enqueued and which consume time.
I'm trying Python's trace module.
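Something along these lines (the exact options are an assumption):

```python
import trace

# Log each call/line as it executes, without collecting counts.
tracer = trace.Trace(count=False, trace=True)
tracer.run("trainer.run()")  # "trainer" comes from train_word2vec.py

# Equivalent from the command line:
#   python -m trace --trace train_word2vec.py
```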
As I mentioned above, I tried Python's trace module with all the options configured except the file-I/O-related ones, but there seems to be no effective way to list the order of ClPy-level function calls (it does not even capture anything inside functions defined with Cython).
I'm looking for another tool that is handier than GUI tools...