ybsh
I'm trying to pin down the bottleneck lines by placing probes (```time.perf_counter()```) densely throughout the code block.
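A minimal sketch of this probe pattern (the actual probes are in the commit linked below; the labels and stand-in workload here are hypothetical):

```python
import time

def run_block(iterations=100):
    """Accumulate per-line elapsed time across iterations using perf_counter probes."""
    timings = {}

    def probe(label, start):
        # Add the elapsed time since `start` to this label's running total.
        timings[label] = timings.get(label, 0.0) + (time.perf_counter() - start)

    for _ in range(iterations):
        t = time.perf_counter()
        sum(range(10_000))               # stand-in for one suspect line
        probe("line_a", t)

        t = time.perf_counter()
        [i * i for i in range(10_000)]   # stand-in for another suspect line
        probe("line_b", t)

    return timings

if __name__ == "__main__":
    for label, total in sorted(run_block().items(), key=lambda kv: -kv[1]):
        print(f"{label}: {total:.6f} s")
```

Sorting the accumulated totals in descending order makes the most expensive line stand out directly.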
I created a new branch ```280-for-profile-launch-cb2``` off ```153-for-profile``` and added probes as follows: https://github.com/fixstars/clpy/commit/ba580c301fca I ran ```train_mnist.py``` (100 iterations). The total execution time of this code block (```ndarray_time```) was 0.258964 s. ...
I executed it 3 more times; the execution times do not differ much across trials (the differences are at most about ±10%).

| 1st trial | 2nd trial | 3rd trial |
| -- |...
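The "at most about ±10%" claim can be checked mechanically. Only the 0.258964 s figure is from the run above; the other trial times below are hypothetical placeholders:

```python
from statistics import mean

# First value is the measured ndarray_time; the rest are hypothetical trials.
trials = [0.258964, 0.262, 0.249, 0.271, 0.255]

m = mean(trials)
# Largest relative deviation of any trial from the mean.
max_dev = max(abs(t - m) / m for t in trials)
print(f"mean = {m:.6f} s, max deviation = {max_dev:.1%}")
```

If ```max_dev``` stays well under 0.10, the measurement noise is small enough that the per-line ranking can be trusted.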
```arrayInfo.offset = a.data.cl_mem_offset()``` takes the longest. Its definition is here: https://github.com/fixstars/clpy/blob/clpy/clpy/backend/memory.pyx#L457-L461
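One way to confirm that this single call dominates is to micro-benchmark it in isolation with ```timeit```. Since the real object is a ClPy memory pointer, a plain Python stand-in class is used here; the per-call cost of the real ```cl_mem_offset()``` would be measured the same way:

```python
import timeit

class FakeData:
    """Hypothetical stand-in for a.data (the real one is a ClPy MemoryPointer)."""
    def cl_mem_offset(self):
        return 0

data = FakeData()
n = 100_000
# Average wall-clock cost of one call, in nanoseconds.
total = timeit.timeit(lambda: data.cl_mem_offset(), number=n)
print(f"{total / n * 1e9:.1f} ns per call")
```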
@LWisteria I ran the Chainer (3.3.0) example ```train_mnist.py``` and profiled ```trainer.run()``` with cProfile. All I could tell from the result was that the function spends most of its time waiting...
The result above shows the 10 functions with the longest total time (```tottime```) and does not cover all of the functions called.
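A top-10-by-```tottime``` listing like the one above can be produced with the standard ```cProfile```/```pstats``` pair, roughly as follows (a dummy workload stands in for ```trainer.run()``` here):

```python
import cProfile
import io
import pstats

def workload():
    # Stand-in for trainer.run().
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
# Sort by total time spent inside each function itself (excluding callees)
# and show only the top 10 entries.
stats.sort_stats("tottime").print_stats(10)
print(stream.getvalue())
```

Note that ```tottime``` excludes time spent in callees; sorting by ```cumtime``` instead would surface wrappers like ```trainer.run()``` itself rather than the leaf functions doing the work.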
It seems that, at least in our NVIDIA GPU environments, we cannot use nvvp's OpenCL profiling features. A post on the NVIDIA forum states they no longer support OpenCL profiling in...
Thank you again for your help @vorj. I came across two "performance tools" linked from [NVIDIA's website](https://developer.nvidia.com/performance-analysis-tools). Do these look like what we are looking...
Before moving on to graphics/GPGPU tools, I profiled ```train_mnist.py``` for CuPy/ClPy (again on titanv, with cProfile), this time with the same numbers of epochs and iterations as in the [performance report](https://github.com/fixstars/clpy/wiki/chainer_example_performance_report)....
I think I should try samples with a more conspicuous performance gap, such as word2vec (CuPy: 192 s, ClPy: 28 s).