ybsh
I'm trying to pin down the bottleneck lines by placing probes (```time.perf_counter()```) densely throughout the code block.
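A minimal sketch of this probe pattern (the actual probes are in the commit linked below; the labels and stand-in workload here are hypothetical):

```python
import time

def run_block(iterations=100):
    """Accumulate per-line elapsed time across iterations using perf_counter probes."""
    timings = {}

    def probe(label, start):
        # Add the elapsed time since `start` to this label's running total.
        timings[label] = timings.get(label, 0.0) + (time.perf_counter() - start)

    for _ in range(iterations):
        t = time.perf_counter()
        sum(range(10_000))               # stand-in for one suspect line
        probe("line_a", t)

        t = time.perf_counter()
        [i * i for i in range(10_000)]   # stand-in for another suspect line
        probe("line_b", t)

    return timings

if __name__ == "__main__":
    for label, total in sorted(run_block().items(), key=lambda kv: -kv[1]):
        print(f"{label}: {total:.6f} s")
```

Sorting the accumulated totals in descending order makes the most expensive line stand out directly.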
I created a new branch ```280-for-profile-launch-cb2``` off ```153-for-profile``` and added probes as follows: https://github.com/fixstars/clpy/commit/ba580c301fca I ran ```train_mnist.py``` (100 iterations). The total execution time of this code block (```ndarray_time```) was 0.258964 s. ...
I executed it 3 more times; the execution times do not differ much across trials (the differences are at most about ±10%).

| 1st trial | 2nd trial | 3rd trial |
| -- |...
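The "at most about ±10%" claim can be checked mechanically. Only the 0.258964 s figure is from the run above; the other trial times below are hypothetical placeholders:

```python
from statistics import mean

# First value is the measured ndarray_time; the rest are hypothetical trials.
trials = [0.258964, 0.262, 0.249, 0.271, 0.255]

m = mean(trials)
# Largest relative deviation of any trial from the mean.
max_dev = max(abs(t - m) / m for t in trials)
print(f"mean = {m:.6f} s, max deviation = {max_dev:.1%}")
```

If ```max_dev``` stays well under 0.10, the measurement noise is small enough that the per-line ranking can be trusted.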
```arrayInfo.offset = a.data.cl_mem_offset()``` takes the longest. Its definition is here: https://github.com/fixstars/clpy/blob/clpy/clpy/backend/memory.pyx#L457-L461
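One way to confirm that this single call dominates is to micro-benchmark it in isolation with ```timeit```. Since the real object is a ClPy memory pointer, a plain Python stand-in class is used here; the per-call cost of the real ```cl_mem_offset()``` would be measured the same way:

```python
import timeit

class FakeData:
    """Hypothetical stand-in for a.data (the real one is a ClPy MemoryPointer)."""
    def cl_mem_offset(self):
        return 0

data = FakeData()
n = 100_000
# Average wall-clock cost of one call, in nanoseconds.
total = timeit.timeit(lambda: data.cl_mem_offset(), number=n)
print(f"{total / n * 1e9:.1f} ns per call")
```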
@LWisteria I ran the Chainer (3.3.0) example ```train_mnist.py``` and profiled ```trainer.run()``` with cProfile. All I could tell from the result was that the function spends most of its time waiting...
The result above shows the 10 functions with the longest total time (```tottime```) and does not cover all of the functions called.
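A top-10-by-```tottime``` listing like the one above can be produced with the standard ```cProfile```/```pstats``` pair, roughly as follows (a dummy workload stands in for ```trainer.run()``` here):

```python
import cProfile
import io
import pstats

def workload():
    # Stand-in for trainer.run().
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
# Sort by total time spent inside each function itself (excluding callees)
# and show only the top 10 entries.
stats.sort_stats("tottime").print_stats(10)
print(stream.getvalue())
```

Note that ```tottime``` excludes time spent in callees; sorting by ```cumtime``` instead would surface wrappers like ```trainer.run()``` itself rather than the leaf functions doing the work.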
It seems that, at least in our NVIDIA GPU environments, we cannot use nvvp's OpenCL profiling features. A post on the NVIDIA forum states they no longer support OpenCL profiling in...
Thank you again for your help @vorj. I came across two "performance tools" linked from [NVIDIA's website](https://developer.nvidia.com/performance-analysis-tools). Do these look like what we are looking...
Before moving on to graphics/GPGPU tools, I profiled ```train_mnist.py``` for CuPy/ClPy (again on titanv, with cProfile), this time with the same numbers of epochs and iterations as in the [performance report](https://github.com/fixstars/clpy/wiki/chainer_example_performance_report)....
I think I should try samples with a more conspicuous performance gap, such as word2vec (CuPy: 192 s, ClPy: 28 s).