
Why is ClPy slower than CuPy even on the same machine?

LWisteria opened this issue 6 years ago • 116 comments

According to the performance report, ClPy is slower than CuPy on both TITAN V and GeForce 1060.

We need to know the reason (and improve if we can).

LWisteria avatar Dec 18 '18 06:12 LWisteria

#163 reported clWaitForEvents was one of the problems.

LWisteria avatar Feb 01 '19 07:02 LWisteria

This article (in Japanese) could help us https://qiita.com/shu65/items/42914bd2ad01d1e323da

LWisteria avatar Feb 08 '19 04:02 LWisteria

@vorj I hear you're trying this.

LWisteria avatar Feb 14 '19 05:02 LWisteria

@LWisteria I ran Chainer's (3.3.0) example code train_mnist.py and profiled trainer.run() with cProfile. All I could tell from the result was that the function spends most of its time waiting for I/O completion. Is there any way to further track down the bottleneck? https://github.com/chainer/chainer/blob/a01c33f515ec23d33eab6e2eff6aa5d4e99f414d/examples/mnist/train_mnist.py#L122

(1 epoch, 500 iterations)

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      145    0.621    0.004    0.621    0.004 {method 'poll' of 'select.poll' objects}
      199    0.238    0.001    0.238    0.001 {built-in method posix.read}
       27    0.220    0.008    0.220    0.008 {built-in method _posixsubprocess.fork_exec}
     4000    0.214    0.000    0.214    0.000 {method 'dot' of 'clpy.core.core.ndarray' objects}
     3000    0.165    0.000    0.244    0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
     2000    0.157    0.000    0.247    0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
     3500    0.140    0.000    0.356    0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/core/fusion.py:714(__call__)
     2500    0.117    0.000    0.263    0.000 {method 'sum' of 'clpy.core.core.ndarray' objects}
     8500    0.090    0.000    1.962    0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
        3    0.080    0.027    0.080    0.027 {method 'normal' of 'mtrand.RandomState' objects}

Environment:

  • OS: Linux 4.15.0-54-generic (gcc 5.4.0 20160609, Ubuntu 5.4.0-6ubuntu1~16.04.10)
  • GPU: NVIDIA Corporation GV100 [TITAN V]
  • Python: 3.6.5

ybsh avatar Aug 02 '19 05:08 ybsh

The result above shows the 10 functions with the largest total time (tottime) and does not cover all the functions called.
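
For reference, a minimal sketch of the kind of cProfile/pstats flow used here (the helper name and output file are placeholders, not the actual script; trainer is the Chainer trainer that train_mnist.py builds):

import cProfile
import pstats

def profile_run(trainer, out="train_mnist.prof", top=10):
    """Profile trainer.run() and print the functions with the largest tottime."""
    profiler = cProfile.Profile()
    profiler.enable()
    trainer.run()  # the training loop under measurement
    profiler.disable()
    profiler.dump_stats(out)
    pstats.Stats(out).sort_stats("tottime").print_stats(top)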

ybsh avatar Aug 02 '19 05:08 ybsh

It seems we cannot use nvvp's OpenCL profiling features in our NVIDIA GPU environments, at least. A post on the NVIDIA forum states that OpenCL profiling is no longer supported in versions >= 8. https://devtalk.nvidia.com/default/topic/1030699/visual-profiler-and-nvprof/opencl-profiling-using-nvvp-command-line-profiler/post/5245061/#5245061 (@vorj pointed this out. Thanks.)

ybsh avatar Aug 05 '19 05:08 ybsh

@ybsh I found 2 profilers for AMD GPUs. I haven't tried them yet, but they seem to support OpenCL kernel profiling. If you don't need to profile on NVIDIA GPUs, you could try these:
Radeon GPU Profiler
Radeon Compute Profiler

vorj avatar Aug 05 '19 06:08 vorj

@vorj @ybsh we need an NVIDIA GPU because we need to know why ClPy is slower than CuPy.

LWisteria avatar Aug 05 '19 06:08 LWisteria

If we have no way to profile on NVIDIA GPUs, you'll have to use other profiling tools for Python and Cython itself.

LWisteria avatar Aug 05 '19 06:08 LWisteria

Thank you again for your help @vorj. I bumped into two "performance tools" linked from NVIDIA's website. Do these look like what we are looking for? Both seem to "support NVIDIA GPUs and OpenCL":
TAU
Vampir
Since I'm new to both graphics and GPU profilers, I'd really appreciate your opinions. @LWisteria @vorj

ybsh avatar Aug 05 '19 06:08 ybsh

I'm not soooo good at OpenCL profiling itself. Why don't you just try them?

LWisteria avatar Aug 05 '19 06:08 LWisteria

Before moving on to graphics/GPGPU tools, I profiled train_mnist.py with both CuPy and ClPy (again on titanv, with cProfile), this time with the same number of epochs and iterations as the performance report.

CuPy

  • total execution time
real 52.27
user 51.63
sys 1.81
  • top 20 tottimes
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    72000    2.432    0.000    2.931    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
       80    2.189    0.027    2.189    0.027 {built-in method matplotlib._png.write_png}
   218000    2.171    0.000   20.920    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
    70006    2.156    0.000    2.298    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/cupy/core/fusion.py:693(__call__)
   102000    2.138    0.000    2.138    0.000 {method 'dot' of 'cupy.core.core.ndarray' objects}
    52080    2.047    0.000    2.232    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
    50000    1.558    0.000    1.614    0.000 {method 'sum' of 'cupy.core.core.ndarray' objects}
    42003    1.335    0.000    2.295    0.000 {built-in method cupy.core.core.array}
    12000    1.305    0.000   11.798    0.001 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:891(_backward_main)
    33340    1.226    0.000    1.226    0.000 {built-in method numpy.concatenate}
    14000    1.086    0.000    4.311    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:111(forward_gpu)
    42000    1.008    0.000    2.259    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/connection/linear.py:30(forward)
    24000    0.832    0.000    0.833    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/activation/relu.py:76(forward_gpu)
   372001    0.822    0.000    2.815    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:447(__init__)
    14000    0.783    0.000    0.783    0.000 {built-in method cupy.cudnn.get_handle}
    12000    0.637    0.000    0.683    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:206(backward_gpu)
   253111    0.625    0.000    0.864    0.000 {built-in method numpy.array}
   399999    0.596    0.000    1.090    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:160(__init__)
4550436/4550431    0.582    0.000    0.768    0.000 {built-in method builtins.isinstance}
  1202000    0.581    0.000    1.295    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/datasets/tuple_dataset.py:41(__getitem__)

ClPy

  • total execution time
real 77.34
user 75.85
sys 2.42
  • top 20 tottimes
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   102000    4.765    0.000    4.765    0.000 {method 'dot' of 'clpy.core.core.ndarray' objects}
    72000    4.505    0.000    9.484    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
    52080    3.873    0.000    4.493    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
    98000    3.491    0.000    5.770    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/core/fusion.py:714(__call__)
    64000    2.670    0.000    4.281    0.000 {method 'sum' of 'clpy.core.core.ndarray' objects}
   752502    2.434    0.000    4.534    0.000 {built-in method numpy.array}
999976/983678    2.380    0.000    3.336    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
       80    2.179    0.027    2.179    0.027 {built-in method matplotlib._png.write_png}
   218000    2.168    0.000   34.324    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
   582836    2.100    0.000    2.100    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/numpy/core/_internal.py:262(__array_interface__)
    42003    1.684    0.000    2.333    0.000 {built-in method clpy.core.core.array}
    14000    1.488    0.000    9.758    0.001 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:111(forward_gpu)
    42000    1.446    0.000    3.957    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/connection/linear.py:30(forward)
    12000    1.390    0.000   17.243    0.001 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/variable.py:891(_backward_main)
   582836    1.303    0.000    8.662    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/numpy/core/_internal.py:274(_get_void_ptr)
   582836    1.218    0.000    1.218    0.000 (HOMEDIR)/.pyenv/versions/3.6.5/lib/python3.6/ctypes/__init__.py:484(cast)
   582836    1.105    0.000    1.105    0.000 {method 'from_buffer' of '_ctypes.PyCArrayType' objects}
    12000    1.083    0.000    2.061    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:206(backward_gpu)
    24000    1.021    0.000    1.070    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/activation/relu.py:76(forward_gpu)
   372001    0.881    0.000    2.972    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/variable.py:447(__init__)

2 observations:

  • The main 'bottleneck' I reported in the previous result turns out to be much less significant at larger workload sizes.
  • The two profiles share many common code paths, and those paths are faster across the board with CuPy. We cannot ascribe the performance gap to just a few code paths.

ybsh avatar Aug 07 '19 06:08 ybsh

I think I should try samples with more conspicuous performance gaps, such as word2vec (CuPy: 28 s, ClPy: 192 s).

ybsh avatar Aug 07 '19 06:08 ybsh

@LWisteria I ran Chainer's train_word2vec.py on titanv with the default epochs/iterations (i.e. the same as the performance report), but the execution times were much longer than the reported figures even without cProfile enabled. Do you think I have misconfigured/misinterpreted something?

(CuPy [s] / ClPy [s])

  • Reported: 27.84 / 192.12
  • My try: 94.62 / 503.71

ybsh avatar Aug 07 '19 07:08 ybsh

@ybsh I don't know why, but I think you don't need to know the reason to solve this issue. Do you?

LWisteria avatar Aug 07 '19 08:08 LWisteria

@LWisteria Maybe not right now, but maybe in the long run. I thought it would be good to find hidden configuration differences (if there are any) to ensure reproducibility. At least the benchmark itself is working, so I am currently tracking down the bottleneck.

ybsh avatar Aug 07 '19 08:08 ybsh

@ybsh Now you've confirmed that "ClPy is slower than CuPy", even with configs/settings other than those in the last report. So please keep working with your configs/settings!

LWisteria avatar Aug 07 '19 08:08 LWisteria

Profiled train_word2vec.py with cProfile on titanv, epoch = 1.

CuPy

Execution time: 5.791 s

  • top 10 tottime
  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1004    4.133    0.004    4.156    0.004 {method 'get' of 'cupy.core.core.ndarray' objects}
     1004    0.086    0.000    4.297    0.004 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/links/loss/hierarchical_softmax.py:166(forward_gpu)
        1    0.085    0.085    0.085    0.085 {built-in method cupy.cuda.nvrtc.compileProgram}
     6880    0.078    0.000    4.886    0.001 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
     1860    0.078    0.000    0.096    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
     2008    0.075    0.000    0.142    0.000 {built-in method cupy.core.core.array}
     1004    0.068    0.000    0.167    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/connection/embed_id.py:27(forward)
     1005    0.051    0.000    0.063    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
      930    0.048    0.000    0.364    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:891(_backward_main)
     2008    0.041    0.000    0.041    0.000 {method 'take' of 'numpy.ndarray' objects}

ClPy

Execution time: 25.878 s

  • top 10 tottime
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2008   21.344    0.011   21.388    0.011 {built-in method clpy.core.core.array}
     1004    1.696    0.002    1.711    0.002 {method 'get' of 'clpy.core.core.ndarray' objects}
       74    0.363    0.005    0.363    0.005 {method 'poll' of 'select.poll' objects}
     3724    0.185    0.000    0.360    0.000 {method 'fill' of 'clpy.core.core.ndarray' objects}
     1004    0.140    0.000    2.106    0.002 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/links/loss/hierarchical_softmax.py:166(forward_gpu)
     1860    0.133    0.000    0.297    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
     1005    0.100    0.000    0.230    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
      102    0.097    0.001    0.097    0.001 {built-in method posix.read}
     1004    0.096    0.000    0.191    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/connection/embed_id.py:27(forward)
    25013    0.092    0.000    0.172    0.000 {built-in method numpy.array}
    23005    0.080    0.000    0.080    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/numpy/core/_internal.py:262(__array_interface__)

This shows a huge performance difference in {built-in method clpy.core.core.array}. Where is this called from?

  • top 10 cumtime (runtime including callees)
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.011    0.011   25.878   25.878 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/trainer.py:258(run)
      930    0.008    0.000   25.020    0.027 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:213(update)
      930    0.004    0.000   25.012    0.027 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:226(update_core)
     1004    0.002    0.000   21.411    0.021 <string>:171(convert)
     2008    0.006    0.000   21.409    0.011 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:235(to_gpu)
     2008    0.006    0.000   21.397    0.011 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:278(_array_to_gpu)
     2008    0.001    0.000   21.389    0.011 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/creation/from_data.py:44(asarray)
     2008   21.344    0.011   21.388    0.011 {built-in method clpy.core.core.array}
      930    0.016    0.000    3.837    0.004 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizer.py:519(update)
     6880    0.076    0.000    2.917    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)

The top 8 entries effectively form the call stack.

ybsh avatar Aug 07 '19 08:08 ybsh

Where is this called from?

You can use graphviz to build a call tree from the cProfile results.
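
As a lighter first check, the standard pstats module can also print the callers directly — a minimal sketch, assuming the stats were dumped to a hypothetical train_word2vec.prof:

import pstats

# List every recorded caller of clpy.core.core.array, with the time
# attributed to each call site.
stats = pstats.Stats("train_word2vec.prof")
stats.print_callers("clpy.core.core.array")

(For the full tree, gprof2dot with -f pstats can turn the same dump into graphviz input.)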

LWisteria avatar Aug 07 '19 08:08 LWisteria

Code location: https://github.com/fixstars/clpy/blob/3a97570b87ad57e85e837678c863b62cda6e4513/clpy/core/core.pyx#L2153 Now I'm moving on to Cython-level profiling.

ybsh avatar Aug 07 '19 09:08 ybsh

Cython-level profiling shows that copy_from_host() is where train_word2vec.py spends most of its time with ClPy.

Again the cProfile output (now with Cython profiling enabled), sorted by cumulative execution time:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.009    0.009   10.067   10.067 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/trainer.py:258(run)
      930    0.004    0.000    9.600    0.010 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:213(update)
      930    0.004    0.000    9.587    0.010 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:226(update_core)
     1004    0.001    0.000    6.895    0.007 <string>:171(convert)
     2008    0.005    0.000    6.893    0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:235(to_gpu)
     2008    0.006    0.000    6.882    0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:278(_array_to_gpu)
     2008    0.001    0.000    6.875    0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/creation/from_data.py:44(asarray)
     2008    0.001    0.000    6.874    0.003 {built-in method clpy.core.core.array}
     2008    0.001    0.000    6.872    0.003 clpy/core/core.pyx:2154(array (wrapper))
     2008    0.008    0.000    6.871    0.003 clpy/core/core.pyx:2154(array)
     2008    0.013    0.000    6.840    0.003 clpy/core/core.pyx:1734(set)
     2008    6.780    0.003    6.781    0.003 clpy/backend/memory.pyx:354(copy_from_host)
      930    0.012    0.000    2.609    0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizer.py:519(update)

to_gpu() is called 2008 times, and each call invokes copy_from_host() once. As you can see in the previous report, the call count of {built-in method cupy.core.core.array} in the CuPy run is also 2008.

Sidenote: the total execution time is around 10 s, about half as long as in the previous report, and I have no idea why this is happening...

How I observed this

Add this line

# cython: profile=True

to the head of

  • clpy/backend/memory.pyx (location of copy_from_host())
  • clpy/core/core.pyx (location of array())

Then rebuild ClPy with python setup.py install.

ybsh avatar Aug 09 '19 07:08 ybsh

copy_from_host() could be reporting time spent on other work, not just its own. As you know from OpenCL, clEnqueueWriteBuffer()/clEnqueueReadBuffer() waits for all preceding tasks in the command queue to finish. You need to separate that waiting time from the transfer time itself.

LWisteria avatar Aug 09 '19 07:08 LWisteria

Thank you. Perhaps it is time to turn to OpenCL tracing tools, because it seems very hard to find out what is happening inside with Cython profiling tools alone...

ybsh avatar Aug 09 '19 07:08 ybsh

You can insert clFinish() explicitly before enqueuing the read/write.
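
The idea, as a rough sketch (illustrative only, not ClPy's actual code; both callables are placeholders):

import time

def timed_write(queue_finish, enqueue_write_buffer):
    """Separate queue-wait time from the host-to-device transfer itself."""
    t0 = time.perf_counter()
    queue_finish()           # clFinish(): drain everything already enqueued
    t1 = time.perf_counter()
    enqueue_write_buffer()   # clEnqueueWriteBuffer(): now just the copy
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1  # (wait time, transfer time)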

LWisteria avatar Aug 09 '19 07:08 LWisteria

@LWisteria OK, I will try that and see if the transfers are taking long or not.

ybsh avatar Aug 09 '19 07:08 ybsh

Added clFinish() before the clEnqueueWriteBuffer in memory.pyx:354(copy_from_host). As @LWisteria suspected, clEnqueueWriteBuffer was spending almost all of its time just waiting for previous tasks 😇

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.009    0.009    9.941    9.941 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/trainer.py:258(run)
      930    0.004    0.000    9.447    0.010 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:213(update)
      930    0.004    0.000    9.435    0.010 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:226(update_core)
     1004    0.002    0.000    6.740    0.007 <string>:171(convert)
     2008    0.005    0.000    6.738    0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:235(to_gpu)
     2008    0.005    0.000    6.727    0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:278(_array_to_gpu)
     2008    0.001    0.000    6.720    0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/creation/from_data.py:44(asarray)
     2008    0.001    0.000    6.719    0.003 {built-in method clpy.core.core.array}
     2008    0.001    0.000    6.718    0.003 clpy/core/core.pyx:2154(array (wrapper))
     2008    0.008    0.000    6.717    0.003 clpy/core/core.pyx:2154(array)
     2008    0.012    0.000    6.685    0.003 clpy/core/core.pyx:1734(set)
     2008    0.005    0.000    6.628    0.003 clpy/backend/memory.pyx:354(copy_from_host)
     2008    6.598    0.003    6.598    0.003 clpy/backend/opencl/api.pyx:306(Finish)
      930    0.012    0.000    2.610    0.003 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizer.py:519(update)

ybsh avatar Aug 09 '19 08:08 ybsh

Now that the data transfers proved not to be the culprit, I'll try OpenCL tracing tools.

ybsh avatar Aug 09 '19 08:08 ybsh

Can't you get the function calling order? The calling order would be enough to check what command was enqueued and which one consumes the time.
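
For example, sys.setprofile can log Python- and C-level calls in order — a rough sketch (it only sees the ClPy functions that enqueue work, not the OpenCL commands themselves):

import sys

def log_call_order(frame, event, arg):
    # Built-ins such as clpy.core.core.array show up as 'c_call' events.
    if event == "c_call":
        print("c_call", arg.__name__)
    elif event == "call":
        print("call  ", frame.f_code.co_name)

sys.setprofile(log_call_order)
# ... run one training iteration here ...
sys.setprofile(None)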

LWisteria avatar Aug 09 '19 08:08 LWisteria

I'm trying Python's trace module.

ybsh avatar Aug 14 '19 01:08 ybsh

As I mentioned above, I tried Python's trace module with all the options enabled except the file-I/O-related ones, but there seems to be no effective way to list the order of ClPy-level function calls (it does not even capture anything inside functions defined in Cython). I'm looking for another tool that looks handier than GUI tools...
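
For the record, this is roughly what I ran (a sketch; entry_point stands in for the script's main()):

import trace

def trace_call_order(entry_point):
    # Log every executed Python line of entry_point(). Cython-compiled frames
    # never show up here, which is why this gives no visibility into ClPy internals.
    tracer = trace.Trace(count=False, trace=True)
    tracer.runfunc(entry_point)  # e.g. main() from train_word2vec.py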

ybsh avatar Aug 14 '19 03:08 ybsh