uftrace icon indicating copy to clipboard operation
uftrace copied to clipboard

python function tracing for both python2 and python3

Open honggyukim opened this issue 6 years ago • 17 comments

This PR is based on review/python-tracing-v3.

Related issue: #436

honggyukim avatar Oct 07 '19 17:10 honggyukim

This is a new PR based on #680, but rebased onto master with several conflict fixes.

@rls1004 Please take a look at it if you're still interested in.

honggyukim avatar Oct 07 '19 17:10 honggyukim

Here are some tests.

$ cat p-abc.py
#!/usr/bin/env python2

import os

def c():
    return os.getpid()

def b():
    return c()

def a():
    return b()

print(a())
$ chmod +x p-abc.py

$ uftrace --python --no-libcall p-abc.py
68270
# DURATION     TID     FUNCTION
            [ 68270] | <module:__main__>() {
            [ 68270] |   a() {
            [ 68270] |     b() {
   2.610 us [ 68270] |       c();
   6.717 us [ 68270] |     } /* b */
   9.997 us [ 68270] |   } /* a */
  46.204 us [ 68270] | } /* <module:__main__> */
  10.053 us [ 68270] | _remove();
   4.790 us [ 68270] | _remove();

Another test with fork.

$ cat p-fork.py
#!/usr/bin/env python

import os

def c():
    return os.getpid() % 100000

def b():
    return c() + 1

def a():
    return b() - 1

def main():
    ret = 0
    pid = os.fork()
    if pid:
        os.wait()

    ret += a()

    return ret

if __name__=='__main__':
    main()
$ chmod +x p-fork.py

$ uftrace --python --no-libcall p-fork.py
# DURATION     TID     FUNCTION
            [ 68311] | <module:__main__>() {
            [ 68311] |   main() {
            [ 68311] |     /* linux:sched-out */
            [ 68322] | a() {
            [ 68322] |   b() {
  37.690 us [ 68322] |     c();
  52.883 us [ 68322] |   } /* b */
  79.257 us [ 68322] | } /* a */
            [ 68322] | } /* main */
            [ 68322] | } /* <module:__main__> */
   9.980 us [ 68322] | _remove();
   4.326 us [ 68322] | _remove();
   6.881 ms [ 68311] |     /* linux:sched-in */
            [ 68311] |     a() {
            [ 68311] |       b() {
   8.196 us [ 68311] |         c();
  15.087 us [ 68311] |       } /* b */
  24.763 us [ 68311] |     } /* a */
   7.549 ms [ 68311] |   } /* main */
   7.569 ms [ 68311] | } /* <module:__main__> */
  10.060 us [ 68311] | _remove();
   4.130 us [ 68311] | _remove();

honggyukim avatar Oct 07 '19 17:10 honggyukim

However, it's a bit weird that python3 shows some function calls ahead, which begins with _find_and_load().

$ cat p-abc3.py
#!/usr/bin/env python3

import os

def c():
    return os.getpid()

def b():
    return c()

def a():
    return b()

print(a())
$ chmod +x p-abc3.py

$ uftrace --python --no-libcall -D 5 p-abc3.py
68399
# DURATION     TID     FUNCTION
            [ 68399] | _find_and_load() {
   4.663 us [ 68399] |   __init__();
            [ 68399] |   __enter__() {
            [ 68399] |     _get_module_lock() {
   8.027 us [ 68399] |       __init__();
  42.120 us [ 68399] |     } /* _get_module_lock */
  15.064 us [ 68399] |     acquire();
  68.584 us [ 68399] |   } /* __enter__ */
            [ 68399] |   _find_and_load_unlocked() {
            [ 68399] |     _find_spec() {
   2.193 us [ 68399] |       __enter__();
   7.750 us [ 68399] |       find_spec();
   2.777 us [ 68399] |       __exit__();
   1.766 us [ 68399] |       __enter__();
   4.070 us [ 68399] |       find_spec();
   2.264 us [ 68399] |       __exit__();
   1.667 us [ 68399] |       __enter__();
            [ 68399] |       find_spec() {
 856.770 us [ 68399] |         _get_spec();
 874.013 us [ 68399] |       } /* find_spec */
   3.190 us [ 68399] |       __exit__();
 990.430 us [ 68399] |     } /* _find_spec */
            [ 68399] |     _load_unlocked() {
            [ 68399] |       module_from_spec() {
   1.850 us [ 68399] |         create_module();
   4.090 us [ 68399] |         _new_module();
 171.060 us [ 68399] |         _init_module_attrs();
 196.000 us [ 68399] |       } /* module_from_spec */
            [ 68399] |       exec_module() {
 652.530 us [ 68399] |         get_code();
 534.144 us [ 68399] |         _call_with_frames_removed();
   1.205 ms [ 68399] |       } /* exec_module */
   2.983 us [ 68399] |       _verbose_message();
   1.436 ms [ 68399] |     } /* _load_unlocked */
   2.459 ms [ 68399] |   } /* _find_and_load_unlocked */
            [ 68399] |   __exit__() {
   9.910 us [ 68399] |     release();
  15.010 us [ 68399] |   } /* __exit__ */
   5.803 us [ 68399] |   cb();
   2.591 ms [ 68399] | } /* _find_and_load */
  11.496 us [ 68399] | getpreferredencoding();
            [ 68399] | __init__() {
   2.343 us [ 68399] |   __init__();
   8.663 us [ 68399] | } /* __init__ */
  13.974 us [ 68399] | decode();
            [ 68399] | <module:__main__>() {
            [ 68399] |   a() {
            [ 68399] |     b() {
   5.494 us [ 68399] |       c();
  10.254 us [ 68399] |     } /* b */
  14.280 us [ 68399] |   } /* a */
  57.117 us [ 68399] | } /* <module:__main__> */

I will take a look at it when I have some time.

honggyukim avatar Oct 07 '19 17:10 honggyukim

@namhyung Oops. Sorry that I pushed this work into your repository. My repository setting had some problems. Would it be better to close and open a new PR in my repository?

honggyukim avatar Oct 08 '19 00:10 honggyukim

The python3 tracing shows different result in another machine.

$ uftrace --python --no-libcall p-abc3.py 
11872
# DURATION     TID     FUNCTION
  14.206 us [ 11872] | getpreferredencoding();
            [ 11872] | __init__() {
   2.947 us [ 11872] |   __init__();
  11.820 us [ 11872] | } /* __init__ */
  17.623 us [ 11872] | decode();
            [ 11872] | <module:__main__>() {
            [ 11872] |   a() {
            [ 11872] |     b() {
   5.407 us [ 11872] |       c();
  10.047 us [ 11872] |     } /* b */
  14.130 us [ 11872] |   } /* a */
  73.452 us [ 11872] | } /* <module:__main__> */
   5.713 us [ 11872] | _ag();

$ python3 --version
Python 3.6.8

honggyukim avatar Oct 08 '19 02:10 honggyukim

@namhyung Oops. Sorry that I pushed this work into your repository. My repository setting had some problems. Would it be better to close and open a new PR in my repository?

Nop, it's ok, just leave it :)

namhyung avatar Oct 08 '19 03:10 namhyung

I want to have a single module to cover both python v2 or v3. Then I think we should finish #897 first.

namhyung avatar Oct 08 '19 03:10 namhyung

A similar log related to _find_and_load() appears when importing a module using importlib.

p-abc3.py :

#!/usr/bin/env python3
  
import importlib
os = importlib.import_module("os")

def c():
    return os.getpid()

def b():
    return c()

def a():
    return b()

print(a())

result :

$ ./uftrace --python --no-libc ../test/p-abc3.py 
14098
# DURATION     TID     FUNCTION
   4.397 us [ 14098] | getpreferredencoding();
            [ 14098] | __init__() {
   0.685 us [ 14098] |   __init__();
   3.236 us [ 14098] | } /* __init__ */
   4.202 us [ 14098] | decode();
            [ 14098] | <module:__main__>() {
            [ 14098] |   import_module() {
            [ 14098] |     _gcd_import() {
   1.676 us [ 14098] |       _sanity_check();
            [ 14098] |       _find_and_load() {
   0.970 us [ 14098] |         __init__();
            [ 14098] |         __enter__() {
            [ 14098] |           _get_module_lock() {
   2.333 us [ 14098] |             __init__();
  10.168 us [ 14098] |           } /* _get_module_lock */
   4.209 us [ 14098] |           acquire();
  17.157 us [ 14098] |         } /* __enter__ */
            [ 14098] |         __exit__() {
   2.791 us [ 14098] |           release();
   4.134 us [ 14098] |         } /* __exit__ */
   1.644 us [ 14098] |         cb();
            [ 14098] |         _lock_unlock_module() {
            [ 14098] |           _get_module_lock() {
   1.676 us [ 14098] |             __init__();
   7.506 us [ 14098] |           } /* _get_module_lock */
   2.683 us [ 14098] |           acquire();
   2.258 us [ 14098] |           release();
  15.392 us [ 14098] |         } /* _lock_unlock_module */
   1.370 us [ 14098] |         cb();
  50.181 us [ 14098] |       } /* _find_and_load */
  55.201 us [ 14098] |     } /* _gcd_import */
  58.700 us [ 14098] |   } /* import_module */
            [ 14098] |   a() {
            [ 14098] |     b() {
   1.601 us [ 14098] |       c();
   2.728 us [ 14098] |     } /* b */
   3.880 us [ 14098] |   } /* a */
 246.309 us [ 14098] | } /* <module:__main__> */
   1.389 us [ 14098] | _ag();

p-abc3.py for python 2 :

$ ./uftrace --python --no-libc ../test/p-abc3.py 
14193
# DURATION     TID     FUNCTION
            [ 14193] | <module:__main__>() {
  35.421 us [ 14193] |   <module:importlib>();
   4.749 us [ 14193] |   import_module();
            [ 14193] |   a() {
            [ 14193] |     b() {
   6.920 us [ 14193] |       c();
   8.125 us [ 14193] |     } /* b */
   9.174 us [ 14193] |   } /* a */
   2.903 ms [ 14193] | } /* <module:__main__> */
   4.546 us [ 14193] | _remove();
   1.435 us [ 14193] | _remove();

rls1004 avatar Oct 11 '19 15:10 rls1004

@rls1004 Thanks for testing. I just found that my system indirectly calls python3 as follows:

$ which python3
/home/honggyu/src/depot_tools/git/depot_tools/python3

It's actually a shell script and its content is as follows:

$ cat /home/honggyu/src/depot_tools/git/depot_tools/python3
#!/usr/bin/env bash

base_dir=$(dirname "$0")

PYTHON3_BIN_RELDIR="$(cat $base_dir/python3_bin_reldir.txt | xargs echo)"
PATH="bootstrap-3.8.0b1.chromium.1_bin/python3/bin":"bootstrap-3.8.0b1.chromium.1_bin/python3/bin/Scripts":"$PATH"
"$base_dir/bootstrap-3.8.0b1.chromium.1_bin/python3/bin/python3" "$@"

That's the reason of the difference. I tested it again by specifying the shebang explicitly to /usr/local/bin/python3 instead of /usr/bin/env python3 as follows:

$ cat p-abc3-2.py
#!/usr/local/bin/python3

import os

def c():
    return os.getpid()

def b():
    return c()

def a():
    return b()

print(a())

The result is same as expected without long trace output inside _find_and_load.

$ uftrace --python --no-libcall p-abc3-2.py
21656
# DURATION     TID     FUNCTION
  15.450 us [ 21656] | getpreferredencoding();
            [ 21656] | __init__() {
   2.947 us [ 21656] |   __init__();
  10.163 us [ 21656] | } /* __init__ */
  15.436 us [ 21656] | decode();
            [ 21656] | <module:__main__>() {
            [ 21656] |   a() {
            [ 21656] |     b() {
   2.883 us [ 21656] |       c();
   7.733 us [ 21656] |     } /* b */
  12.054 us [ 21656] |   } /* a */
  58.123 us [ 21656] | } /* <module:__main__> */

honggyukim avatar Oct 12 '19 03:10 honggyukim

Here is another test with uftrace/tests/runtest.py.

$ ./runtest.py 001 -p -O0
Start 1 tests with 1 worker
Test case                 pg
------------------------: O0
001 basic               : OK


runtime test stats
====================
total     1  Tests executed (success: 100.00%)
  OK:     1  Test succeeded
  OK:     0  Test succeeded (with some fixup)
  NG:     0  Different test result
  NZ:     0  Non-zero return value
  SG:     0  Abnormal exit by signal
  TM:     0  Test ran too long
  BI:     0  Build failed
  LA:     0  Unsupported Language
  SK:     0  Skipped

If uftrace tries to record the execution, it gets stuck and never finish as follows:

$ uftrace record --python --no-libcall ./runtest.py 001 -p -O0
Start 1 tests with 1 worker
Test case                 pg
------------------------: O0
(... got stuck here ...)

The remaining record looks fine and need to find the reason of getting stuck here.

honggyukim avatar Oct 12 '19 04:10 honggyukim

Here is replay output snippet.

# DURATION     TID     FUNCTION
            [ 23175] | <module:__main__>() {
            [ 23175] |   <module:random>() {
            [ 23175] |     <module:__future__>() {
   4.657 us [ 23175] |       _Feature();
   4.857 us [ 23175] |       __init__();
   3.023 us [ 23175] |       __init__();
...
            [ 23175] |       __call__() {
   1.874 us [ 23175] |         sub_debug();
 247.328 us [ 23190] |             } /* call */
 263.652 us [ 23190] |           } /* __del__ */
 274.966 us [ 23190] |         } /* _get_candidate_names */
            [ 23175] |         _terminate_pool() {
            [ 23190] |         next() {
   2.227 us [ 23175] |           debug();
  11.997 us [ 23190] |         } /* next */
   2.020 us [ 23175] |           debug();
            [ 23175] |           _help_stuff_finish() {
            [ 23190] |         next() {
   1.303 us [ 23175] |             debug();
   8.293 us [ 23190] |         } /* next */
  32.261 us [ 23190] |         choice();
            [ 23190] |         _get_candidate_names() {
   2.660 us [ 23190] |           __del__();
   9.476 us [ 23190] |         } /* _get_candidate_names */
   8.093 us [ 23190] |         next();
   7.423 us [ 23190] |         next();
   7.067 us [ 23190] |         choice();
 225.588 us [ 23195] |             } /* _put */
 244.589 us [ 23195] |           } /* put */
 258.959 us [ 23195] |         } /* apply_async */
  15.040 us [ 23195] |         _eintr_retry_call();

uftrace stopped tracing with remaining functions
================================================
task: 23175
[5] _help_stuff_finish
[4] _terminate_pool
[3] __call__
[2] _run_finalizers
[1] _exit_function
[0] _run_exitfuncs

task: 23190
[3] rng
[2] getenv
[1] _candidate_tempdir_list
[0] _newname

task: 23195
[3] _handle_exitstatus
[2] _set_cloexec_flag
[1] pipe_cloexec
[0] _get_default_tempdir

task: 23200
[1] __stop
[0] _exitfunc

It seems that it gets stuck duing pipe_cloexec.

$ uftrace replay --tid 23195
# DURATION     TID     FUNCTION
            [ 23195] | check_serial_case() {
  29.300 us [ 23195] |   currentThread();
  33.200 us [ 23195] |   <module:t001_basic>();
            [ 23195] |   <module:runtest>() {
            [ 23195] |     __init__() {
   5.113 us [ 23195] |       __init__();
            [ 23195] |       RLock() {
...
            [ 23195] |   pipe_cloexec() {
            [ 23195] |     _set_cloexec_flag() {
  13.141 us [ 23195] |       _eintr_retry_call();
            [ 23195] |       _handle_exitstatus() {
   9.667 us [ 23195] |         _eintr_retry_call();
   9.410 us [ 23195] |         _internal_poll();
            [ 23195] |         apply_async() {
            [ 23195] |           put() {
 225.588 us [ 23195] |             _put();
 244.589 us [ 23195] |           } /* put */
 258.959 us [ 23195] |         } /* apply_async */
  15.040 us [ 23195] |         _eintr_retry_call();

uftrace stopped tracing with remaining functions
================================================
task: 23195
[3] _handle_exitstatus
[2] _set_cloexec_flag
[1] pipe_cloexec
[0] _get_default_tempdir

honggyukim avatar Oct 12 '19 04:10 honggyukim

There seems to be a problem when storing data in a shared object.

The result of the test process is stored in the shared.results dictionary via the callback function save_test_result().

https://github.com/namhyung/uftrace/blob/322185a09c85e280733949db661ce975253aef7f/tests/runtest.py#L689-L697

https://github.com/namhyung/uftrace/blob/322185a09c85e280733949db661ce975253aef7f/tests/runtest.py#L552-L559

After each test is completed, the value stored in the shared object is displayed.

https://github.com/namhyung/uftrace/blob/322185a09c85e280733949db661ce975253aef7f/tests/runtest.py#L708-L714

However, when tracing using uftrace, data is not saved properly in the shared object. Thus, because there is no value stored in shared.results, it cannot exit the while statement and becomes infinite loop.

Here is a PoC.

#!/usr/bin/env python
import multiprocessing
from functools import partial

def foo(name):
    return name
def save(result, shared):
    shared.results[result] = 1

num_list = ['p1', 'p2', 'p3', 'p4']

manager = multiprocessing.Manager()
shared = manager.dict()
shared.results = dict()

clbk = partial(save, shared=shared)

pool = multiprocessing.Pool(processes=2)
serial_pool = multiprocessing.Pool(1)

for name in num_list:
    serial_pool.apply_async(foo, args=[name], callback=clbk)

pool.close()
pool.join()

print(shared.results)

Here is normal result.

$ python mp.py
{'p1': 1, 'p2': 1, 'p3': 1, 'p4': 1}

Run with uftrace.

$ uftrace record --python --no-libc ./mp.py 
{}

rls1004 avatar Oct 12 '19 17:10 rls1004

@rls1004 Thanks very much for writing the simple reproducible script.

I've tested it with uftrace and got the same problem.

$ ./mp.py
{'p2': 1, 'p3': 1, 'p1': 1, 'p4': 1}

$ uftrace record --python --no-libcall ./mp.py
{}

However, it's not a problem of our uftrace implementation. It seems that the problem comes from the sys.settrace that the current implementation relies on.

The builtin feature also shows the same problem as follows:

$ python -mtrace --trackcalls ./mp.py
{}

calling relationships:

*** ./mp.py ***
  --> /usr/lib/python2.7/functools.py
    mp.<module> -> functools.<module>
  --> /usr/lib/python2.7/multiprocessing/__init__.py
    mp.<module> -> __init__.<module>
    mp.<module> -> __init__.Manager
    mp.<module> -> __init__.Pool
...

I've also tested it with python3, but it doesn't make any difference unfortunately.

honggyukim avatar Oct 12 '19 20:10 honggyukim

We have to find another way or fix the problem in the mainline cpython implementation.

honggyukim avatar Oct 12 '19 20:10 honggyukim

@rls1004 If it's difficult to fix the problem inside CPython, is it possible you to propose the correct tracing feature in RustPython community? The only thing we need is to have hooks for python function entry and exit.

honggyukim avatar Oct 13 '19 07:10 honggyukim

In addition, our final goal is to provide the unified function trace across python script and its internal interpreter together.

The current sys.settrace is very heavy and puts a lot of overhead so I hope the python interpreter provides lightweight hooks for python script.

honggyukim avatar Oct 13 '19 07:10 honggyukim

@rls1004 Thanks for making a bug report to python community.

  • https://bugs.python.org/issue38511

Let's see if it can be fixed.

honggyukim avatar Oct 22 '19 01:10 honggyukim