Segfault when calling python function repeatedly

Open jco255 opened this issue 6 months ago • 4 comments

my first issue go easy

Affects: PythonCall

Describe: I'm seeing segfaults after running for a while. I've tried to create an MWE; it usually segfaults on the first go, but sometimes I have to re-run the for-loop a few times. It feels like a GC race condition like I've seen in other issues, but I have no real proof other than that the more stuff I have going on, the faster it segfaults.

using PythonCall

f1 = @pyexec """ 
def pyfunc(params):
    return 0
""" => pyfunc

params = Dict{String,Float64}(
    "A" => 1,
    "B" => 2,
    "C" => 3,
    )

for i in 1:10_000_000
    f1(params)
end
[965155] signal 11 (1): Segmentation fault
in expression starting at /home/olssoj2/git/julia_local/scratch_hy_mwe.jl:18
unknown function (ip: 0x7f30d83b03ae)
unknown function (ip: 0x7f30d82b1187)
Py_DecRef at /home/olssoj2/.julia/packages/PythonCall/L4cjh/src/C/pointers.jl:303 [inlined]
pydel! at /home/olssoj2/.julia/packages/PythonCall/L4cjh/src/Core/Py.jl:114
#pycall#21 at /home/olssoj2/.julia/packages/PythonCall/L4cjh/src/Core/builtins.jl:244
jl_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/builtins.c:831
#_#11 at /home/olssoj2/.julia/packages/PythonCall/L4cjh/src/Core/Py.jl:357
jl_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/builtins.c:831
Py at /home/olssoj2/.julia/packages/PythonCall/L4cjh/src/Core/Py.jl:357
top-level scope at /home/olssoj2/git/julia_local/scratch_hy_mwe.jl:19
jl_toplevel_eval_flex at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/toplevel.c:934
jl_toplevel_eval_flex at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/toplevel.c:886
ijl_toplevel_eval_in at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/toplevel.c:994
eval at ./boot.jl:430 [inlined]

Julia Version 1.11.5
Commit 760b2e5b739 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 160 × Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, icelake-server)
Threads: 1 default, 0 interactive, 1 GC (on 160 virtual cores)
Environment:
  JULIA_NUM_THREADS =
  JULIA_VSCODE_REPL = 1
  JULIA_EDITOR = code
  LD_LIBRARY_PATH = /opt/rh/gcc-toolset-14/root/usr/lib64:/opt/rh/gcc-toolset-14/root/usr/lib

(@v1.11) pkg> status PythonCall
Status ~/.julia/environments/v1.11/Project.toml
  [6099a3de] PythonCall v0.9.25

jco255, Jun 14 '25 13:06

if I trace the call stack

    ...
    elseif !isempty(args)
        args_ = pytuple_fromiter(args)
        ans = pycallargs(f, args_)
        pydel!(args_)     <------   this is builtins.jl:244
        ans
    else

... which calls pydel! below

    ptr = getptr(x)
    if ptr != C.PyNULL
        C.Py_DecRef(ptr)      <----  this is Py.jl:114, leads to segfault
        setptr!(x, C.PyNULL)
    end
    push!(PYNULL_CACHE, x)
    return
end

pydel! has comments with the word "DANGER!" and is described as including an optimization to reuse Py objects.

  1. I'll try to reason through this ref-counting, but it is not trivial.
  2. Does anyone know of a band-aid, maybe one that disables the optimizations but still ensures validity?
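
To make the "DANGER!" part concrete, here is a toy model of the handle-recycling pattern in the snippet above. It is plain Python, not PythonCall's actual code: `Handle`, `new_handle`, `refcounts`, and `null_cache` are hypothetical stand-ins invented for illustration. It only shows why a recycling cache is safe solely when nothing else still holds the handle passed to `pydel`.

```python
PYNULL = 0          # stand-in for a null object pointer
refcounts = {}      # fake object table: address -> reference count
null_cache = []     # recycled empty handles, playing the role of PYNULL_CACHE


class Handle:
    """Hypothetical wrapper that owns one reference to a fake object."""

    def __init__(self):
        self.ptr = PYNULL


def new_handle(addr):
    # Reuse an empty handle if one is cached, otherwise allocate a new one.
    h = null_cache.pop() if null_cache else Handle()
    h.ptr = addr
    refcounts[addr] = refcounts.get(addr, 0) + 1
    return h


def pydel(h):
    # Same shape as the snippet above: drop the reference,
    # null the handle, and push it into the cache for reuse.
    if h.ptr != PYNULL:
        refcounts[h.ptr] -= 1
        h.ptr = PYNULL
    null_cache.append(h)


args = new_handle(0x10)
stale = args              # some other code keeps the same handle around
pydel(args)               # the owner declares it is done with the handle
other = new_handle(0x20)  # the cached handle is handed out again

print(stale is other)     # True: the stale alias now refers to a different object
print(hex(stale.ptr))     # 0x20
```

In this model nothing crashes, because the "objects" are just dictionary entries. With real pointers, a stale alias like this could plausibly lead to a reference being released twice or to the wrong object being released.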

jco255, Jun 18 '25 12:06

if I run in gdb it shows

Thread 1 "julia" received signal SIGSEGV, Segmentation fault.
0x00007fff9f0411f8 in subtype_dealloc () from /usr/lib64/libpython3.11.so.1.0
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-225.el8.x86_64 libffi-3.1-24.el8.x86_64 mpdecimal-2.5.1-3.el8.x86_64 python3.11-libs-3.11.11-1.el8_10.x86_64
(gdb) bt
#0 0x00007fff9f0411f8 in subtype_dealloc () from /usr/lib64/libpython3.11.so.1.0
#1 0x00007fff9efea45a in tupledealloc () from /usr/lib64/libpython3.11.so.1.0
#2 0x00007ffff5f0f034 in ?? ()
#3 0x00007fffa5afb9a0 in jl_system_image_data () from /home/wsl2user/.julia/compiled/v1.11/PythonCall/WdXsa_pPI1g.so
#4 0x00007fffffffca70 in ?? ()
#5 0x00007fffa64eb130 in ?? () from /home/wsl2user/.julia/compiled/v1.11/PythonCall/WdXsa_pPI1g.so
#6 0x00007fffea9fc080 in ?? ()
#7 0x00007fffffffca60 in ?? ()
#8 0x00007ffff6060a8b in jl_f_finalizer (F=, args=0x7fffebfa87b0, nargs=) at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/builtins.c:2031

does that mean the python object was decref'd too many times?
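
For reference, a small Python sketch of what a balanced incref/decref pair looks like from the CPython side. It uses `ctypes.pythonapi` to call the C-level `Py_IncRef` and `Py_DecRef` and reads the count with `sys.getrefcount`. It deliberately does not perform an extra decref, since that would corrupt the interpreter. It is an illustration of the bookkeeping only, not a diagnosis of this crash.

```python
import ctypes
import sys

# Declare the C signatures so ctypes passes the object pointer correctly.
ctypes.pythonapi.Py_IncRef.argtypes = [ctypes.py_object]
ctypes.pythonapi.Py_IncRef.restype = None
ctypes.pythonapi.Py_DecRef.argtypes = [ctypes.py_object]
ctypes.pythonapi.Py_DecRef.restype = None

obj = object()
before = sys.getrefcount(obj)

ctypes.pythonapi.Py_IncRef(obj)   # take one extra reference
during = sys.getrefcount(obj)

ctypes.pythonapi.Py_DecRef(obj)   # give it back: the pair is balanced
after = sys.getrefcount(obj)

print(during - before)  # 1
print(after - before)   # 0
```

If a second `Py_DecRef` were issued here without a matching `Py_IncRef`, the count would drop below what the remaining owners expect, and the object could be freed while still referenced. A crash inside a dealloc function is consistent with that kind of imbalance, though the backtrace alone does not prove it.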

jco255, Jun 19 '25 18:06

A Solution

# rp1 is the Python callable under test (the MWE above calls it f1)
pd = pydict(params)
for i in 1:10_000_000
    rp1(pd)
end

The first version above wraps the Julia Dict in Python as a juliacall.DictValue, while this version converts it to a Python dict first. ~~fwiw I can't get this latter one to crash~~ This leads to a memory leak.

jco255, Jun 20 '25 01:06

Ok new approach. We were making a bazillion pytuples and DictValues every time we called the function. Instead we can just cache these. Maybe this is the short-term (and possibly long-term) solution -- update raw array values but don't make a lot of intermediate structures.

pd = pydict(params)
pt = pytuple((pd,))

for i in 1:10_000_000
    PythonCall.Core.pycallargs(rp1, pt)
end

jco255, Jun 21 '25 14:06

Thanks for the PR. I believe the issue was just fixed by #618 which is now merged. Your reproducer crashes for me before the PR but doesn't after. Can you verify the issue is fixed for you please? You can install the dev version of PythonCall like

pkg> add PythonCall#main

cjdoris, Jul 01 '25 19:07

bravo.

do you have a lot more going on than your earlier less-getptr branch? the first thing i did was try it (96bd22 or 2c2c95a "double-check the cache", not sure which). but this today seems solid af. if you're at juliacon i'll buy all the beers you want

jco255, Jul 02 '25 12:07

I added a few Base.GC.@preserves around places where we were still using getptr in an unsafe way.
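
For readers wondering why an unpreserved `getptr` is a problem, here is a toy model in plain Python. It is not Julia and not PythonCall's implementation: `Wrapper`, `decref`, `refcounts`, and this `getptr` are hypothetical names made up for the sketch. The wrapper owns one reference and releases it from a finalizer, so a raw address taken from it is only valid while the wrapper itself is kept alive.

```python
import gc
import weakref

refcounts = {}  # fake object table: address -> reference count


def decref(addr):
    # Drop one reference; free the fake object when the count hits zero.
    refcounts[addr] -= 1
    if refcounts[addr] == 0:
        del refcounts[addr]


class Wrapper:
    """Hypothetical owner of one reference; releases it when collected."""

    def __init__(self, addr):
        self.addr = addr
        refcounts[addr] = refcounts.get(addr, 0) + 1
        # The finalizer captures only the address, never the wrapper itself.
        weakref.finalize(self, decref, addr)


def getptr(w):
    return w.addr


def unsafe_use():
    w = Wrapper(0x1000)
    p = getptr(w)
    del w          # the wrapper becomes unreachable while p is still in use
    gc.collect()   # make sure the finalizer has run
    return p in refcounts


def preserved_use():
    w = Wrapper(0x2000)
    p = getptr(w)
    gc.collect()
    alive = p in refcounts  # w is still referenced, so the object survives
    del w                   # release only after the raw address is no longer needed
    return alive


print(unsafe_use())     # False: the address outlived its owner
print(preserved_use())  # True: the owner was kept alive across the use
```

In Julia the role of "keep `w` referenced until the raw pointer is done" is played by `GC.@preserve`, since the compiler is otherwise free to treat the wrapper as dead once its pointer has been extracted.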

Not at JuliaCon I'm afraid, wrong continent!

Glad it's working for you.

cjdoris, Jul 02 '25 12:07