
use getptr a lot less

Open cjdoris opened this issue 6 months ago • 15 comments

Defines an unsafe_convert rule for Py to C.PyPtr so we can pass Py directly to C functions.

Also defines incref(::Py) so we don't have to operate on pointers.

Generalises our faked C functions in extras.jl to also allow ::Py arguments in place of ::PyPtr.

Fixes #617 - and more generally removes a lot of places where the Julia GC could have freed a Py while we still have a pointer to the Python object, causing a read-after-free error.
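The changes described above could be sketched roughly like this. This is a minimal illustration, not the actual implementation: the `PyPtr`/`Py` definitions are simplified stand-ins for PythonCall.jl's internals, and the `libpython` call is shown only to indicate the shape of the pattern.

```julia
# Illustrative sketch only; the real PythonCall.jl definitions differ.
struct PyPtr
    ptr::Ptr{Cvoid}
end

mutable struct Py
    ptr::PyPtr
end

getptr(x::Py) = x.ptr

# With this rule, a `Py` can be passed directly wherever a ccall expects
# a `PyPtr`; ccall's conversion machinery also roots `x` for the duration
# of the call, so the GC cannot free it while the pointer is in use.
Base.unsafe_convert(::Type{PyPtr}, x::Py) = getptr(x)

# Operating on the wrapper instead of a raw pointer means there is no
# window where only the bare pointer is live.
incref(x::Py) = (ccall((:Py_IncRef, :libpython), Cvoid, (PyPtr,), x); x)
```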

cjdoris avatar May 28 '25 21:05 cjdoris

Fixes some local segfaults I was getting. Nice!

What is left before merging? Let me know if I can help. Eager to see if this fixes some PySR segfaults people have been reporting.

MilesCranmer avatar Jun 04 '25 16:06 MilesCranmer

What is left before merging?

Fix the segfault on Mac 😞 - I've not got a Mac to hand right now so debugging will be painful...

cjdoris avatar Jun 04 '25 16:06 cjdoris

@cjdoris you should try https://github.com/mxschmitt/action-tmate. It's pretty useful for when you don't have the OS that is failing

MilesCranmer avatar Jun 04 '25 17:06 MilesCranmer

I don't see the segfault btw. It just looks like a notebook error?

_________________________ pytest/test_nb.ipynb::Cell 0 _________________________
Notebook cell execution failed
Cell 0: Timeout of 2000 seconds exceeded while executing cell. Failed to interrupt kernel in 5 seconds, so failing without traceback.

Input:
# NBVAL_IGNORE_OUTPUT
import numpy as np
from juliacall import Main as jl

MilesCranmer avatar Jun 05 '25 18:06 MilesCranmer

Yeah, a more recent CI run had the same issue. I suspect some bug is causing undefined behaviour, which might cause segfaults or silently kill the kernel or ...

cjdoris avatar Jun 05 '25 20:06 cjdoris

Oh goodie it's non-deterministic. The Mac test just passed but Linux failed! But Mac did fail with Python 3.13.2 (was previously testing on latest Python 3.13.3). At least I might be able to reproduce this locally on Linux...

cjdoris avatar Jun 05 '25 20:06 cjdoris

I have completely reverted this branch (except for some trivial details in the CI workflow definition) and still get the errors, so these should be fixed first. The error seems to only occur on Mac on Python 3.13.

cjdoris avatar Jun 05 '25 21:06 cjdoris

Argh, Heisenbug! I added tmate to be able to observe it and it goes away.

Does anyone have a Mac they can debug this locally with? I just want to run the tests under gdb or something.

cjdoris avatar Jun 05 '25 21:06 cjdoris

I have a mac but don't see the issue myself - just tested. Perhaps it was an upstream dependency which is now fixed?

MilesCranmer avatar Jun 05 '25 22:06 MilesCranmer

Ah sorry let me try Python 3.13, i was using 3.12.

MilesCranmer avatar Jun 05 '25 22:06 MilesCranmer

Actually it does look like the macos test is failing? So you could ssh in now: https://github.com/JuliaPy/PythonCall.jl/actions/runs/15477354381/job/43576415671?pr=618

MilesCranmer avatar Jun 05 '25 22:06 MilesCranmer

I can reproduce the segfault: I see it on 3.13.0-3.13.2, but not on 3.13.3. So perhaps it was a Python bug that is now fixed?

MilesCranmer avatar Jun 05 '25 23:06 MilesCranmer

OK the current failures are due to https://github.com/JuliaLang/julia/issues/58171 so I guess we have another reproducer for that.

cjdoris avatar Jun 06 '25 11:06 cjdoris

I've run several long stress tests on Linux with Julia 1.11.5 & Python 3.13.3/3.13.4, which used to consistently crash before this PR. With the PR, I only get a consistent crash when finishing the stress test in one of the tests, and only on Ubuntu 24.04, not on Ubuntu 25.04. I also tried with Julia 1.12-beta4, but I get a crash when running the core of my Julia code (I need to investigate further):

setName at /cache/build/builder-amdci5-3/julialang/julia-release-1-dot-12/src/codegen.cpp:170 [inlined]
emit_pointerref at /cache/build/builder-amdci5-3/julialang/julia-release-1-dot-12/src/intrinsics.cpp:802 [inlined]
emit_intrinsic at /cache/build/builder-amdci5-3/julialang/julia-release-1-dot-12/src/intrinsics.cpp:1328
emit_call at /cache/build/builder-amdci5-3/julialang/julia-release-1-dot-12/src/codegen.cpp:5436
emit_expr at /cache/build/builder-amdci5-3/julialang/julia-release-1-dot-12/src/codegen.cpp:6379
emit_ssaval_assign at /cache/build/builder-amdci5-3/julialang/julia-release-1-dot-12/src/codegen.cpp:5866
emit_stmtpos at /cache/build/builder-amdci5-3/julialang/julia-release-1-dot-12/src/codegen.cpp:6168 [inlined]
emit_function at /cache/build/builder-amdci5-3/julialang/julia-release-1-dot-12/src/codegen.cpp:9348
jl_emit_code at /cache/build/builder-amdci5-3/julialang/julia-release-1-dot-12/src/codegen.cpp:9727
jl_emit_codeinst at /cache/build/builder-amdci5-3/julialang/julia-release-1-dot-12/src/codegen.cpp:9798
jl_emit_codeinst_to_jit_impl at /cache/build/builder-amdci5-3/julialang/julia-release-1-dot-12/src/jitlayers.cpp:761
jl_add_codeinst_to_jit at /cache/build/builder-amdci5-3/julialang/julia-release-1-dot-12/src/gf.c:2898
add_codeinsts_to_jit! at ./../usr/share/julia/Compiler/src/typeinfer.jl:1394

dpinol avatar Jun 06 '25 13:06 dpinol

@vchuravy do you think the jl_object_id__cold might be coming from something fixable on the PythonCall.jl side, or is it definitely something in the Julia source itself?

MilesCranmer avatar Jun 14 '25 20:06 MilesCranmer

@cjdoris do you think we could merge this? Even without fixing https://github.com/JuliaLang/julia/issues/58171 I think it's still useful to have these fixes.

In fact I do wonder if that issue isn't really a Julia bug (since we've never seen it anywhere besides the PythonCall.jl instance). Maybe it's due to how Python multithreading changed in 3.13 or something...

MilesCranmer avatar Jun 30 '25 01:06 MilesCranmer

do you think the jl_object_id__cold might be coming from something fixable on the PythonCall.jl side, or is it definitely something in the Julia source itself?

I would say there's a high likelihood that it is a misunderstanding between Python, PythonCall.jl, and Julia's memory semantics. So it's not at all clear that it is in the Julia source itself.

vchuravy avatar Jun 30 '25 08:06 vchuravy

OK I undid the debugging changes and committed the original PR with a few extra GC.@preserves around. Tests passing now but that might just be random chance, we'll see.
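For reference, the pattern those extra `GC.@preserve`s enforce looks something like this. This is a hypothetical sketch: `some_c_function` and the `getptr` helper stand in for PythonCall.jl internals.

```julia
function call_with_ptr(x::Py)
    # Without the preserve, Julia's GC may collect `x` after its last
    # direct use, even while a raw pointer derived from it is still live,
    # leading to the read-after-free errors this PR targets.
    GC.@preserve x begin
        ptr = getptr(x)  # raw PyPtr derived from x
        ccall(:some_c_function, Cvoid, (PyPtr,), ptr)
    end
end
```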

cjdoris avatar Jul 01 '25 11:07 cjdoris

In fact I do wonder if that issue isn't really a Julia bug (since we've never seen it anywhere besides the PythonCall.jl instance). Maybe it's due to how Python multithreading changed in 3.13 or something...

What changed?

cjdoris avatar Jul 01 '25 11:07 cjdoris

In fact I do wonder if that issue isn't really a Julia bug (since we've never seen it anywhere besides the PythonCall.jl instance). Maybe it's due to how Python multithreading changed in 3.13 or something...

What changed?

I assumed there were some internal changes to enable the GIL-free version: https://peps.python.org/pep-0703/. But I don't actually know if the regular GIL build had changes to enable this.

MilesCranmer avatar Jul 01 '25 11:07 MilesCranmer