Random Access Violations on shutdown
Environment
- Pythonnet version: 3.0.0.post1
- Python version: 3.8.10
- Operating System: Windows 11
- .NET Runtime: .NET Framework 472
Details
-
Describe what you were trying to get done.
I am getting random access violations during shutdown in pythonnet. The issue can be re-produced with a C# function as simple as
public static int Add(int a, int b) => a + b;
I get the access violation; however, it is not deterministic at all. Some times, it succeeds as expected, other times I get this. Note that I am upgrading from pythonnet 2.5.2 to pythonnet 3.0.0.post1 so this code used to work. All I am doing is invoking this code from python. I am running this python code that uses pythonnet in pytest. I haven't seen this when we were using pythonnet 2.5.2.
- If there was a crash, please include the traceback here.
Unhandled Exception: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at Python.Runtime.Runtime.PyObject_TYPE(BorrowedReference op)
at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
at Python.Runtime.Runtime.Shutdown()
at Python.Runtime.PythonEngine.Shutdown()
at Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)
Can you please give the exact example that produces the error for you (Python and C# code)?
We're using pythonnet in many different scenarios, and this issue seems to be happening randomly. This is the simplest example I can find that (occasionally) reproduces the failure
using System;
using System.Linq;
namespace Test.Local
{
public static partial class Operations
{
public static decimal? explicit_cast_float128_Float128_Scalar(decimal value) =>
(decimal?)value;
}
}
# Python Code:
import clr
clr.AddReference(r"Test.Local.dll")
from Test.Local import Opertations
from System import Decimal
Operations.explicit_cast_float128_Float128_Scalar(Decimal(123))
Note that we are not able to consistently repro the issue. We are running a series of unit tests using pytest, and every run, we get different tests failing with this same stack trace. Is there something specific we could be doing in our code that may cause this crash in the new release? I ask because we're getting this issue when upgrading to pythonnet3.0.0post1, we're currently on pythonnet2.5.2 and haven't had this issue.
One major difference is that older versions of Python.NET didn't even try to properly clean up in the end. How are your tests set up that you get test failures due to shutdowns in them? Are they spinning up Python sub-processes? Can the issue be reproduced by just taking the example that you gave and running it a few thousand times or is this just an example from your test suite?
This is an example from the test suite, I'll try running it a few thousand times to see if I can repro. We're just using pytest, AFAIK, it creates one python process, runs a set of tests sequentially, then terminates the python process. What's interesting is that the tests themselves pass, but when the python process is shutting down, that's where we start getting these random access violations.
Do we need to have certain cleanup code in between each test that runs in the process?
Still working on a repro outside of our unit tests, is there anything from the stack trace that may help you identify if there's something we're doing that may cause this? Pytest runs all the tests in one python process I confirmed that, so the shutdown should only be happening once, shouldn't be spinning up python sub-processes.
Pythonnet version: 3.0.1 Python: 3.8 and 3.11.
I encountered a crash with the same traceback as the first post when running unittests with Python Development Mode activated (-X dev)
I was able to reproduce this crash outside of my tests with the following sample:
import clr
clr.AddReference("System.Windows.Forms")
from System.Windows.Forms import Form
def example():
form = Form()
form.Show()
example()
Save the above as issue1977.py and run with python -X dev issue1977.py on Windows 10 results in an access violation except.
In my case I am not using System.Windows.Form but a different assembly and once I construct a class from that assembly after importing then I get this issue.
Python 3.7 doesn't have experience the same problem.
Thanks @donno . Are you able to repro this crash deterministically or do you have the same issue as us where it seems to be happening randomly? @filmor is there anything we can do to avoid this crash?
For me, the above snippet is deterministic., it results in the System.AccessViolationException every time.
The actual issue is not the dev mode as such, but that it uses the debug mode for the allocator. If you just want to use all of the other warnings, you can explicitly force the non-debug one by setting the PYTHONMALLOC=default environment variable. I don't see the exact same backtrace, though, so my hunch is that this is a different bug:
Unhandled Exception: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at System.Runtime.InteropServices.Marshal.ReadInt32(IntPtr ptr, Int32 ofs)
at Python.Runtime.ManagedType.TryFreeGCHandle(BorrowedReference reflectedClrObject, BorrowedReference type)
at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
at Python.Runtime.Runtime.Shutdown()
at Python.Runtime.PythonEngine.Shutdown()
at Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)
Ah, yes my traceback had the call to ReadInt32 rather than PyObject_TYPE from the first post.
Thanks for narrowing down my specific case and providing an alternative to still get the benefits I was aiming for with the dev mode.
@filmor I tried disabling the PYTHONMALLOC default but still get it (see below) Should I do something before the fix is released to avoid the issue?

@filmor Ok I've realized that settings PYTHONMALLOC=malloc made the error disappear? So, maybe should be used this instead of default?
None of these should be necessary, but running Python.NET with anything but the default allocation is undertested (as in, not at all). My guess is that somewhere in one of these debug hooks, some .NET code is evaluated and throws (across FFI boundaries). I'll have some time to debug this this weekend.
@filmor just wondering if there's something we should do in the mean time to mitigate this issue?
I confirm I am also experiencing this error:
Unhandled Exception: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at Python.Runtime.Runtime.PyObject_TYPE(BorrowedReference op)
at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
at Python.Runtime.Runtime.Shutdown()
at Python.Runtime.PythonEngine.Shutdown()
at Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)
Unfortunately I cannot provide you with the C# code that triggers the problem because I did not write it myself. My Python code that calls that code is very much similar to what @donno posted above.
I am running python 3.9.16 and pythonnet 3.0.1. The error does not manifest itself in version 2.5.2.
@filmor I was wondering if there's any update on this issue or if there's anything we can do to work around this issue. Would you suggest we force using the default allocator?
Should we maintain any hopes that this issue will be addressed anytime soon?
@jgmarcel we can not fix the issue we can not reproduce. Does your scenario prohibit use of malloc allocator as suggested above?
@jgmarcel we can not fix the issue we can not reproduce. Does your scenario prohibit use of
mallocallocator as suggested above?
Unfortunately it did not work:
Failed to shutdown pythonnet: System.NullReferenceException: Object reference not set to an instance of an object.
at Python.Runtime.BorrowedReference.DangerousGetAddress()
at Python.Runtime.ManagedType.TryFreeGCHandle(BorrowedReference reflectedClrObject, BorrowedReference type)
at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
at Python.Runtime.Runtime.Shutdown()
at Python.Runtime.PythonEngine.Shutdown()
at Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)
at Python.Runtime.BorrowedReference.DangerousGetAddress()
at Python.Runtime.ManagedType.TryFreeGCHandle(BorrowedReference reflectedClrObject, BorrowedReference type)
at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
at Python.Runtime.Runtime.Shutdown()
at Python.Runtime.PythonEngine.Shutdown()
at Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "C:\Users\prodbwgi\Anaconda3\envs\bwgi\lib\site-packages\pythonnet\__init__.py", line 158, in unload
raise RuntimeError("Failed to call Python.NET shutdown")
RuntimeError: Failed to call Python.NET shutdown
The error above was preceded by this one:
Unhandled Exception: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at Python.Runtime.Runtime.PyObject_TYPE(BorrowedReference op)
at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
at Python.Runtime.Runtime.Shutdown()
at Python.Runtime.PythonEngine.Shutdown()
at Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)
I guess I am stuck to version 2.5.2?
@jgmarcel If you can't provide a reproducible example, we can't debug. It's as easy as that. I could reproduce the example from the start of the thread and figured out that it had something to do with allocator, using the malloc allocator made /that/ problem go away. It's entirely possible that your problem, while exhibiting a similar backtrace, is different.
@filmor Sure thing. The C# code I am calling from Python is distributed in a DLL shipped with a commercial product, so I am afraid I cannot include it in my report. My Python code, on the other hand, is very simple and just calls the C# code as instructed by the software vendor. Is there anything else (an output, some profiling results, etc.) I could provide you with that would be useful?
This is a use-after-free; and switching the allocator only changes the likelyhood of the problem resulting in a crash.
I've debugged the case with -Xdev a bit, and what's happening is that a Python object is getting freed because its last reference is released by:
Python.Runtime.Runtime.XDecref(Python.Runtime.StolenReference)
Python.Runtime.Finalizer.DisposeAll()
Python.Runtime.Runtime.TryCollectingGarbage(Int32, Boolean)
Python.Runtime.Runtime.Shutdown()
Python.Runtime.PythonEngine.Shutdown()
Due to the use of a debug allocator, this overwrites the ob_type field of the freed object with a 0xdddddddd bit pattern.
After that, NullGCHandles(CLRObject.reflectedObjects); is called. But the reflectedObjects still contain a pointer to the deleted PyObject, so PyObject_TYPE is a use-after-free and retrieves the invalid PyTypeObject* 0xdddddddd. Attempting to dereference that to get at the tp_clr_inst field results in the crash:
at System.Runtime.InteropServices.Marshal.ReadInt32(IntPtr ptr, Int32 ofs)
at Python.Runtime.ManagedType.TryFreeGCHandle(BorrowedReference reflectedClrObject, BorrowedReference type)
at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
at Python.Runtime.Runtime.Shutdown()
at Python.Runtime.PythonEngine.Shutdown()
With the default allocator (without -Xdev; i.e. pymalloc), I've also observed cases where PyObject_TYPE crashes directly, but those are more tricky to reproduce. That likely happens because the freed object was the last in its arena, so pymalloc returned the memory to the system. But it is likely that the same bug is the cause for both.
Switching to PYTHONMALLOC=malloc just means the memory is more likely to stay around long enough to avoid the crash, making the use-after-free somewhat harmless. But with the default pymalloc allocator, it's likely to cause real crashes; and with the debug allocator it's basically guaranteed to crash.
Would this be an appropriate fix?
diff --git a/src/runtime/Finalizer.cs b/src/runtime/Finalizer.cs
index 713564f..d6bbc3b 100644
--- a/src/runtime/Finalizer.cs
+++ b/src/runtime/Finalizer.cs
@@ -229,6 +229,7 @@ namespace Python.Runtime
IntPtr copyForException = obj.PyObj;
Runtime.XDecref(StolenReference.Take(ref obj.PyObj));
+ CLRObject.reflectedObjects.Remove(copyForException);
collected++;
try
{
@@ -236,7 +237,7 @@ namespace Python.Runtime
}
catch (Exception e)
{
- HandleFinalizationException(obj.PyObj, e);
+ HandleFinalizationException(copyForException, e);
}
}
The fix is simple: remove a possible reference to the currently disposed object from the reflectedObjects set. If the item does not exist the Remove is a no-op.
I am not entirely sure why a variable copyForException is not used in the catch block.
@siegfriedpammer could you create a PR with this fix? We have found a scenario on python 3.11 that hits this crash deterministically and could test that this fix resolves this issue.
I think, the above fix does only work if a PyObject only has one reference left. For my own use of the library I have just removed the offending call to NullGCHandles because the process using the library is cleaned up by the OS anyway. The .NET runtime dies when the Python runtime dies when the process dies.
But I hope this is fixed at some point by someone more knowledgeable about pythonnet so I don't have to patch the library again if I ever need to update. Thanks!
I see the same issue when I am running robot framework tests that use 3.0.1 version of pythonnet. When I downgrade I don't see the issue anymore. Will this be fixed soon?
But the reflectedObjects still contain a pointer to the deleted PyObject
So this is not a trivial issue, and we need a reproducible example to debug on our side.
The thing is: if PyObject is alive and is referenced from .NET, it is present in reflectedObjects and also it has a GC handle pointing back to .NET, which means that the .NET object can not end up in the finalization queue, because there is a GC handle that references it.
The only exception from this rule I am aware of is Python types that are derived from .NET types. Under certain circumstances they change the GC handle type to weak reference, which allows it to be collected and placed into finalization queue. See https://github.com/pythonnet/pythonnet/blob/a1cd73fa5bd93962737449feea3bd425f37c42ba/src/runtime/Types/ClassDerived.cs#L87
There's a force break loops bit in Shutdown, but when it nulls GC handles it also cleans the reflectedObjects collection.
The example from https://github.com/pythonnet/pythonnet/issues/1977#issuecomment-1333392419 is perfectly reproducible for us when using the debug allocator (-Xdev).
@lostmsu @filmor is there any reason for a manual shutdown/cleanup in the first place instead of relying on the OS as older versions of pythonnet did?