pythonnet Random Access Violations on shutdown

Environment

Pythonnet version: 3.0.0.post1
Python version: 3.8.10
Operating System: Windows 11
.NET Runtime: .NET Framework 472

Details

Describe what you were trying to get done.

I am getting random access violations during shutdown in pythonnet. The issue can be re-produced with a C# function as simple as

public static int Add(int a, int b) => a + b;

I get the access violation; however, it is not deterministic at all. Some times, it succeeds as expected, other times I get this. Note that I am upgrading from pythonnet 2.5.2 to pythonnet 3.0.0.post1 so this code used to work. All I am doing is invoking this code from python. I am running this python code that uses pythonnet in pytest. I haven't seen this when we were using pythonnet 2.5.2.

If there was a crash, please include the traceback here.

Unhandled Exception: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at Python.Runtime.Runtime.PyObject_TYPE(BorrowedReference op)
   at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
   at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
   at Python.Runtime.Runtime.Shutdown()
   at Python.Runtime.PythonEngine.Shutdown()
   at Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)

Oct 18 '22 00:10 rohanjain101

Can you please give the exact example that produces the error for you (Python and C# code)?

Oct 18 '22 06:10 filmor

We're using pythonnet in many different scenarios, and this issue seems to be happening randomly. This is the simplest example I can find that (occasionally) reproduces the failure

using System;
using System.Linq;

namespace Test.Local
{
    public static partial class Operations
    {
        public static decimal? explicit_cast_float128_Float128_Scalar(decimal value) =>
            (decimal?)value;
    }
}

# Python Code:
import clr
clr.AddReference(r"Test.Local.dll")
from Test.Local import Opertations
from System import Decimal
Operations.explicit_cast_float128_Float128_Scalar(Decimal(123))

Note that we are not able to consistently repro the issue. We are running a series of unit tests using pytest, and every run, we get different tests failing with this same stack trace. Is there something specific we could be doing in our code that may cause this crash in the new release? I ask because we're getting this issue when upgrading to pythonnet3.0.0post1, we're currently on pythonnet2.5.2 and haven't had this issue.

Oct 18 '22 16:10 rohanjain101

One major difference is that older versions of Python.NET didn't even try to properly clean up in the end. How are your tests set up that you get test failures due to shutdowns in them? Are they spinning up Python sub-processes? Can the issue be reproduced by just taking the example that you gave and running it a few thousand times or is this just an example from your test suite?

Oct 18 '22 17:10 filmor

This is an example from the test suite, I'll try running it a few thousand times to see if I can repro. We're just using pytest, AFAIK, it creates one python process, runs a set of tests sequentially, then terminates the python process. What's interesting is that the tests themselves pass, but when the python process is shutting down, that's where we start getting these random access violations.

Do we need to have certain cleanup code in between each test that runs in the process?

Oct 18 '22 20:10 rohanjain101

Still working on a repro outside of our unit tests, is there anything from the stack trace that may help you identify if there's something we're doing that may cause this? Pytest runs all the tests in one python process I confirmed that, so the shutdown should only be happening once, shouldn't be spinning up python sub-processes.

Oct 19 '22 23:10 rohanjain101

Pythonnet version: 3.0.1 Python: 3.8 and 3.11.

I encountered a crash with the same traceback as the first post when running unittests with Python Development Mode activated (-X dev)

I was able to reproduce this crash outside of my tests with the following sample:

import clr
clr.AddReference("System.Windows.Forms")
from System.Windows.Forms import Form

def example():  
  form = Form()
  form.Show()

example()

Save the above as issue1977.py and run with python -X dev issue1977.py on Windows 10 results in an access violation except.

In my case I am not using System.Windows.Form but a different assembly and once I construct a class from that assembly after importing then I get this issue.

Python 3.7 doesn't have experience the same problem.

Dec 01 '22 08:12 donno

Thanks @donno . Are you able to repro this crash deterministically or do you have the same issue as us where it seems to be happening randomly? @filmor is there anything we can do to avoid this crash?

Dec 01 '22 23:12 rohanjain101

For me, the above snippet is deterministic., it results in the System.AccessViolationException every time.

Dec 02 '22 11:12 donno

The actual issue is not the dev mode as such, but that it uses the debug mode for the allocator. If you just want to use all of the other warnings, you can explicitly force the non-debug one by setting the PYTHONMALLOC=default environment variable. I don't see the exact same backtrace, though, so my hunch is that this is a different bug:

Unhandled Exception: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at System.Runtime.InteropServices.Marshal.ReadInt32(IntPtr ptr, Int32 ofs)
   at Python.Runtime.ManagedType.TryFreeGCHandle(BorrowedReference reflectedClrObject, BorrowedReference type)
   at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
   at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
   at Python.Runtime.Runtime.Shutdown()
   at Python.Runtime.PythonEngine.Shutdown()
   at Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)

Dec 02 '22 11:12 filmor

Ah, yes my traceback had the call to ReadInt32 rather than PyObject_TYPE from the first post.

Thanks for narrowing down my specific case and providing an alternative to still get the benefits I was aiming for with the dev mode.

Dec 03 '22 13:12 donno

@filmor I tried disabling the PYTHONMALLOC default but still get it (see below) Should I do something before the fix is released to avoid the issue?

Dec 06 '22 16:12 maxcapodi78

@filmor Ok I've realized that settings PYTHONMALLOC=malloc made the error disappear? So, maybe should be used this instead of default?

Dec 06 '22 17:12 maxcapodi78

None of these should be necessary, but running Python.NET with anything but the default allocation is undertested (as in, not at all). My guess is that somewhere in one of these debug hooks, some .NET code is evaluated and throws (across FFI boundaries). I'll have some time to debug this this weekend.

Dec 06 '22 20:12 filmor

@filmor just wondering if there's something we should do in the mean time to mitigate this issue?

Jan 06 '23 04:01 rohanjain101

I confirm I am also experiencing this error:

Unhandled Exception: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at Python.Runtime.Runtime.PyObject_TYPE(BorrowedReference op)
   at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
   at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
   at Python.Runtime.Runtime.Shutdown()
   at Python.Runtime.PythonEngine.Shutdown()
   at Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)

Unfortunately I cannot provide you with the C# code that triggers the problem because I did not write it myself. My Python code that calls that code is very much similar to what @donno posted above.

I am running python 3.9.16 and pythonnet 3.0.1. The error does not manifest itself in version 2.5.2.

Feb 07 '23 18:02 jgmarcel

@filmor I was wondering if there's any update on this issue or if there's anything we can do to work around this issue. Would you suggest we force using the default allocator?

Mar 10 '23 20:03 rohanjain101

Should we maintain any hopes that this issue will be addressed anytime soon?

May 05 '23 19:05 jgmarcel

@jgmarcel we can not fix the issue we can not reproduce. Does your scenario prohibit use of malloc allocator as suggested above?

May 05 '23 19:05 lostmsu

@jgmarcel we can not fix the issue we can not reproduce. Does your scenario prohibit use of malloc allocator as suggested above?

Unfortunately it did not work:

Failed to shutdown pythonnet: System.NullReferenceException: Object reference not set to an instance of an object.
   at Python.Runtime.BorrowedReference.DangerousGetAddress()
   at Python.Runtime.ManagedType.TryFreeGCHandle(BorrowedReference reflectedClrObject, BorrowedReference type)
   at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
   at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
   at Python.Runtime.Runtime.Shutdown()
   at Python.Runtime.PythonEngine.Shutdown()
   at Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)
   at Python.Runtime.BorrowedReference.DangerousGetAddress()
   at Python.Runtime.ManagedType.TryFreeGCHandle(BorrowedReference reflectedClrObject, BorrowedReference type)
   at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
   at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
   at Python.Runtime.Runtime.Shutdown()
   at Python.Runtime.PythonEngine.Shutdown()
   at Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "C:\Users\prodbwgi\Anaconda3\envs\bwgi\lib\site-packages\pythonnet\__init__.py", line 158, in unload
    raise RuntimeError("Failed to call Python.NET shutdown")
RuntimeError: Failed to call Python.NET shutdown

May 05 '23 20:05 jgmarcel

The error above was preceded by this one:

Unhandled Exception: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at Python.Runtime.Runtime.PyObject_TYPE(BorrowedReference op)
   at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
   at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
   at Python.Runtime.Runtime.Shutdown()
   at Python.Runtime.PythonEngine.Shutdown()
   at Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)

I guess I am stuck to version 2.5.2?

May 05 '23 20:05 jgmarcel

@jgmarcel If you can't provide a reproducible example, we can't debug. It's as easy as that. I could reproduce the example from the start of the thread and figured out that it had something to do with allocator, using the malloc allocator made /that/ problem go away. It's entirely possible that your problem, while exhibiting a similar backtrace, is different.

May 06 '23 08:05 filmor

@filmor Sure thing. The C# code I am calling from Python is distributed in a DLL shipped with a commercial product, so I am afraid I cannot include it in my report. My Python code, on the other hand, is very simple and just calls the C# code as instructed by the software vendor. Is there anything else (an output, some profiling results, etc.) I could provide you with that would be useful?

May 08 '23 00:05 jgmarcel

This is a use-after-free; and switching the allocator only changes the likelyhood of the problem resulting in a crash.

I've debugged the case with -Xdev a bit, and what's happening is that a Python object is getting freed because its last reference is released by:

Python.Runtime.Runtime.XDecref(Python.Runtime.StolenReference)
Python.Runtime.Finalizer.DisposeAll()
Python.Runtime.Runtime.TryCollectingGarbage(Int32, Boolean)
Python.Runtime.Runtime.Shutdown()
Python.Runtime.PythonEngine.Shutdown()

Due to the use of a debug allocator, this overwrites the ob_type field of the freed object with a 0xdddddddd bit pattern.

After that, NullGCHandles(CLRObject.reflectedObjects); is called. But the reflectedObjects still contain a pointer to the deleted PyObject, so PyObject_TYPE is a use-after-free and retrieves the invalid PyTypeObject* 0xdddddddd. Attempting to dereference that to get at the tp_clr_inst field results in the crash:

   at System.Runtime.InteropServices.Marshal.ReadInt32(IntPtr ptr, Int32 ofs)
   at Python.Runtime.ManagedType.TryFreeGCHandle(BorrowedReference reflectedClrObject, BorrowedReference type)
   at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
   at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
   at Python.Runtime.Runtime.Shutdown()
   at Python.Runtime.PythonEngine.Shutdown()

With the default allocator (without -Xdev; i.e. pymalloc), I've also observed cases where PyObject_TYPE crashes directly, but those are more tricky to reproduce. That likely happens because the freed object was the last in its arena, so pymalloc returned the memory to the system. But it is likely that the same bug is the cause for both.

Switching to PYTHONMALLOC=malloc just means the memory is more likely to stay around long enough to avoid the crash, making the use-after-free somewhat harmless. But with the default pymalloc allocator, it's likely to cause real crashes; and with the debug allocator it's basically guaranteed to crash.

May 11 '23 13:05 dgrunwald

Would this be an appropriate fix?

diff --git a/src/runtime/Finalizer.cs b/src/runtime/Finalizer.cs
index 713564f..d6bbc3b 100644
--- a/src/runtime/Finalizer.cs
+++ b/src/runtime/Finalizer.cs
@@ -229,6 +229,7 @@ namespace Python.Runtime

                         IntPtr copyForException = obj.PyObj;
                         Runtime.XDecref(StolenReference.Take(ref obj.PyObj));
+                        CLRObject.reflectedObjects.Remove(copyForException);
                         collected++;
                         try
                         {
@@ -236,7 +237,7 @@ namespace Python.Runtime
                         }
                         catch (Exception e)
                         {
-                            HandleFinalizationException(obj.PyObj, e);
+                            HandleFinalizationException(copyForException, e);
                         }
                     }

The fix is simple: remove a possible reference to the currently disposed object from the reflectedObjects set. If the item does not exist the Remove is a no-op.

I am not entirely sure why a variable copyForException is not used in the catch block.

May 15 '23 09:05 siegfriedpammer

@siegfriedpammer could you create a PR with this fix? We have found a scenario on python 3.11 that hits this crash deterministically and could test that this fix resolves this issue.

May 18 '23 20:05 rohanjain101

I think, the above fix does only work if a PyObject only has one reference left. For my own use of the library I have just removed the offending call to NullGCHandles because the process using the library is cleaned up by the OS anyway. The .NET runtime dies when the Python runtime dies when the process dies.

But I hope this is fixed at some point by someone more knowledgeable about pythonnet so I don't have to patch the library again if I ever need to update. Thanks!

May 19 '23 04:05 siegfriedpammer

I see the same issue when I am running robot framework tests that use 3.0.1 version of pythonnet. When I downgrade I don't see the issue anymore. Will this be fixed soon?

Jun 02 '23 11:06 johnpp143

But the reflectedObjects still contain a pointer to the deleted PyObject

So this is not a trivial issue, and we need a reproducible example to debug on our side.

The thing is: if PyObject is alive and is referenced from .NET, it is present in reflectedObjects and also it has a GC handle pointing back to .NET, which means that the .NET object can not end up in the finalization queue, because there is a GC handle that references it.

The only exception from this rule I am aware of is Python types that are derived from .NET types. Under certain circumstances they change the GC handle type to weak reference, which allows it to be collected and placed into finalization queue. See https://github.com/pythonnet/pythonnet/blob/a1cd73fa5bd93962737449feea3bd425f37c42ba/src/runtime/Types/ClassDerived.cs#L87

There's a force break loops bit in Shutdown, but when it nulls GC handles it also cleans the reflectedObjects collection.

Jun 02 '23 18:06 lostmsu

The example from https://github.com/pythonnet/pythonnet/issues/1977#issuecomment-1333392419 is perfectly reproducible for us when using the debug allocator (-Xdev).

Jun 03 '23 06:06 dgrunwald

@lostmsu @filmor is there any reason for a manual shutdown/cleanup in the first place instead of relying on the OS as older versions of pythonnet did?

Jun 03 '23 18:06 rohanjain101