Executing stored procedures using TVPs hard-crashes the Python GC in Python 3.11

Hello,

Executing stored procedures that use a TVP hard-crashes the Python GC in Python 3.11. We think the issue has something to do with memory being freed inside cursor.execute(sql, *parameters) and then freed a second time inside gc.collect().

Environment

  • Python: 3.11 (it also seems to occur in 3.13)
  • pyodbc: 5.3.0
  • DB: MS SQL Server 2022
  • Driver: Microsoft ODBC Driver 18 for SQL Server (18.5.1.1)

After writing data via a stored procedure that uses a TVP, the Python process crashes while running gc.collect(). params contains regular parameters and one TVP.

Example:

conn = pyodbc.connect(connection_string, attrs_before={1256: bytearray(token_struct)})
sql = f'EXEC [{schema}].[{procedure}]'

# params maps parameter names to values; the TVP value is a tuple of row tuples
with conn.cursor() as cursor:
    if params:
        sql += ' ' + ', '.join(f'@{p}=?' for p in params.keys())
        cursor.execute(sql, tuple(params.values()))
    else:
        cursor.execute(sql)

gc.collect(0)
gc.collect(1)
gc.collect(2)

CPython crashes inside gc.collect(2) without raising any exception.

Result

Process finished with exit code -1073741819 (0xC0000005)

When the GC is enabled, the process crashes at seemingly random points, presumably whenever garbage collection is triggered. With gc.disable() it doesn't crash.

After some debugging we found that it only happened with rows that contain None values in columns declared as nullable float or decimal.

If you need any more information please let me know.

Sander-PGGM avatar Nov 12 '25 08:11 Sander-PGGM

Can you create a minimal reproducible example that reliably re-creates the issue? The CREATE TYPE and CREATE PROCEDURE statements don't necessarily need to be embedded in the Python code; they can be in a separate .sql file if they are lengthy.

gordthompson avatar Nov 12 '25 19:11 gordthompson

The following minimal reproducible example seems to work:

Python version: 3.9.13

import pyodbc
import gc
import faulthandler
faulthandler.enable()

server = "YOUR-SERVER-HERE"
database = "YOUR-DATABASE-HERE"
connection_string = (
    "Driver={ODBC Driver 18 for SQL Server};"
    f"Server={server};"
    f"Database={database};"
    "Authentication=ActiveDirectoryInteractive;"
)

amount_of_tests = 0
gc.disable()

while True:
    print(f"Test {amount_of_tests}.")
    try:
        connection = pyodbc.connect(connection_string)
        print("Connection established successfully.")
    except pyodbc.Error as e:
        print("Error connecting to database:", e)
        raise  # without this, `connection` would be undefined below

    tvp: list[tuple] = [
        (1234, 'AA', 'Name_should_be_here', 2003, 1, 3012, 12, 0.35520718, 1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 64, 1.0, 1.0, 1.0, 1.0, 1.0, 0.35, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, -0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0),
        (4321, 'ZZ', 'Name_should_be_here', None, None, 3012, 12, 0.35520718, 1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 64, 1.0, 1.0, 1.0, 1.0, 1.0, 0.35, None, None, None, None, None, None, None, None, 0.0, 0.0, -0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
    ]
    
    params = ("Correlatie_id_voorbeeld", 1188, tvp)
    with connection.cursor() as cursor:
        cursor.execute('EXEC [dbo].[tvpIssueDebug] @stringId=?, @integerId=?, @Data=?', params)

    gc.collect(0)
    gc.collect(1)
    gc.collect(2)
    amount_of_tests += 1

Result

We observe the following result consistently. Note that the crash appears to be caused by undefined memory behaviour in the C++ layer; therefore the time after which the crash occurs (and the number of successful cursor.execute calls) varies.

Furthermore, Python debug flags seem to postpone or prevent the crash in some cases.

Test 0.
Connection established successfully.
Test 1.
Connection established successfully.
Windows fatal exception: access violation

Current thread 0x00001200 (most recent call first):
  File "L:\Repos\OVZ_shutdown\example.py", line 38 in <module>

You can find the CREATE TYPE and CREATE PROCEDURE in the following file: test_debug_pyodbc.sql

Some final remarks:

  • Smaller table types do not seem to have the same problem.
  • We observed this behaviour for float and decimal columns.
  • Removing the gc.collect calls does not prevent the crash, but it makes the crash less reliably reproducible.

If you have any questions, please message me. Thank you for your time.

Sander-PGGM avatar Nov 13 '25 11:11 Sander-PGGM

Reproduced with

  • Python: 3.13
  • pyodbc: 5.3.0
  • DB: MS SQL Server 2019
  • Driver: Microsoft ODBC Driver 18 for SQL Server (18.5.2.1)

gordthompson avatar Nov 13 '25 15:11 gordthompson

Hello, thank you for maintaining this project and for your time.

We’ve been investigating the issue and suspect it may be related to a double free in cursor.cpp, within the block at lines 840–884. Based on our review, this section handles cleanup logic for statement handles and bound columns, and it appears that under certain conditions the same pointer might be released twice. This could explain the intermittent crashes we’re observing.

Could you provide an estimated timeline for when this might be addressed? We’re interested in contributing, but we’d appreciate some guidance on the best approach.

  • Are there any existing safeguards or patterns in the codebase we should align with?
  • Would adding additional reference counting or null checks be acceptable in this context?

Thank you for your support and for maintaining pyodbc!

Sander-PGGM avatar Nov 18 '25 09:11 Sander-PGGM

Hi @Sander-PGGM and @gordthompson,

Just wanted to share some information that might be relevant to this issue, although it may turn out to be a separate, secondary issue. I think it is important to share given the headache it caused me:

I have developed a Python-based multi-threaded application that runs inside a commercial GIS (Geographic Information System) from ESRI (www.esri.com) called ArcGIS Pro, which uses configurable Conda environments. Python is used in ArcGIS Pro to automate geoprocessing workflows on cartographic data. I use it to process OpenStreetMap data on a large scale.

My Python code uses concurrent.futures.ThreadPoolExecutor to parallelize SQL queries against PostgreSQL, using either pyodbc or psycopg2 as the database connector to speed up processing, which works really well with both.

This code had been running fine in previous versions of ArcGIS Pro, up to a version that shipped Python 3.11.

Now the latest release of ArcGIS Pro had a number of changes:

  • It switched to Python 3.13 and associated updated Conda packages for pyodbc and psycopg2.
  • It introduced a new way of running Python geoprocessing workflows. Previously, only running on a background geoprocessing thread was supported (a thread managed by ArcGIS Pro itself, not my Python code), allowing the user to keep interacting with the application while a long-running geoprocessing workflow ran in the background. The new version of the software introduced geoprocessing on a "foreground thread", which is faster but blocks the interface.

With the new "foreground thread", which is optional, my code suddenly started hanging, which doesn't happen on the "background geoprocessing thread" of ArcGIS Pro. I need to stress that I see this issue of hanging with both pyodbc and psycopg2 (the latter doesn't use ODBC but connects directly via libpq AFAIK), so it is not directly related to pyodbc.

This caused me tremendous headaches to figure out. I noticed that when a thread pool was launched using concurrent.futures.ThreadPoolExecutor, all threads but the last one would return. The last one would get stuck, forever returning True from future.running() with "foreground processing". No exception is thrown when this happens, and future.exception() doesn't return anything either. (A minimal sketch of this pattern follows below.)
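
As a minimal sketch of the pattern described above (run_query and queries are hypothetical stand-ins for my real per-thread functions, which originally ended with gc.collect()):

import time
from concurrent.futures import ThreadPoolExecutor

queries = ["SELECT 1", "SELECT 2", "SELECT 3"]

def run_query(sql):
    ...  # open a connection, execute sql, clean up

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_query, q) for q in queries]
    while any(f.running() for f in futures):
        # in the hang, the last future never left this state
        time.sleep(0.5)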

I had no idea why, but started to suspect a possible issue in either Pro or Python 3.13.

I then happened upon this issue, which mentioned gc.collect() as being involved in problems.

So, after many other attempts and wasted hours trying to debug this, I decided to comment out all gc.collect() calls inside the functions that run in the ThreadPoolExecutor threads, i.e. the functions passed to the actual submit calls.

Most of these calls, perhaps unnecessary (although I have learned a lot over the years, I still do not consider myself a real Python expert), were in try...except...finally constructs inside the functions passed to the submit calls.

E.g. like this (the ... placeholders stand in for the real code):

try:
    ...  # actual work
except Exception:
    ...  # error handling
finally:
    # other clean-up operations
    gc.collect()

After doing that and re-running the code on the "foreground thread", it suddenly passed the point where it previously got stuck! I still have another issue to figure out, but this starts to suggest a problem in Python 3.13's gc.collect(), maybe specifically related to the finally construct, or some thread-safety issue?

Looking through the Python 3.13 list of changes (I didn't check older versions), I see at least two issues referencing gc.collect() that might be relevant; see the full release notes for the references to gc.collect() in relation to these issues:

  • https://github.com/python/cpython/issues/81010
  • https://github.com/python/cpython/issues/58581

At least one of them seems to have quite a bit of history, and may have seen changes in e.g. 3.11 as well, but I didn't check.

Anyway, I'm not sure if it is really relevant, but it might be worth keeping in mind and considering this a Python bug rather than a pyodbc one. This issue here on the pyodbc GitHub issue tracker, and its mention of gc.collect(), at least helped me get one major step further with my own upgrade issues, so thanks for documenting it here.

mboeringa avatar Nov 22 '25 20:11 mboeringa

I believe I have tracked down the cause of this bug. For NULL values, GetParameterInfo() calls GetNullInfo(), and the first thing that function does is call GetParamType(). That function creates a dynamically-allocated array of parameter types in the cursor object, using that object's paramcount member to figure out how many elements the array will need. That works fine for top-level parameters, but trying to use the same technique for nested parameters (that is, cells in a TVP) won't work (see the sketch after this list), because:

  • if there are more columns in the TVP than there are top-level parameters (as in the case of this ticket's repro case), we end up writing to memory that doesn't belong to us, hence the crash; and
  • even if that's not true, we're overwriting the cached types of the top-level parameters
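
To make the mismatch concrete, here is a hypothetical Python rendering of what I believe happens (the real code is C++ in cursor.cpp; the names and counts here are illustrative only):

param_count = 3                      # e.g. @stringId, @integerId, @Data
param_types = [None] * param_count   # per-cursor cache, sized for top-level parameters

# When GetNullInfo() is reached for a NULL cell inside the TVP, the same
# cache ends up indexed by the TVP *column* position (48 columns in the
# repro above):
tvp_column_index = 47
try:
    param_types[tvp_column_index] = "SQL_VARCHAR"
except IndexError:
    # Python catches the out-of-bounds write; the C++ code silently
    # corrupts the heap instead, which is what the GC later trips over
    print("out-of-bounds write")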

I've got a working patch, and I'll submit a PR as soon as

  • I get the green light from the maintainers to use the same fallback type (SQL_VARCHAR) as is already being used when the driver doesn't support SQLDescribeParam; and
  • I get the test(s) written.

bkline avatar Dec 13 '25 22:12 bkline

I noticed that test_sql_variant() in sqlserver_test.py is marked to be skipped if SQLSERVER_YEAR is earlier than 2000, but no similar version check is made for the existing TVP tests, even though that feature didn't show up until SQL Server 2008. Was that an oversight? I'll need to know for my own test (a sketch of such a guard follows below).
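
A minimal sketch of the kind of guard I have in mind, mirroring the existing test_sql_variant() check (the SQLSERVER_YEAR constant stands in for whatever _get_sqlserver_year() provides; the test name, body, and cursor fixture are placeholders):

import pytest

SQLSERVER_YEAR = 2019  # in the real suite this would come from _get_sqlserver_year()

@pytest.mark.skipif(
    SQLSERVER_YEAR < 2008,
    reason="TVPs were introduced in SQL Server 2008",
)
def test_tvp_with_nulls(cursor):  # `cursor` would be a suite-provided fixture
    ...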

Also, why is get_sqlserver_version() factored out, but not used by _get_sqlserver_year()?

bkline avatar Dec 14 '25 12:12 bkline

Hi @bkline , many thanks for looking into this issue, and well done for tracking down the cause.

Yes, please do create a PR if you're able to. @mkleehammer has the final say on C++ matters and I'm sure he will be interested in seeing it. For the fallback solution, SQL_VARCHAR does seem the most reasonable/straightforward option.

On the question of the tests in sqlserver_test.py, I can't explain why test_sql_variant() has a version check on it but test_tvp() and test_tvp_diffschema() do not. Feel free to add that check if you like.

As for get_sqlserver_version(), again I'm not sure why it is not used in the tests, and why _get_sqlserver_year() is used instead. I suspect this happened when the unit tests were converted to use pytest a couple of years ago.

keitherskine avatar Dec 14 '25 20:12 keitherskine

PR is in. I stumbled when I was trying to pare down the test to its absolute minimum: I tried setting the TVP value to a single row with NULLs, and that failed. After chasing that around for a while I came across the history for this known limitation. The only way I can see around that problem would be to scan the rows until a non-NULL value has been found for each column (sketched below), and I gather that solution is off the table (presumably for legitimate performance reasons). Even that solution could fail in the worst case of a column for which no row had a non-NULL value. So I settled on using two rows, with the NULL values in the second row, and that does what we want.
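
For reference, a sketch of that rejected per-column scan (written from the description above, not taken from the pyodbc sources):

def first_non_null_per_column(rows):
    """Return one non-None sample value per column, for type inference."""
    samples = [None] * len(rows[0])
    for row in rows:
        for i, value in enumerate(row):
            if samples[i] is None and value is not None:
                samples[i] = value
        if all(s is not None for s in samples):
            break  # every column has a sample; stop scanning early
    return samples  # None entries remain for all-NULL columns (the worst case)

rows = [(1, None), (None, "x")]
print(first_non_null_per_column(rows))  # -> [1, 'x']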

I tested on macOS, Linux, and Windows with several versions of Python, and with the fix the test always passes. Without the fix (as noted in the PR) the test sometimes passes and sometimes crashes Python. I was reluctant to sit in a long loop waiting for a crash: the test suite for this project has always been refreshingly quick (while still maintaining pretty good coverage), and I had no desire to mess with that sweet situation (besides, I found that sometimes I could sit in that loop for a very long time without a crash).

I did discover that the test will reliably crash Python every time if tests/__pycache__ is not there. I don't have any good theories about why that would be, but I learned long ago the first rule of heap corruption debugging: Never rule out any variable as a possible cause.

Of course, this bug doesn't just throw an exception, it actually shuts down Python, halting everything in its tracks. It seems reasonable to regard that as OK, given the expectation that if the PR gets approved, it's because we're confident that the crash won't ever happen again, unless there's a regression which restores the bug. If the maintainers disagree, we could always rewrite the test so that it launches a subprocess (see the sketch below). I didn't want to introduce that complexity/performance tax unless it's really necessary.
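
If we did go the subprocess route, it could look something like this (tvp_null_repro.py is a hypothetical script name):

import subprocess
import sys

def test_tvp_nulls_do_not_crash():
    result = subprocess.run(
        [sys.executable, "tvp_null_repro.py"],
        capture_output=True,
        text=True,
    )
    # a crash shows up as 0xC0000005 on Windows or a negative (signal)
    # return code on POSIX, rather than as a Python exception
    assert result.returncode == 0, result.stderr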

[Edit: moved the comment below to the Discussions forum.]

bkline avatar Dec 15 '25 15:12 bkline