pyodbc
Passing None to SQL Server INSERTs drastically slows down inserts:
Environment
To diagnose, we usually need to know the following, including version numbers. On Windows, be sure to specify 32-bit Python or 64-bit:
- Python: 3.7.7
- pyodbc: 4.0.30
- OS: Windows 10 x64 [Version 10.0.18362.720]
- DB: SQL Server 13.0.5698.0
- driver: ODBC Driver 17 for SQL Server (2017.175.01.01)
Issue
There's a drastic slowdown in parameterized inserts when None/NaN values are present in the data, typically around two orders of magnitude.
I already browsed through issue #213, but it's old and still open. I also read through https://github.com/mkleehammer/pyodbc/wiki/Binding-Parameters, which gave me the idea to replace the Nones in the first row with valid data. That somewhat fixes the issue, but it forces me to run an UPDATE query after the INSERT to restore the 'substitute' NULLs.
I've also tried pyodbc.set_none_binding(pyodbc.SQL_VARCHAR) as suggested in the Proposal 1 workaround, but the current pyodbc doesn't recognize this as a property or method.
What is the current status of the NULL/None issue when inserting in SQL Server? Are there any developments?
Have you tried using setinputsizes to see if that helps improve performance?
I didn't at first but tried your suggestion using the following map:
csr.setinputsizes([
    (pyodbc.SQL_VARCHAR, 128, 0),
    (pyodbc.SQL_VARCHAR, 64, 0),
    (pyodbc.SQL_VARCHAR, 64, 0),
    (pyodbc.SQL_VARCHAR, 64, 0),
    (pyodbc.SQL_TYPE_TIMESTAMP, 7, 0),
    (pyodbc.SQL_REAL, 4, 0),
    (pyodbc.SQL_VARCHAR, 64, 0),
    (pyodbc.SQL_VARCHAR, 64, 0),
    (pyodbc.SQL_VARCHAR, 512, 0),
    (pyodbc.SQL_VARCHAR, 64, 0),
    (pyodbc.SQL_VARCHAR, 64, 0),
    (pyodbc.SQL_VARCHAR, 64, 0),
    (pyodbc.SQL_SMALLINT, 2, 0),
    (pyodbc.SQL_VARCHAR, 64, 0),
    (pyodbc.SQL_VARCHAR, 64, 0),
    (pyodbc.SQL_BIT, 1, 0),
    (pyodbc.SQL_BIT, 1, 0),
])
Unfortunately, nothing changes. I'm currently using a workaround in which I make all values in the first row of the data non-None. This increases the speed from around 70 records/second to 7-8k records/second. While I don't know the internals of pyodbc, it seems to use the values in the first row to determine the data types; calling setinputsizes beforehand has no effect.
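For reference, a simplified sketch of that workaround (the placeholder values and the follow-up UPDATE below are illustrative, not my exact script):

# Replace any Nones in the *first* row with type-appropriate placeholders so
# pyodbc's type detection sees real values, then patch the substitutes back
# to NULL after the insert. Placeholder choices here are illustrative.
placeholders = ('_', '_', '_', 0, 0.0, False)   # one value per column, matching its type
data[0] = tuple(
    value if value is not None else placeholder
    for value, placeholder in zip(data[0], placeholders)
)

csr.fast_executemany = True
csr.executemany(sql, data)

# Afterwards, one UPDATE per affected column, e.g.:
# csr.execute("UPDATE iu.poc_tmp SET iu_client = NULL WHERE iu_client = '_'")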
Can we assume that you're using fast_executemany = True?
Yes.
with pyodbc.connect(cnn_str) as cnn:
    with cnn.cursor() as csr:
        csr.fast_executemany = True
        csr.executemany(sql, data)
I'm also using the latest ODBC Driver 17 for SQL Server (previously I was using Driver 13, and both show the same issue).
I could provide you with the serialized object of the list I'm inserting into SQL Server, along with the table structure, if that would help.
Could you post an ODBC trace? Since you mention performance I guess there are lots of rows to insert but you can truncate the trace after it starts to become repetitive.
Sure. Here they are. I'm sending you two log files generated by a sample of 1000 records.
- In the file "SQL_with_None", I inserted the rows of the dataset as they were, without removing any Nones. It generated a 41k log file.
- In the file "SQL_without_Nones", I manually changed all the Nones in the first record only to a valid value (for example, for a varchar column I replaced the None with a '_' character, 0 for a smallint, 0 for a bit, etc.). The generated log was 797 lines.
SQL_without_Nones.zip SQL_with_None.zip
If you want, I can provide you with the dataset I used.
For some reason, not having Nones in the first record makes a huge difference.
The first difference I notice is at line 608: "without_Nones" is calling SQLBindParameter with SQL_C_WCHAR while "with_None" is calling it with SQL_C_BINARY.
Is there anything else I can provide that would help your assessment? I've spent half an hour looking at the trace files, but they are beyond my ability to debug. If this is relevant, in this article https://github.com/mkleehammer/pyodbc/wiki/Binding-Parameters Michael Kleehammer says that for SQL Server "None could be a NULL of any type. VARCHAR can be converted to most types, but SQL Server won't convert it to binary, so we get an error inserting into binary columns."
Could this be related?
In fast_executemany mode, pyODBC attempts to insert as many rows at once as possible by allocating the entire (2-dimensional) array of parameters, binding them, converting the Python objects into the ODBC C representation needed by the driver, and then executing. Without Nones, the log shows a single execution with 1001 rows. With Nones, the types detected from the first row are ultimately incorrect (as Gord mentioned above, it chose binary instead of character), so pyODBC has to re-detect whenever it finds a new type that doesn't match the previous one for a column, and you end up with many smaller execution batches.
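Roughly, the flow described above can be illustrated like this (a Python approximation of the described behaviour, not pyodbc's actual C++ implementation):

# Rough illustration only -- each "flush" corresponds to one smaller execution batch.
def count_execution_batches(rows):
    bound_types = None      # per-column types chosen when the batch was bound
    batches = 0
    pending = False
    for row in rows:
        if bound_types is None:
            bound_types = [type(v) for v in row]    # detect from this row
        elif any(v is not None and type(v) is not t
                 for v, t in zip(row, bound_types)):
            batches += 1                            # type mismatch: flush and re-bind
            bound_types = [type(v) for v in row]
        pending = True
    return batches + (1 if pending else 0)

# A None in the first row binds that column as NoneType, so the next non-None
# value forces a flush; with no Nones up front, everything goes in one batch.
print(count_execution_batches([(None, 1), ("a", 2), ("b", 3)]))  # 2
print(count_execution_batches([("a", 1), (None, 2), ("b", 3)]))  # 1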
@v-chojas - Does the fast_executemany code really ignore setinputsizes, at least with regard to dealing with null (None) values?
Yes, it does ignore it. I used the following code:
with pyodbc.connect(cnn_str) as cnn:
    with cnn.cursor() as csr:
        csr.setinputsizes([
            (pyodbc.SQL_VARCHAR, 128, 0),
            (pyodbc.SQL_VARCHAR, 64, 0),
            (pyodbc.SQL_VARCHAR, 64, 0),
            (pyodbc.SQL_VARCHAR, 64, 0),
            (pyodbc.SQL_TYPE_TIMESTAMP, None, 0),
            (pyodbc.SQL_REAL, 4, 7),
            (pyodbc.SQL_VARCHAR, 64, 0),
            (pyodbc.SQL_VARCHAR, 64, 0),
            (pyodbc.SQL_VARCHAR, 512, 0),
            (pyodbc.SQL_VARCHAR, 64, 0),
            (pyodbc.SQL_VARCHAR, 64, 0),
            (pyodbc.SQL_VARCHAR, 64, 0),
            (pyodbc.SQL_SMALLINT, None, 0),
            (pyodbc.SQL_VARCHAR, 64, 0),
            (pyodbc.SQL_VARCHAR, 64, 0),
            (pyodbc.SQL_BIT, None, 0),
        ])
        csr.fast_executemany = True
        csr.executemany(sql, data)
It works the same as without the .setinputsizes, and I know the .setinputsizes values are being read because if I set the precision to None it raises an exception: pyodbc.Error: ('HY104', '[HY104] [Microsoft][ODBC Driver 17 for SQL Server]Invalid precision value (0) (SQLBindParameter)')
The table structure is as follows (CREATE TABLE statement; the varchar lengths correspond to the setinputsizes map above):

CREATE TABLE [iu].[poc_tmp](
    [iu_methodpath] varchar(128) NULL,
    [iu_txnid] varchar(64) NULL,
    [iu_accountnumber] varchar(64) NULL,
    [iu_tmsid] varchar(64) NULL,
    [iu_datetime] datetime2 NULL,
    [iu_duration] real NULL,
    [iu_market] varchar(64) NULL,
    [iu_macaddress] varchar(64) NULL,
    [iu_accesspoint] varchar(512) NULL,
    [iu_controller] varchar(64) NULL,
    [iu_resptype] varchar(64) NULL,
    [iu_client] varchar(64) NULL,
    [iu_cpid] smallint NULL,
    [iu_vodproductname] varchar(64) NULL,
    [iu_vodprovidername] varchar(64) NULL,
    [iu_language] bit NULL
)
Would the behavior explained by @v-chojas be expected in the None situation? If so, and as @gordthompson suggested, wouldn't setting the types and sizes before the execution be the solution?
Based on the previous comments, this does seem to be a bug, right? Do you want me to provide a trace with the setinputsizes declared before the execution?
Even if you use setinputsizes, which sets the SQL type, I suspect a None cannot be detected as any valid C type for the binding, so it causes the code to go through the redetect path. Yes, a trace with setinputsizes would be useful.
@v-chojas
One question: why do we avoid the problem only when the first row has no Nones? If it's the second row that has no Nones, the entire insert flow is still impacted. I mean, shouldn't pyodbc correctly bind all the fields once it finds a non-None value? Why does it keep re-detecting even after it finds a valid record? Why is the first row special?
pyODBC can only bind based on the information it has so far; if a column was bound as C type X, then any further cells that can't be converted to that type require re-detecting and re-binding, which means submitting what it has accumulated so far.
Since pyODBC can't bind None (because SQL Server has different types of NULLs for different field types), the alternative would be to tell pyodbc beforehand what the type is. Theoretically, the way to do that would be .setinputsizes(). Since this doesn't work in practice, can this be considered a bug to be worked on in the future?
@v-chojas - sqlext.h has mappings between C types and SQL types starting here:
https://github.com/microsoft/ODBC-Specification/blob/b7ef71fba508ed010cd979428efae3091b732d75/Windows/inc/sqlext.h#L548
If we provide the SQL types via setinputsizes then couldn't pyodbc look up the C types that way?
The C type must match the Python type, because it determines the data format of the bound buffer, and that also explains why in fast executemany mode, pyODBC must scan the first row to find out what the Python type of each column is (and why Nones are such a problem.)
pyODBC must scan the first row to find out what the Python type of each column is
No, not if I've already told it what to expect by calling setinputsizes. Sure, if you want to support a "zero configuration" option that auto-detects and auto-rebinds, that's fine. But if I explicitly tell you what the types are then you should take my word for it.
Or to put it another way, it sounds like the behaviour is to look at each column value in the current row and if it's None then assume that it is the same type as the corresponding column of the previous row. If so, then just use the information from setinputsizes to pre-load that list of column types instead of using the data values in the first row.
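In rough pseudo-Python, the proposal amounts to something like this (hypothetical function, not existing pyodbc code):

# Hypothetical sketch of the proposal: trust the declared types when the
# caller has supplied them, and only fall back to sniffing the first row.
def initial_column_types(inputsizes, first_row):
    if inputsizes:                                   # caller called setinputsizes
        return [sql_type for sql_type, _size, _scale in inputsizes]
    return [type(value) for value in first_row]      # current first-row detection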
setinputsizes is used for setting the SQL type, not the C type.
setinputsizes is used for setting the SQL type, not the C type.
Yes, but from this:
https://github.com/microsoft/ODBC-Specification/blob/b7ef71fba508ed010cd979428efae3091b732d75/Windows/inc/sqlext.h#L548-L611
I was able to derive this:
SQL_type C_type Comment
--------------------------------- ---------------------------------- -------------------------------------
(SQL_BIGINT+SQL_SIGNED_OFFSET) SQL_C_SBIGINT /* SIGNED BIGINT */
(SQL_BIGINT+SQL_UNSIGNED_OFFSET) SQL_C_UBIGINT /* UNSIGNED BIGINT */
(SQL_C_LONG+SQL_SIGNED_OFFSET) SQL_C_SLONG /* SIGNED INTEGER */
(SQL_C_LONG+SQL_UNSIGNED_OFFSET) SQL_C_ULONG /* UNSIGNED INTEGER*/
(SQL_C_SHORT+SQL_SIGNED_OFFSET) SQL_C_SSHORT /* SIGNED SMALLINT */
(SQL_C_SHORT+SQL_UNSIGNED_OFFSET) SQL_C_USHORT /* UNSIGNED SMALLINT*/
(SQL_TINYINT+SQL_SIGNED_OFFSET) SQL_C_STINYINT /* SIGNED TINYINT */
(SQL_TINYINT+SQL_UNSIGNED_OFFSET) SQL_C_UTINYINT /* UNSIGNED TINYINT*/
SQL_BINARY SQL_C_BINARY
SQL_BIT SQL_C_BIT
SQL_C_UBIGINT SQL_C_BOOKMARK /* BOOKMARK */
SQL_C_ULONG SQL_C_BOOKMARK /* BOOKMARK */
SQL_CHAR SQL_C_CHAR /* CHAR, VARCHAR, DECIMAL, NUMERIC */
SQL_DATE SQL_C_DATE
SQL_DOUBLE SQL_C_DOUBLE /* FLOAT, DOUBLE */
SQL_GUID SQL_C_GUID
SQL_INTEGER SQL_C_LONG /* INTEGER */
SQL_INTERVAL_DAY SQL_C_INTERVAL_DAY
SQL_INTERVAL_DAY_TO_HOUR SQL_C_INTERVAL_DAY_TO_HOUR
SQL_INTERVAL_DAY_TO_MINUTE SQL_C_INTERVAL_DAY_TO_MINUTE
SQL_INTERVAL_DAY_TO_SECOND SQL_C_INTERVAL_DAY_TO_SECOND
SQL_INTERVAL_HOUR SQL_C_INTERVAL_HOUR
SQL_INTERVAL_HOUR_TO_MINUTE SQL_C_INTERVAL_HOUR_TO_MINUTE
SQL_INTERVAL_HOUR_TO_SECOND SQL_C_INTERVAL_HOUR_TO_SECOND
SQL_INTERVAL_MINUTE SQL_C_INTERVAL_MINUTE
SQL_INTERVAL_MINUTE_TO_SECOND SQL_C_INTERVAL_MINUTE_TO_SECOND
SQL_INTERVAL_MONTH SQL_C_INTERVAL_MONTH
SQL_INTERVAL_SECOND SQL_C_INTERVAL_SECOND
SQL_INTERVAL_YEAR SQL_C_INTERVAL_YEAR
SQL_INTERVAL_YEAR_TO_MONTH SQL_C_INTERVAL_YEAR_TO_MONTH
SQL_NUMERIC SQL_C_NUMERIC
SQL_REAL SQL_C_FLOAT /* REAL */
SQL_SMALLINT SQL_C_SHORT /* SMALLINT */
SQL_TIME SQL_C_TIME
SQL_TIMESTAMP SQL_C_TIMESTAMP
SQL_TINYINT SQL_C_TINYINT
SQL_TYPE_DATE SQL_C_TYPE_DATE
SQL_TYPE_TIME SQL_C_TYPE_TIME
SQL_TYPE_TIME_WITH_TIMEZONE SQL_C_TYPE_TIME_WITH_TIMEZONE
SQL_TYPE_TIMESTAMP SQL_C_TYPE_TIMESTAMP
SQL_TYPE_TIMESTAMP_WITH_TIMEZONE SQL_C_TYPE_TIMESTAMP_WITH_TIMEZONE
Not sure about the ones with the brackets, but the most common players are present and there are no duplicates in either the "SQL_type" or "C_type" columns.
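Expressed as a Python lookup for the common cases (the SQL_* constants are ones pyodbc already exports for use with setinputsizes; the C types are plain strings here because pyodbc doesn't export SQL_C_* constants):

import pyodbc

# Default SQL-type -> C-type pairs from sqlext.h for the common cases.
SQL_TO_C = {
    pyodbc.SQL_CHAR:           "SQL_C_CHAR",          # CHAR, VARCHAR, DECIMAL, NUMERIC
    pyodbc.SQL_BIT:            "SQL_C_BIT",
    pyodbc.SQL_SMALLINT:       "SQL_C_SHORT",
    pyodbc.SQL_INTEGER:        "SQL_C_LONG",
    pyodbc.SQL_REAL:           "SQL_C_FLOAT",
    pyodbc.SQL_DOUBLE:         "SQL_C_DOUBLE",        # FLOAT, DOUBLE
    pyodbc.SQL_BINARY:         "SQL_C_BINARY",
    pyodbc.SQL_TYPE_DATE:      "SQL_C_TYPE_DATE",
    pyodbc.SQL_TYPE_TIMESTAMP: "SQL_C_TYPE_TIMESTAMP",
}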
I would go even further and suggest that if the parameter classes in the provided Python parameters are not the same (for each field, ignoring None), then executemany should error out immediately.
params = [ ('A', 1), ('B', 2 ) ] # OK, all field types are the same (or None)
params = [ ('A', 1), ('B', 2.0) ] # not OK because 1 is an int and 2.0 is a float
It doesn't seem unreasonable to insist the parameter types for each field are the same.
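A caller-side version of that check could look like this (hypothetical helper, just to make "error out immediately" concrete):

# Hypothetical pre-flight check: ensure each field holds a single Python type
# (ignoring None) before handing the parameters to executemany.
def assert_homogeneous(params):
    seen = {}                                    # column index -> first type seen
    for row_num, row in enumerate(params):
        for col, value in enumerate(row):
            if value is None:
                continue
            expected = seen.setdefault(col, type(value))
            if type(value) is not expected:
                raise TypeError(
                    f"row {row_num}, column {col}: expected {expected.__name__},"
                    f" got {type(value).__name__}"
                )

assert_homogeneous([('A', 1), ('B', 2)])     # OK: all field types match (or are None)
assert_homogeneous([('A', 1), ('B', 2.0)])   # raises TypeError: int vs float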
The C type needs to match the Python type, or incorrect data will be inserted; you've only found the default conversions, but the driver can do others.
Python is a dynamic language, so it's absolutely normal to have heterogeneous types within a single column as long as they are convertible.
@nelsonwcf - I created an MCVE hoping to reproduce the issue but it didn't. The following code inserts 1_000_000 rows in 53 seconds, regardless of whether the None is in the first or second row.
import os
import time
import pyodbc

cnxn = pyodbc.connect(
    "DRIVER=ODBC Driver 17 for SQL Server;"
    "SERVER=192.168.0.199;"
    "DATABASE=mydb;"
    "UseFMTONLY=yes;"
    f"UID=sa;PWD={os.getenv('sa_PWD')};",
    autocommit=True,
)
crsr = cnxn.cursor()
crsr.execute("CREATE TABLE #tmp (id int PRIMARY KEY, txt varchar(50))")

rows_to_insert = 1_000_000
row_index_for_null = 0
row_data = [
    (x, (None if x == row_index_for_null else f"row{x}"),)
    for x in range(rows_to_insert)
]
sql = "INSERT INTO #tmp (id, txt) VALUES (?, ?)"
crsr.fast_executemany = True
t0 = time.time()
crsr.executemany(sql, row_data)
print(f"{rows_to_insert:,} rows inserted in {(time.time() - t0):0.1f} seconds")
Yes, I tested it and can confirm that your code didn't show the issue even though the first row has a None. Maybe there's an additional hidden criterion required for the issue to show up? I will prepare a small dataset that demonstrates the problem and share it here (a sample from the one I used at my job).
I think the best way to avoid guessing at the requirements for the issue to appear is to use the original file that created the problem. I'm attaching a 999-record CSV, the one I used to generate the ODBC trace log files. It takes about 1 second to insert all records when no Nones are present in the first line, and around 20 seconds otherwise.
Maybe this will help? sample_test.zip
@nelsonwcf – Thanks for posting the sample data. On reviewing the issue I noticed two things:
- Your original post talks about "when None/NaN are present", and the mention of NaN suggests that you may have been trying to use pandas to_sql to upload the data. Is that the case?
- In your follow-up post you provide the DDL for the table, which begins with CREATE TABLE [iu].[poc_tmp]. Is [iu] the default schema for the user under which the insert code is running? We've seen other cases of things getting a bit weird when working with an object that does not belong to the default schema. #595 involves TVPs, but TVPs and .executemany use the same approach for packing data values into a parameter array (ref: #601).
- That was when I found the issue. I noticed the problem using pandas, after adding fast_executemany to the SQLAlchemy engine. Once I noticed the problem, however, I stopped using pandas with SQLAlchemy altogether.
- Yes, I did execute the INSERT statement using the [iu] schema. The default, dbo, is reserved on the SQL Server I use. But if that were the cause, wouldn't your 1M-record insert code also show the issue, since I modified it to use the iu schema? I needed to modify it to be able to test on the same server where I noticed the problem.
I am able to sort-of-reproduce your issue, although in my case the performance penalty for null(s) in the first row is more like a factor of 2, not 20:
sample_test_no_null_first_line.csv
- 999 rows inserted in 0.9 seconds
- 646 packets, 563019 bytes
sample_test.csv
- 999 rows inserted in 1.7 seconds
- 922 packets, 585162 bytes
The "packets" and "bytes" numbers come from Wireshark. Interesting that the second case is only generating 4% more network traffic, but it's taking 43% more packets to get the job done.