
Passing None to SQL Server INSERTs drastically slows down inserts

nelsonwcf opened this issue 5 years ago • 47 comments

Environment

To diagnose, we usually need to know the following, including version numbers. On Windows, be sure to specify 32-bit Python or 64-bit:

  • Python: 3.7.7
  • pyodbc: 4.0.30
  • OS: Windows 10 x64 [Version 10.0.18362.720]
  • DB: SQL Server 13.0.5698.0
  • driver: ODBC Driver 17 for SQL Server (2017.175.01.01)

Issue

There's a drastic slowdown in parameterized inserts when None/NaN values are present in the data, usually around two orders of magnitude.

I already browsed through issue #213, but it's old and still open. I also read through https://github.com/mkleehammer/pyodbc/wiki/Binding-Parameters, which gave me the idea of replacing the Nones in the first row with valid data. That somewhat fixes the issue, but forces me to run an UPDATE query after the INSERT to restore the 'substitute' NULLs.

Also, I've tried pyodbc.set_none_binding(pyodbc.SQL_VARCHAR) as suggested in the Proposal 1 workaround, but the current pyodbc doesn't recognize this as a property or method.

What is the current status of the NULL/None issue when inserting into SQL Server? Are there any developments?

nelsonwcf avatar Apr 14 '20 17:04 nelsonwcf

Have you tried using setinputsizes to see if that helps improve performance?

gordthompson avatar Apr 14 '20 20:04 gordthompson

I didn't at first, but I tried your suggestion using the following map:

    csr.setinputsizes([
        (pyodbc.SQL_VARCHAR, 128, 0),
        (pyodbc.SQL_VARCHAR, 64, 0),
        (pyodbc.SQL_VARCHAR, 64, 0),
        (pyodbc.SQL_VARCHAR, 64, 0),
        (pyodbc.SQL_TYPE_TIMESTAMP, 7, 0),
        (pyodbc.SQL_REAL, 4, 0),
        (pyodbc.SQL_VARCHAR, 64, 0),
        (pyodbc.SQL_VARCHAR, 64, 0),
        (pyodbc.SQL_VARCHAR, 512, 0),
        (pyodbc.SQL_VARCHAR, 64, 0),
        (pyodbc.SQL_VARCHAR, 64, 0),
        (pyodbc.SQL_VARCHAR, 64, 0),
        (pyodbc.SQL_SMALLINT, 2, 0),
        (pyodbc.SQL_VARCHAR, 64, 0),
        (pyodbc.SQL_VARCHAR, 64, 0),
        (pyodbc.SQL_BIT, 1, 0),
        (pyodbc.SQL_BIT, 1, 0),
    ])

Unfortunately, nothing changes. I'm currently using a workaround in which I make sure the first row of the data contains no None values; this increases the speed from around 70 records/second to 7-8k records/second. While I don't know the internals of pyodbc, it seems to use the first row's values to determine the data types, so setting setinputsizes beforehand has no effect.

nelsonwcf avatar Apr 15 '20 16:04 nelsonwcf
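(For readers who need the interim fix: below is a minimal sketch of the first-row workaround described above. It assumes cnn_str, sql, and data already exist; the "_" placeholder only suits varchar columns, so each substitute must match its column's type.)

    import pyodbc

    # Replace any Nones in the first row with type-appropriate placeholders
    # so pyodbc detects the correct type for every column up front.
    first_row = list(data[0])
    substituted = [i for i, v in enumerate(first_row) if v is None]
    for i in substituted:
        first_row[i] = "_"  # e.g. use 0 instead for smallint/bit columns
    data[0] = tuple(first_row)

    with pyodbc.connect(cnn_str) as cnn:
        with cnn.cursor() as csr:
            csr.fast_executemany = True
            csr.executemany(sql, data)
            # a follow-up UPDATE is then needed to turn the substituted
            # placeholders in the first row back into real NULLs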

Can we assume that you're using fast_executemany = True?

gordthompson avatar Apr 15 '20 19:04 gordthompson

Yes.

    with pyodbc.connect(cnn_str) as cnn:
        with cnn.cursor() as csr:
            csr.fast_executemany = True
            csr.executemany(sql, data)

I'm also using the latest ODBC Driver 17 for SQL Server (I was previously using Driver 13, and both show the same issue).

nelsonwcf avatar Apr 15 '20 20:04 nelsonwcf

I could provide you with the serialized object of the list I'm inserting into SQL Server, and the table structure, if that would help.

nelsonwcf avatar Apr 16 '20 20:04 nelsonwcf

Could you post an ODBC trace? Since you mention performance, I guess there are lots of rows to insert, but you can truncate the trace after it starts to become repetitive.

v-chojas avatar Apr 16 '20 20:04 v-chojas
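(For anyone who wants to capture such a trace: on Windows, tracing is enabled from the Tracing tab of the ODBC Data Source Administrator; with unixODBC it is a small odbcinst.ini edit along these lines, where the trace file path is illustrative.)

    [ODBC]
    Trace     = Yes
    TraceFile = /tmp/odbc.trace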

Sure. Here they are. I'm sending you two log files generated from a sample of 1000 records.

  • In the file "SQL_with_None", I inserted the rows of the dataset as they were, without removing any Nones. It generated a 41k log file.

  • In the file "SQL_without_Nones", I manually changed all the Nones in the first record only to a valid value (for example, for varchar column, I replaced the None to a '_' character, a 0 for a smallint, a 0 for a bit, etc.). The generated log was 797 lines.

SQL_without_Nones.zip SQL_with_None.zip

If you want, I can provide the dataset I used.

For some reason, not having Nones in the first record makes a huge difference.

nelsonwcf avatar Apr 17 '20 01:04 nelsonwcf

The first difference I notice is at line 608: "without_Nones" is calling SQLBindParameter with SQL_C_WCHAR while "with_None" is calling it with SQL_C_BINARY.

gordthompson avatar Apr 17 '20 10:04 gordthompson

Is there anything else I can provide that would help your assessment? I've spent half an hour looking at the tracing files, but they are beyond my ability to debug. If this is relevant: in the article https://github.com/mkleehammer/pyodbc/wiki/Binding-Parameters, Michael Kleehammer says that for SQL Server "None could be a NULL of any type. VARCHAR can be converted to most types, but SQL Server won't convert it to binary, so we get an error inserting into binary columns."

Could this be related?

nelsonwcf avatar Apr 20 '20 15:04 nelsonwcf

In fast_executemany mode, pyODBC attempts to insert as many rows at once as possible by allocating the entire (2-dimensional) array of parameters, binding them, converting the Python objects into the ODBC C representation needed by the driver, and then executing. Without Nones, the log shows a single execution with 1001 rows. With Nones, the types detected from the first row are ultimately incorrect (as Gord mentioned above, it chose binary instead of character), so pyODBC has to re-detect each time it finds a new type that doesn't match the previous ones for the column, and you get many smaller execution batches.

v-chojas avatar Apr 20 '20 16:04 v-chojas
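(To make the batch-splitting concrete, here is a simplified, non-authoritative model of the behavior described above. pyodbc's real logic works on ODBC C types; this sketch only reproduces the effect: a None in an unbound column is treated as "binary", rows accumulate until a non-None cell conflicts with the bound type, and then the accumulated batch is flushed and the column re-bound.)

    def detect(value):
        # a None defaults to "binary", which is what causes the conflicts
        return "binary" if value is None else type(value).__name__

    def simulate_batches(rows):
        batches, current, bound = [], [], None
        for row in rows:
            types = [detect(v) for v in row]
            if bound is None:
                bound = types
            elif any(v is not None and t != b
                     for v, t, b in zip(row, types, bound)):
                # a non-None cell no longer matches the bound type:
                # flush what has been converted so far and re-bind
                batches.append(current)
                current, bound = [], types
            current.append(row)
        if current:
            batches.append(current)
        return batches

    rows = [(None, 1), ("a", 2), ("b", 3), (None, 4), ("c", 5)]
    print([len(b) for b in simulate_batches(rows)])  # -> [1, 4]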

@v-chojas - Does the fast_executemany code really ignore setinputsizes, at least with regard to dealing with null (None) values?

gordthompson avatar Apr 20 '20 16:04 gordthompson

Yes, it does ignore it. I used the following code:

    with pyodbc.connect(cnn_str) as cnn:
        with cnn.cursor() as csr:
            csr.setinputsizes([
                (pyodbc.SQL_VARCHAR, 128, 0),
                (pyodbc.SQL_VARCHAR, 64, 0),
                (pyodbc.SQL_VARCHAR, 64, 0),
                (pyodbc.SQL_VARCHAR, 64, 0),
                (pyodbc.SQL_TYPE_TIMESTAMP, None, 0),
                (pyodbc.SQL_REAL, 4, 7),
                (pyodbc.SQL_VARCHAR, 64, 0),
                (pyodbc.SQL_VARCHAR, 64, 0),
                (pyodbc.SQL_VARCHAR, 512, 0),
                (pyodbc.SQL_VARCHAR, 64, 0),
                (pyodbc.SQL_VARCHAR, 64, 0),
                (pyodbc.SQL_VARCHAR, 64, 0),
                (pyodbc.SQL_SMALLINT, None, 0),
                (pyodbc.SQL_VARCHAR, 64, 0),
                (pyodbc.SQL_VARCHAR, 64, 0),
                (pyodbc.SQL_BIT, None, 0),
            ])
            csr.fast_executemany = True
            csr.executemany(sql, data)

It works the same as without .setinputsizes, and I know the .setinputsizes values are being read, because if I set a precision to None it raises an exception: pyodbc.Error: ('HY104', '[HY104] [Microsoft][ODBC Driver 17 for SQL Server]Invalid precision value (0) (SQLBindParameter)')

The table structure is as follows (CREATE TABLE statement):

    CREATE TABLE [iu].[poc_tmp] (
        [iu_methodpath] varchar NULL,
        [iu_txnid] varchar NULL,
        [iu_accountnumber] varchar NULL,
        [iu_tmsid] varchar NULL,
        [iu_datetime] datetime2 NULL,
        [iu_duration] [real] NULL,
        [iu_market] varchar NULL,
        [iu_macaddress] varchar NULL,
        [iu_accesspoint] varchar NULL,
        [iu_controller] varchar NULL,
        [iu_resptype] varchar NULL,
        [iu_client] varchar NULL,
        [iu_cpid] [smallint] NULL,
        [iu_vodproductname] varchar NULL,
        [iu_vodprovidername] varchar NULL,
        [iu_language] [bit] NULL
    )

Is the behavior @v-chojas explained expected in the None situation? If so, as @gordthompson suggested, wouldn't setting the types and sizes before execution be the solution?

Based on the previous comments, this seems to be a bug indeed, right? Do you want me to provide the tracing with the setinputsizes declared before the execution?

nelsonwcf avatar Apr 24 '20 04:04 nelsonwcf

Even if you use setinputsizes, which sets the SQL type, I suspect a None cannot be detected as any valid C type for the binding, so it causes the code to go through the redetect path. Yes, a trace with setinputsizes would be useful.

v-chojas avatar Apr 24 '20 14:04 v-chojas

Here is the log with setinputsizes (as I described above). Insertion of 999 records.

SQL.zip

nelsonwcf avatar Apr 28 '20 21:04 nelsonwcf

@v-chojas

One question: why do we avoid the problem only when the first row has no Nones? If it's the second row that has no Nones, the entire insert flow is still impacted. I mean, shouldn't pyodbc correctly bind all the fields once it finds a non-None value? Why does it keep re-detecting even after it finds a valid record? Why is the first row special?

nelsonwcf avatar May 05 '20 16:05 nelsonwcf

pyODBC can only bind based on the information it has so far; if a column was bound as C type X, then any further cells that can't be converted to that type require re-detecting and re-binding, which means submitting what it has accumulated so far.

v-chojas avatar May 07 '20 22:05 v-chojas

Since pyODBC can't bind None because SQL Server has different types of NULLs for different types of fields, the alternative would be to tell pyodbc beforehand what the type is. Theoretically, the way to do that would be .setinputsizes(). Since that doesn't work in practice, can this be considered a bug to be worked on in the future?

nelsonwcf avatar May 11 '20 21:05 nelsonwcf

@v-chojas - sqlext.h has mappings between C types and SQL types starting here:

https://github.com/microsoft/ODBC-Specification/blob/b7ef71fba508ed010cd979428efae3091b732d75/Windows/inc/sqlext.h#L548

If we provide the SQL types via setinputsizes then couldn't pyodbc look up the C types that way?

gordthompson avatar May 18 '20 19:05 gordthompson

The C type must match the Python type, because it determines the data format of the bound buffer. That also explains why, in fast_executemany mode, pyODBC must scan the first row to find out what the Python type of each column is (and why Nones are such a problem).

v-chojas avatar May 19 '20 15:05 v-chojas

"pyODBC must scan the first row to find out what the Python type of each column is"

No, not if I've already told it what to expect by calling setinputsizes. Sure, if you want to support a "zero configuration" option that auto-detects and auto-rebinds, that's fine. But if I explicitly tell you what the types are then you should take my word for it.

Or to put it another way, it sounds like the behaviour is to look at each column value in the current row and if it's None then assume that it is the same type as the corresponding column of the previous row. If so, then just use the information from setinputsizes to pre-load that list of column types instead of using the data values in the first row.

gordthompson avatar May 19 '20 17:05 gordthompson

setinputsizes is used for setting the SQL type, not the C type.

v-chojas avatar May 19 '20 17:05 v-chojas

"setinputsizes is used for setting the SQL type, not the C type."

Yes, but from this:

https://github.com/microsoft/ODBC-Specification/blob/b7ef71fba508ed010cd979428efae3091b732d75/Windows/inc/sqlext.h#L548-L611

I was able to derive this:

SQL_type                           C_type                              Comment                              
---------------------------------  ----------------------------------  -------------------------------------
(SQL_BIGINT+SQL_SIGNED_OFFSET)     SQL_C_SBIGINT                       /* SIGNED BIGINT */                  
(SQL_BIGINT+SQL_UNSIGNED_OFFSET)   SQL_C_UBIGINT                       /* UNSIGNED BIGINT */                
(SQL_C_LONG+SQL_SIGNED_OFFSET)     SQL_C_SLONG                         /* SIGNED INTEGER  */                
(SQL_C_LONG+SQL_UNSIGNED_OFFSET)   SQL_C_ULONG                         /* UNSIGNED INTEGER*/                
(SQL_C_SHORT+SQL_SIGNED_OFFSET)    SQL_C_SSHORT                        /* SIGNED SMALLINT */                
(SQL_C_SHORT+SQL_UNSIGNED_OFFSET)  SQL_C_USHORT                        /* UNSIGNED SMALLINT*/               
(SQL_TINYINT+SQL_SIGNED_OFFSET)    SQL_C_STINYINT                      /* SIGNED TINYINT  */                
(SQL_TINYINT+SQL_UNSIGNED_OFFSET)  SQL_C_UTINYINT                      /* UNSIGNED TINYINT*/                
SQL_BINARY                         SQL_C_BINARY                                                             
SQL_BIT                            SQL_C_BIT                                                                
SQL_C_UBIGINT                      SQL_C_BOOKMARK                      /* BOOKMARK        */                
SQL_C_ULONG                        SQL_C_BOOKMARK                      /* BOOKMARK        */                
SQL_CHAR                           SQL_C_CHAR                          /* CHAR, VARCHAR, DECIMAL, NUMERIC */
SQL_DATE                           SQL_C_DATE                                                               
SQL_DOUBLE                         SQL_C_DOUBLE                        /* FLOAT, DOUBLE                */   
SQL_GUID                           SQL_C_GUID                                                               
SQL_INTEGER                        SQL_C_LONG                          /* INTEGER                      */   
SQL_INTERVAL_DAY                   SQL_C_INTERVAL_DAY                                                       
SQL_INTERVAL_DAY_TO_HOUR           SQL_C_INTERVAL_DAY_TO_HOUR                                               
SQL_INTERVAL_DAY_TO_MINUTE         SQL_C_INTERVAL_DAY_TO_MINUTE                                             
SQL_INTERVAL_DAY_TO_SECOND         SQL_C_INTERVAL_DAY_TO_SECOND                                             
SQL_INTERVAL_HOUR                  SQL_C_INTERVAL_HOUR                                                      
SQL_INTERVAL_HOUR_TO_MINUTE        SQL_C_INTERVAL_HOUR_TO_MINUTE                                            
SQL_INTERVAL_HOUR_TO_SECOND        SQL_C_INTERVAL_HOUR_TO_SECOND                                            
SQL_INTERVAL_MINUTE                SQL_C_INTERVAL_MINUTE                                                    
SQL_INTERVAL_MINUTE_TO_SECOND      SQL_C_INTERVAL_MINUTE_TO_SECOND                                          
SQL_INTERVAL_MONTH                 SQL_C_INTERVAL_MONTH                                                     
SQL_INTERVAL_SECOND                SQL_C_INTERVAL_SECOND                                                    
SQL_INTERVAL_YEAR                  SQL_C_INTERVAL_YEAR                                                      
SQL_INTERVAL_YEAR_TO_MONTH         SQL_C_INTERVAL_YEAR_TO_MONTH                                             
SQL_NUMERIC                        SQL_C_NUMERIC                                                            
SQL_REAL                           SQL_C_FLOAT                         /* REAL                         */   
SQL_SMALLINT                       SQL_C_SHORT                         /* SMALLINT                     */   
SQL_TIME                           SQL_C_TIME                                                               
SQL_TIMESTAMP                      SQL_C_TIMESTAMP                                                          
SQL_TINYINT                        SQL_C_TINYINT                                                            
SQL_TYPE_DATE                      SQL_C_TYPE_DATE                                                          
SQL_TYPE_TIME                      SQL_C_TYPE_TIME                                                          
SQL_TYPE_TIME_WITH_TIMEZONE        SQL_C_TYPE_TIME_WITH_TIMEZONE                                            
SQL_TYPE_TIMESTAMP                 SQL_C_TYPE_TIMESTAMP                                                     
SQL_TYPE_TIMESTAMP_WITH_TIMEZONE   SQL_C_TYPE_TIMESTAMP_WITH_TIMEZONE

Not sure about the ones with the brackets, but the most common players are present and there are no duplicates in the "SQL_type" column, so the lookup would be unambiguous.

gordthompson avatar May 19 '20 19:05 gordthompson
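(A rough sketch of that proposal, with the type names as strings since pyodbc does not expose the SQL_C_* constants; the mapping below is a hand-picked subset of the table above.)

    # Hypothetical lookup: given the SQL type passed to setinputsizes,
    # derive the default C type from the sqlext.h table.
    DEFAULT_C_TYPE = {
        "SQL_CHAR": "SQL_C_CHAR",      # CHAR, VARCHAR, DECIMAL, NUMERIC
        "SQL_BIT": "SQL_C_BIT",
        "SQL_SMALLINT": "SQL_C_SHORT",
        "SQL_INTEGER": "SQL_C_LONG",
        "SQL_REAL": "SQL_C_FLOAT",
        "SQL_DOUBLE": "SQL_C_DOUBLE",  # FLOAT, DOUBLE
        "SQL_TYPE_TIMESTAMP": "SQL_C_TYPE_TIMESTAMP",
    }

    def c_type_for(sql_type: str) -> str:
        # fall back to character data when the SQL type is not in the table
        return DEFAULT_C_TYPE.get(sql_type, "SQL_C_CHAR")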

I would go even further and suggest that if the parameter classes in the provided Python parameters are not the same (for each field, ignoring None), then executemany should error out immediately.

    params = [ ('A', 1), ('B', 2  ) ]  # OK, all field types are the same (or None)
    params = [ ('A', 1), ('B', 2.0) ]  # not OK because 1 is an int and 2.0 is a float

It doesn't seem unreasonable to insist the parameter types for each field are the same.

keitherskine avatar May 19 '20 20:05 keitherskine
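(A minimal sketch of that proposed check, in pure Python and independent of pyodbc: scan each column once and raise as soon as two non-None values disagree on type.)

    def check_homogeneous(params):
        """Raise TypeError if any column mixes non-None Python types."""
        col_types = {}
        for row in params:
            for i, value in enumerate(row):
                if value is None:
                    continue  # Nones are ignored, per the proposal
                expected = col_types.setdefault(i, type(value))
                if type(value) is not expected:
                    raise TypeError(
                        f"column {i}: mixed types "
                        f"{expected.__name__} and {type(value).__name__}"
                    )

    check_homogeneous([("A", 1), ("B", 2)])    # OK
    check_homogeneous([("A", 1), ("B", 2.0)])  # raises TypeError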

The C type needs to match the Python type, or it will result in incorrect data being inserted; you've only found the default conversions, but the driver can do others.

Python is a dynamic language, so it's absolutely normal to have heterogeneous types within a single column as long as they are convertible.

v-chojas avatar May 21 '20 14:05 v-chojas

@nelsonwcf - I created an MCVE hoping to reproduce the issue, but it didn't show the problem. The following code inserts 1_000_000 rows in 53 seconds, regardless of whether the None is in the first or second row.

    import os
    import time

    import pyodbc

    cnxn = pyodbc.connect(
        "DRIVER=ODBC Driver 17 for SQL Server;"
        "SERVER=192.168.0.199;"
        "DATABASE=mydb;"
        "UseFMTONLY=yes;"
        f"UID=sa;PWD={os.getenv('sa_PWD')};",
        autocommit=True,
    )
    crsr = cnxn.cursor()
    crsr.execute("CREATE TABLE #tmp (id int PRIMARY KEY, txt varchar(50))")

    rows_to_insert = 1_000_000
    row_index_for_null = 0
    row_data = [
        (x, (None if x == row_index_for_null else f"row{x}"),)
        for x in range(rows_to_insert)
    ]

    sql = "INSERT INTO #tmp (id, txt) VALUES (?, ?)"
    crsr.fast_executemany = True
    t0 = time.time()
    crsr.executemany(sql, row_data)
    print(f"{rows_to_insert:,} rows inserted in {(time.time() - t0):0.1f} seconds")

gordthompson avatar May 24 '20 13:05 gordthompson

Yes, I tested it and confirm that your code didn't show the issue even though the first line has a None. Maybe there's an additional hidden criterion required for the issue to show? I will prepare a small dataset that demonstrates the problem and share it here (a sample from the one I use at my job).

nelsonwcf avatar May 28 '20 14:05 nelsonwcf

I think the best way to avoid guessing at the requirements for the issue to show is to use the original file that created the problem. I'm attaching a 999-record CSV, the one I used to generate the ODBC tracing log files. It takes 1 second to insert all the records when no Nones are present in the first line, and around 20 seconds otherwise.

Maybe this will help? sample_test.zip

nelsonwcf avatar Jun 02 '20 21:06 nelsonwcf

@nelsonwcf – Thanks for posting the sample data. On reviewing the issue I noticed two things:

  1. Your original post talks about "when None/Nan are present", and the mention of NaN suggests that you may have been trying to use pandas to_sql to upload the data. Is that the case?

  2. In your follow-up post you provide the DDL for the table, which begins with CREATE TABLE [iu].[poc_tmp]. Is [iu] the default schema for the user under which the insert code is running? We've seen other cases of things getting a bit weird when working with an object that does not belong to the default schema. #595 involves TVPs, but TVPs and .executemany use the same approach for packing data values into a parameter array (ref: #601).

gordthompson avatar Jun 02 '20 23:06 gordthompson

  1. That was when I found the issue. I noticed the problem using pandas, after adding fast_executemany to the SQLAlchemy engine. Once I noticed the problem, however, I stopped using pandas with SQLAlchemy altogether.

  2. Yes, I did execute the insert statement using the [iu] schema. The default, dbo, is reserved on the SQL Server I use. But if that were the cause, wouldn't your 1M-record insert code also show the issue? I modified it to use the iu schema so I could test on the same server where I noticed the problem.

nelsonwcf avatar Jun 03 '20 03:06 nelsonwcf
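(For context, the pandas path described above typically looks like the sketch below; the connection URL and table names are illustrative, and fast_executemany=True is the SQLAlchemy mssql+pyodbc flag that enables the pyodbc feature discussed in this thread. NaNs in the frame arrive at the driver as Nones.)

    import pandas as pd
    import sqlalchemy as sa

    engine = sa.create_engine(
        "mssql+pyodbc://user:pass@myserver/mydb"
        "?driver=ODBC+Driver+17+for+SQL+Server",
        fast_executemany=True,  # turns on pyodbc's fast_executemany
    )
    df = pd.read_csv("sample_test.csv")
    df.to_sql("poc_tmp", engine, schema="iu", if_exists="append", index=False)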

I am able to sort-of reproduce your issue, although in my case the performance penalty for null(s) in the first row is more like a factor of 2, not 20:

sample_test_no_null_first_line.csv

  • 999 rows inserted in 0.9 seconds
  • 646 packets, 563019 bytes

sample_test.csv

  • 999 rows inserted in 1.7 seconds
  • 922 packets, 585162 bytes

The "packets" and "bytes" numbers come from Wireshark. Interesting that the second case is only generating 4% more network traffic, but it's taking 43% more packets to get the job done.

gordthompson avatar Jun 03 '20 13:06 gordthompson