Need a way to define data type/length when passing variables in SQL Server
I'm doing selects/inserts/updates on my tables, and SQL Server is creating an execution plan for every permutation of character lengths that Python passes into those selects/inserts/updates.
I believe this is causing some issues with the performance of the db and may create some memory issues.
I would like to be able to set the data type/length when passing parameters. Ideally, pyodbc would know the datatype/length before inserting and set them accordingly, but I don't know if this would create more issues than it solves, so being able to define them myself might help.
More specifically, when I run a trace, this is what I see:
declare @p1 int
set @p1=4882452
exec sp_prepexec @p1 output,N'@P1 nvarchar(11),@P2 nvarchar(16)',N'SELECT TOP 1 C_KEY FROM C WHERE A_NM = @P1 AND C_NM = @P2',N'Zantesuken1',N'RasalhagueLegacy'
select @p1
and when looking at the execution plans, I find that an execution plan has been created for every single permutation of these selects.
Perhaps something like this:
cursor.execute("SELECT X FROM Y WHERE Z1 = ? AND Z2 = ?", param1.type(varchar(50)), param2.type(bigint))
No idea if this makes sense and maybe this is already possible somehow and I don't know about it, so let me know if there's a way around this issue.
edit: Also noticed that the auto-incrementing @p1 counter seems to be going up really quickly (I'm doing a lot of transactions). Is there any way to reset it? I've tried deleting the cursor and closing the connection, but nothing seems to work.
Interesting. I guess it makes sense if the target size will matter to the execution plan. I need to look into this for one of my own projects.
Is it possible to test the impact using casts? Something like Z1 = cast(? as varchar(50)).
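A minimal sketch of what that cast-based test could look like, reusing the query from the trace above (the DSN is a placeholder; whether this actually collapses the plans is exactly what the trace would need to confirm):

import pyodbc

cnxn = pyodbc.connect("DSN=myDb")  # placeholder DSN
crsr = cnxn.cursor()
# Same query as in the trace, but with explicit casts around the parameters,
# so the comparison is always against varchar(50) regardless of the length
# of the incoming Python string.
sql = ("SELECT TOP 1 C_KEY FROM C "
       "WHERE A_NM = CAST(? AS varchar(50)) AND C_NM = CAST(? AS varchar(50))")
row = crsr.execute(sql, "Zantesuken1", "RasalhagueLegacy").fetchone()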
As for @p1, I actually don't know why the driver does that, but I don't believe it will be a problem. I have some 24x7 applications using pyodbc pretty heavily and they usually open a connection and leave it open for weeks or months - they reset when there is an error or it is restarted. I haven't noticed anything related to it.
@ZaM-Rai re: closing connection - The ODBC infrastructure on Windows automatically enables connection pooling for the SQL Server ODBC drivers, so you could try pyodbc.pooling = False before opening your connection(s), or even disable connection pooling for your SQL Server ODBC driver via the ODBC Administrator (odbcad32.exe).
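For example, a minimal sketch (placeholder DSN; pooling must be set before the first connection is opened, since it applies to the ODBC environment):

import pyodbc

pyodbc.pooling = False             # must run before any pyodbc.connect() call
cnxn = pyodbc.connect("DSN=myDb")  # placeholder DSN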
Having a way to specify - or at least "hint" at - parameter types and sizes (as can be done with .Parameters.Add() in ADO.NET) might also help with issues like #212.
Not the most sophisticated test, but it does seem to indicate that different parameter lengths do produce multiple cache entries for the same command text:
import pyodbc
conn = pyodbc.connect("DSN=myDb")
def check_cache():
    sql = """\
        SELECT *
        FROM sys.dm_exec_cached_plans AS p
        CROSS APPLY sys.dm_exec_sql_text(p.plan_handle) AS s
        WHERE s.text LIKE '(@p1 nvarchar%';
        """
    cache_crsr = conn.cursor()
    n = 0
    for row in cache_crsr.execute(sql):
        print(row.text)
        n += 1
    print(f"----- cache entries found: {n} -----")
    cache_crsr.close()
crsr = conn.cursor()
print("clearing cache ...")
crsr.execute("DBCC FREEPROCCACHE")
check_cache()
crsr.execute("SELECT * FROM dbo.Contacts WHERE FirstName LIKE ?", "g%").fetchall()
check_cache()
crsr.execute("SELECT * FROM dbo.Contacts WHERE FirstName LIKE ?", "gord%").fetchall()
check_cache()
crsr.close()
conn.close()
Console output:
clearing cache ...
----- cache entries found: 0 -----
(@P1 nvarchar(2))SELECT * FROM dbo.Contacts WHERE FirstName LIKE @P1
----- cache entries found: 1 -----
(@P1 nvarchar(5))SELECT * FROM dbo.Contacts WHERE FirstName LIKE @P1
(@P1 nvarchar(2))SELECT * FROM dbo.Contacts WHERE FirstName LIKE @P1
----- cache entries found: 2 -----
@mkleehammer Hey, thanks for the reply, casting did not seem to do anything unfortunately. As for the @p1, I reached about 30M in about 15 hours, so I don't know if that's growing fast enough to become an issue, but if you're saying that it won't be a problem then I won't worry too much about it.
Also, turning off connection pooling did not seem to have any effect on the @p1 counter.
More research: https://msdn.microsoft.com/en-us/library/cc293623.aspx
I want this to be as automatic as possible, so I don't want the default solution to require user input.
I can think of some options:
- ask the driver for the right length - requires a server round trip so only good for prepared statements
- supply a rounded length to bucket plans, e.g. rounding the length up to the next multiple of 1024 (see the sketch after this list)
- supply defaults such as hard coded or maximums
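As a rough sketch of the bucketing idea (the 1024 bucket size is just the example from the list above, not a decided default):

def bucket_length(length, bucket=1024):
    # Round the length up to the next bucket boundary so that similar
    # lengths share the same parameter declaration and, ideally, the same plan.
    return ((length + bucket - 1) // bucket) * bucket

# Lengths 1..1024 all bind as 1024, 1025..2048 as 2048, and so on.
assert bucket_length(11) == 1024
assert bucket_length(1025) == 2048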
First I'd like to not prepare a statement until the 2nd use (configurable counter). This will give an immediate performance boost on most databases. (I think SQL Server may buffer the round trip when possible, so less of a boost there. If inserting NULL I have to query the type anyway.)
Then I think I'd like to implement the rounding with configurable boundaries.
Finally, try asking for lengths when preparing to see if it is worth it.
My only concern about rounding is how it might affect plan generation. I don't think it should but I need to see what statistics each database is keeping.
What does everyone think about this?
It seems like I'm getting another performance issue related to this: my table column is defined as bigint, but when the query is executed from pyodbc the parameter is bound as an int, and the execution plan shows an implicit conversion to bigint, which could be impacting my performance.
These issues seem related, and in both cases the performance impact appears to be non-negligible.
Are there any new developments on this?
I'm aware of the int / bigint issue, but I thought it was behaving like the 3.x version.
We don't know the target type for parameters without an execution plan compile and query, so we use int when the value fits ODBC's definition of int and bigint when bigger. The first 4.0 used bigint exclusively but some databases don't even support it so I had to revert to the current logic.
I wouldn't expect a performance hit if the conversion is on the parameter, but SQL Server often misses easy conversions like that. A good C / C++ compiler would use a 64-bit temporary for comparisons in a loop - I don't know why SQL Server can't.
This could be a good argument for manually setting types, but since that requires code changes in the client, would a cast be just as good: "select * from t where id = cast(? as bigint)"?
Well, if it's a varchar, you could just default to a high number like 255 or 8000, since it should not affect performance on variable character fields. This should mostly fix the multiple execution plans for the character case, but it doesn't do much to help with the int/bigint issue.
As for the round trip, it might be worth it, not sure. I think testing would be required to see if it will create extra overhead but if it doesn't, I think it's definitely worth it.
For nulls, I think I noticed earlier that it defaults to varchar instead of checking the type, so handling that is definitely important if it hasn't been done already.
Also casting as bigint did not help, as the explain plan is still converting it.
I still think that, while an automatic way is great, there should still be a way to explicitly set the datatypes, considering many users will already know them, and that would probably be the most efficient approach assuming the datatypes are correct.
Hey mk, you seem to have closed this, does this mean you have a solution? :)
No - I'm not sure what I meant to close. Thanks
Please see the Binding Parameters page of the Wiki where I've collected some thoughts on how to move forward with this and a couple of other related issues.
I think something like pyodbc.set_var_binding_length(1024) might be what we want here.
Thoughts?
Seems like a good solution. So if I understand correctly you would have a command that lets us set the var length (say to 4096), and for nulls you would have to specify how it treats them (varchar vs int)?
If that's the case I do have an issue with the latter: what if I have to pass two variables of different types in the same statement and either or both can be null?
Would it be possible, perhaps, to enable/disable going the roundtrip to get the data types (and cache them for subsequent inserts/updates/etc) only for null values?
Also a note that the max in SQL Server on varchar is 8000, and nvarchar is 4000, so they're not powers of 2.
@ZaM-Rai said:
I still think, while an automatic way is great, there should still be a way to explicitly set the datatypes
I am inclined to agree, and it looks like the DB-API includes Cursor#setinputsizes for just that purpose.
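For reference, PEP 249 treats it as a per-execution hint, roughly like this (the statement, table name, and sizes are illustrative only):

# One entry per parameter: an integer means "maximum string length",
# None means "use the default binding" for that parameter.
cursor.setinputsizes([50, None])
cursor.execute("INSERT INTO t (name, qty) VALUES (?, ?)", "abc", 123)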
I'm curious, any news on this issue?
Ideally, pyodbc would know the datatype/length before inserting and set them accordingly
The parameter array implementation I was working on for https://github.com/mkleehammer/pyodbc/issues/120 does exactly that. You can give it a try at https://github.com/v-chojas/pyodbc/tree/paramarray . It is currently only for executemany() but you can pass in a single row to observe the effect. It also binds according to the type on the server, so you won't be seeing any server-side int/bigint conversions.
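For anyone trying that branch, passing a single row to executemany() just means wrapping the parameter tuple in a one-element list (table name is a placeholder):

# executemany() takes a sequence of parameter rows; a single row works too.
cursor.executemany(
    "INSERT INTO Test.dbo.Test2 (Col1) VALUES (?)",
    [("Arg1",)],
)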
Works like a charm, as far as caching the query plans go!
It does feel a bit more sluggish for some reason, though. Is executemany less efficient than execute for passing a single row?
Also using cursor.fetchone() on a select query does not give me any results (I get the error: Previous SQL was not a query). Any way to make this work with execute as well? This would solve a lot of issues and be very appreciated!
Good to hear that it fixes the execution plan proliferation.
I'm not surprised that it is slower, since that implementation also allocates and converts/copies the data into a parameter array; it was really intended for many-row inserts but as a "side effect" of using a parameter array, you need to know --- and thus bind --- the exact type+length. That error you're getting is just because the implementation is more of a proof-of-concept at the moment, so it's not complete.
It is possible to have execute() perform type discovery too, but as @mkleehammer mentioned above, it will add additional overhead. That overhead may be OK in the case of executemany() where it's only done once for many executions (and used to make those executions much faster, e.g. with parameter arrays), but will definitely slow things down if you only execute once.
Implementing DB-API Cursor.setinputsizes() would be my recommendation too --- perhaps with an extension to allow specifying exact SQL types, so that cases such as varchar vs. char can be distinguished, scale and precision specified for decimal types, etc. I can try to work on this.
I have implemented a simple version of setinputsizes() which only recognises a sequence of integers for specifying the column size for character types. E.g. cursor.setinputsizes([50]). You can try it at: https://github.com/v-chojas/pyodbc/tree/setinputsizes
(This branch does not have the parameter array improvements, so you can try it with regular execute() and it should not be slower.)
Although DB-API states that type objects can be used, it doesn't seem so useful in this case since we are detecting the SQL type from the Python type anyway --- and there's no way to distinguish between int/bigint because they are both mapped from Python integers.
The test failed; the strings are still being bound with their original Python-derived lengths.
Here is my test code, to be used with an SQL Server trace:
cnxn = pyodbc.connect("DSN=ZAM_SQL", autocommit=True)
cursor = cnxn.cursor()
arg1 = 'Arg1'
arg2 = 'Arg12'
arg3 = 'Arg123'
arg4 = 'Arg1234'
arg5 = 'Arg12345'
arg6 = 'Arg123456'
arg7 = 'Arg1234567'
arg8 = 'Arg12345678'
arg9 = 'Arg123456789'
arg10 = 'Arg1234567890'
params = [arg1,arg2,arg3,arg4,arg5,arg6,arg7,arg8,arg9,arg10]
cursor.execute("IF OBJECT_ID('Test.dbo.Test2', 'U') IS NOT NULL DROP TABLE Test.dbo.Test2; ")
cnxn.commit()
cursor.execute("CREATE TABLE Test.dbo.Test2([Col1] [varchar](50) NULL)")
cnxn.commit()
for par in params:
    cursor.setinputsizes([50])
    cursor.execute("INSERT INTO Test.[dbo].Test2(Col1) values (?)", par)  #.setinputsizes([50])
    cnxn.commit()
cursor.close()
del cursor
cnxn.close()
Here is a sample of the trace:
declare @p1 int set @p1=1 exec sp_prepexec @p1 output,N'@P1 varchar(4)',N'INSERT INTO Test.[dbo].Test2(Col1) values (@P1)','Arg1' select @p1
declare @p1 int set @p1=2 exec sp_prepexec @p1 output,N'@P1 varchar(5)',N'INSERT INTO Test.[dbo].Test2(Col1) values (@P1)','Arg12' select @p1
declare @p1 int set @p1=3 exec sp_prepexec @p1 output,N'@P1 varchar(6)',N'INSERT INTO Test.[dbo].Test2(Col1) values (@P1)','Arg123' select @p1
etc...
I see a similar trace with your test script, but with only "@p1 varchar(50)" throughout. Please ensure that you are testing with my branch of pyodbc which has the setinputsizes() implemented. You can check by printing pyodbc.Cursor.setinputsizes.__doc__ . The original and paramarray branches will say "Ignored." while setinputsizes branch has a longer message about its implementation.
I'm getting an error when trying to install it:
src\cursor.cpp(2120) : error C2017: illegal escape sequence
That was due to an extraneous trailing space, which some compilers will not accept. It should be fixed now.
Awesome, it works on my end too :)
Now all that's left is to be able to set it per column including int/bigint. Either way as is I think it's already a pretty big improvement!
For example, the ideal scenario imo would be something like Col1.setcharsize([50]) = 'Abc' and Col2.setintsize([8]) = 123456 (8 for an 8-byte bigint, for example, though that might be more complicated for people who don't know the byte sizes of the integer types; alternatively it could be a string such as tinyint, smallint, int, bigint, etc.)
edit: that didn't make sense; it would be more like Col1 = 'Abc' followed by Col1 = Col1.setcharsize([50])
Not sure if that still makes sense but w/e
Good. You can already set it for multiple columns; just specify an array, and it will use the size if specified:
cursor.setinputsizes([50, 5, 16])
To distinguish between int/bigint (and the other SQL types) will be somewhat trickier; I think a simple way would be to allow specifying a tuple (size/precision, scale, type) e.g.
cursor.setinputsizes([50, (None, None, pyodbc.SQL_BIGINT)])
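Under that proposal, a mixed varchar/bigint statement might then look something like this (table and column names are placeholders, and the tuple layout is the one suggested above rather than a finalised API):

# 50 is a plain character size for the first parameter; the tuple forces a
# SQL_BIGINT binding for the second, per the (size/precision, scale, type) proposal.
cursor.setinputsizes([50, (None, None, pyodbc.SQL_BIGINT)])
cursor.execute("INSERT INTO t (name, big_id) VALUES (?, ?)", "abc", 123456789012)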
Awesome, yeah in my mind it would be best if it were somehow attached to the variable itself, but it's fine as well if it's on the cursor as long as it resets after each cursor use. The advantage of having it on the variable is that it won't change and you won't need to call the size/type every time the variable is used. If that's not practical though, your way is more than good enough.
I've updated the Binding Parameters wiki with another problem (Informix #260) and another proposal. Feedback?
Seems like a good plan in theory, this would probably solve it rather nicely as it wouldn't require setting any data types manually. I'm looking forward to trying it out!
@mkleehammer so, to summarise, the SQL type (and length/precision) could be obtained by:
- Explicit setting
- SQLDescribeParam
- Python type
in order of precedence. However, I'm not sure the benefits of adding an explicit cache here would outweigh the complexity and edge cases it could introduce, since ODBC already has a mechanism for something similar: prepared statements. SQLDescribeParam may be slow, but for a single parameterised query there is no choice but to call it if you want to bind with the correct SQL type, and if you execute that same query more than once you should be using prepared statements/executemany(), in which case it is still only a single SQLDescribeParam call per parameter.
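As a concrete example of that reuse, a batch of inserts through executemany() needs the parameter metadata only once, however many rows are sent (table name as in the earlier test script):

# One SQLDescribeParam call per parameter, then the whole batch is bound.
rows = [("Arg1",), ("Arg12",), ("Arg123",)]
cursor.executemany("INSERT INTO Test.dbo.Test2 (Col1) VALUES (?)", rows)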
In particular, the case where the schema changes is important to consider: in such a case, pyodbc-cached information may no longer be correct.