pyodbc
ODBC bulk copy operations request
This isn't an issue as such, more of a request. It's a bit specific to SQL Server, but are there any plans to implement the ODBC bulk copy functions described here: https://docs.microsoft.com/en-us/sql/relational-databases/native-client-odbc-extensions-bulk-copy-functions/sql-server-driver-extensions-bulk-copy-functions
If I could do bcp-style operations through pyodbc rather than through the command line, that would be very helpful for me.
Have you tried the fast_executemany feature? https://github.com/mkleehammer/pyodbc/wiki/Features-beyond-the-DB-API It is not bcp, but the speed is quite decent.
I was curious to see if there was a significant difference in speed between the two approaches, so I did a quick informal test using a CSV file with 100,000 rows ...
1,record000000
2,record000001
3,record000002
...
100000,record099999
... inserting into an empty table on a remote SQL Server instance.
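(For anyone who wants to reproduce the test data, a file like the sample above can be generated with a few lines of Python; the output file name is just an example.)
import csv

# Write 100,000 rows in the form "1,record000000" ... "100000,record099999"
with open('test100k.csv', 'w', newline='') as f:
    wtr = csv.writer(f)
    for i in range(100_000):
        wtr.writerow([i + 1, f'record{i:06d}'])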
The times for bcp to do the insert were highly variable, ranging from 7 seconds to 32 seconds.
For fast_executemany to do the insert, i.e.,
import csv
import time

import pyodbc

cnxn = pyodbc.connect(conn_str, autocommit=False)  # conn_str defined elsewhere
crsr = cnxn.cursor()
crsr.fast_executemany = True
t0 = time.time()
data = []
with open(r'C:\Users\Gord\Desktop\test100k.csv', newline='') as csvfile:
    rdr = csv.reader(csvfile, delimiter=',', quoting=csv.QUOTE_NONE)
    data = [(int(row[0]), row[1]) for row in rdr]
sql = 'INSERT INTO test100k (id, txt) VALUES (?,?)'
crsr.executemany(sql, data)
cnxn.commit()
print(f"{(time.time() - t0):.1f} seconds")
the times were more consistent, ranging from 20 seconds to 30 seconds.
So, fast_executemany may turn out to be slower than bcp, but perhaps not outrageously so.
Thank you for running some tests, @gordthompson. In my case, I am importing gigabytes, if not tens of gigabytes, of data on a regular basis, so holding all of that in memory would be problematic. Yes, I could split the data into chunks, but it's still a lot simpler to load the data directly from a disk file into the database. Also, taking the best times from your tests, fast_executemany appears to be roughly 3 times slower than bcp. That's not bad, but it is still a considerable difference.
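(For what it's worth, a chunked load that keeps memory bounded could look roughly like the sketch below; the table name, file path, connection string, and batch size are assumptions carried over from the earlier example, not a tested recipe.)
import csv
import itertools

import pyodbc

BATCH_SIZE = 50_000  # assumed; tune to available memory

cnxn = pyodbc.connect(conn_str, autocommit=False)  # conn_str defined elsewhere
crsr = cnxn.cursor()
crsr.fast_executemany = True

sql = 'INSERT INTO test100k (id, txt) VALUES (?,?)'
with open(r'C:\path\to\data.csv', newline='') as csvfile:
    rdr = csv.reader(csvfile, delimiter=',', quoting=csv.QUOTE_NONE)
    while True:
        # read and insert the next batch instead of holding the whole file in memory
        batch = [(int(row[0]), row[1]) for row in itertools.islice(rdr, BATCH_SIZE)]
        if not batch:
            break
        crsr.executemany(sql, batch)
cnxn.commit()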
Right now, I run bcp commands through Python using subprocess, which does work of course, but if running bulk loads through pyodbc were available I would almost certainly use it.
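(For anyone curious, invoking bcp via subprocess can look roughly like this sketch; the server, database, table, and file names are placeholders, not the actual values used.)
import subprocess

# Shell out to the bcp command-line utility
cmd = [
    'bcp', 'dbo.test100k', 'in', r'C:\path\to\data.csv',
    '-S', 'myserver',
    '-d', 'mydb',
    '-T',         # Windows (trusted) authentication
    '-c',         # character-mode data file
    '-t', ',',    # comma field terminator
]
subprocess.run(cmd, check=True)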
I vaguely recall there being language-specific libraries from different database vendors, something like Sybase's Perl bcp library. If I'm not mistaken, it allowed bcp calls using in-memory data, unlike the bcp command-line utility. Is this no longer a thing?
I'm a bit late to the party but, in case anyone still cares, we wrote this around 2008 and have been using it in production. It was open-sourced around 2013:
https://github.com/damionw/pythonbcp