
ODBC bulk copy operations request

Open · keitherskine opened this issue 7 years ago · 5 comments

This isn't an issue as such; it's more of a request. It's a bit specific to SQL Server, but are there any plans to implement the ODBC bulk copy functions described here: https://docs.microsoft.com/en-us/sql/relational-databases/native-client-odbc-extensions-bulk-copy-functions/sql-server-driver-extensions-bulk-copy-functions

If I could do bcp-style operations through pyodbc rather than through the command line, that would be very helpful for me.

keitherskine commented on Feb 20, 2018

Have you tried the fast_executemany feature? https://github.com/mkleehammer/pyodbc/wiki/Features-beyond-the-DB-API It's not bcp, but the speed is quite decent.
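
Enabling it is a one-line change on the cursor. A minimal sketch, assuming an existing pyodbc connection, a two-column target table, and a list of parameter tuples (the table and variable names here are made up):

crsr = cnxn.cursor()
crsr.fast_executemany = True   # send parameters to the server in bulk rather than row by row
crsr.executemany("INSERT INTO some_table (id, txt) VALUES (?, ?)", rows)
cnxn.commit()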

boumboum commented on Feb 22, 2018

I was curious to see if there was a significant difference in speed between the two approaches, so I did a quick informal test using a CSV file with 100,000 rows ...

1,record000000
2,record000001
3,record000002
...
100000,record099999

... inserting into an empty table on a remote SQL Server instance.

The times for bcp to do the insert were highly variable, ranging from 7 seconds to 32 seconds.

For fast_executemany to do the insert, i.e.,

import csv
import time

import pyodbc

cnxn = pyodbc.connect(conn_str, autocommit=False)  # conn_str defined elsewhere
crsr = cnxn.cursor()
crsr.fast_executemany = True
t0 = time.time()
# read the entire CSV into memory as a list of (int, str) tuples
with open(r'C:\Users\Gord\Desktop\test100k.csv', newline='') as csvfile:
    rdr = csv.reader(csvfile, delimiter=',', quoting=csv.QUOTE_NONE)
    data = [(int(row[0]), row[1]) for row in rdr]
sql = 'INSERT INTO test100k (id, txt) VALUES (?,?)'
crsr.executemany(sql, data)
cnxn.commit()
print(f"{(time.time() - t0):.1f} seconds")

the times were more consistent, ranging from 20 seconds to 30 seconds.

So, fast_executemany may turn out to be slower than bcp, but perhaps not outrageously so.

gordthompson commented on Mar 4, 2018

Thank you for running some tests @gordthompson . In my case, I am importing gigabytes, if not tens of gigabytes, of data on a regular basis, so holding all of that in memory would be problematic. Yes, I could split the data into chunks, but it's still a lot simpler to just load the data directly from a disk file into the database. Also, taking the best times from your tests, fast_executemany appears to be about 3 times slower than bcp. That's not bad, but it is still quite a considerable difference.
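
(For what it's worth, the chunked approach would look roughly like the sketch below; conn_str, the batch size, the file path, and the table name are all placeholders, and it is still more moving parts than a single bcp invocation.)

import csv
import itertools

import pyodbc

BATCH_SIZE = 10_000  # arbitrary; tune to taste

cnxn = pyodbc.connect(conn_str, autocommit=False)
crsr = cnxn.cursor()
crsr.fast_executemany = True
sql = 'INSERT INTO test100k (id, txt) VALUES (?,?)'

with open(r'C:\data\test100k.csv', newline='') as csvfile:
    rdr = csv.reader(csvfile, delimiter=',', quoting=csv.QUOTE_NONE)
    rows = ((int(r[0]), r[1]) for r in rdr)   # lazy generator, nothing buffered up front
    while True:
        batch = list(itertools.islice(rows, BATCH_SIZE))
        if not batch:
            break
        crsr.executemany(sql, batch)          # only one batch held in memory at a time
cnxn.commit()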

Right now, I run bcp commands through Python using subprocess, which does work of course, but if running bulk loads through pyodbc were available, I would almost certainly use it.
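
For reference, the subprocess call is roughly the sketch below (the server, database, table, and file names are made up, and -T assumes a trusted connection):

import subprocess

subprocess.run(
    [
        'bcp', 'MyDatabase.dbo.test100k', 'in', r'C:\data\test100k.csv',
        '-S', 'myserver',
        '-T',     # trusted connection; use -U/-P for SQL auth instead
        '-c',     # character data
        '-t,',    # comma field terminator
    ],
    check=True,
)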

keitherskine commented on Mar 5, 2018

I vaguely recall there being language-specific libraries from different database vendors - something like Sybase's Perl bcp library. If I'm not mistaken, it allowed bcp calls using in-memory data, unlike the bcp command-line utility. Is this no longer a thing?

nrhy commented on Apr 16, 2018

I'm a bit late to the party, but in case anyone still cares, we wrote this around 2008 and have been using it in production. It was open-sourced around 2013:

https://github.com/damionw/pythonbcp

damionw commented on Jun 9, 2019