improve mssql insert speed with `fast_executemany` and `BULK INSERT`
Background
The way we insert rows into mssql is inefficient. We should switch to bulk copy. The MS ODBC drivers come with the `bcp` command that we can use.
https://github.com/yehoshuadimarsky/bcpandas/blob/master/bcpandas/utils.py
is a project that does that quite well
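For reference, bcpandas essentially shells out to the `bcp` utility and feeds it a csv file. A rough sketch of that approach, with server, credentials, table and file names as placeholders (none of them come from this ticket), could look like this:

```python
import subprocess

# Placeholders only: server, credentials, table and csv path are illustrative
cmd = [
    "bcp", "my_db.dbo.my_table", "in", "/tmp/load_file.csv",
    "-S", "my_server.database.windows.net",
    "-U", "loader", "-P", "secret",
    "-c",               # character (text) mode
    "-t", ",",          # field terminator for the csv
    "-F", "2",          # skip the header row
    "-b", "50000",      # commit in batches of 50k rows
    "-e", "/tmp/bcp_errors.log",
]
subprocess.run(cmd, check=True)
```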
Tasks
- [ ] allow `mssql` and `synapse` to handle `csv`
- [ ] use `bcp` in copy jobs
- [ ] it looks like we can pass a DSN as credentials and specify the file format to copy on the fly (see the sketch below): https://learn.microsoft.com/en-us/sql/tools/bcp-utility?view=sql-server-ver16
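On that last point, a hedged sketch of a DSN-based invocation, with the DSN name, table and format file made up for illustration (the `-D` and `-f` flags are documented on the bcp page linked above):

```python
import subprocess

# Assumed for illustration: "dlt_mssql" is an ODBC DSN holding server/credentials,
# and my_table.fmt is a bcp format file describing the csv layout
subprocess.run(
    [
        "bcp", "my_db.dbo.my_table", "in", "/tmp/load_file.csv",
        "-D", "-S", "dlt_mssql",    # -D: treat the -S value as a DSN
        "-f", "/tmp/my_table.fmt",  # column/format definition supplied at load time
    ],
    check=True,
)
```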
@rudolfix Do we also want to do this for Synapse? For Synapse we already implemented `SynapseCopyFileLoadJob` based on COPY INTO, which is the recommended approach:
"While dedicated SQL pools support many loading methods, including popular SQL Server options such as bcp and the SqlBulkCopy API, the fastest and most scalable way to load data is through PolyBase external tables and the COPY statement.
With PolyBase and the COPY statement, you can access external data stored in Azure Blob storage or Azure Data Lake Store via the T-SQL language. For the most flexibility when loading, we recommend using the COPY statement."
Source: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/design-elt-data-loading
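For readers less familiar with it, a rough sketch of such a COPY load follows; the storage URL, SAS secret and target table are placeholders and this is not the actual `SynapseCopyFileLoadJob` implementation:

```python
import pyodbc

# Placeholders only; not the actual dlt/Synapse configuration
copy_stmt = """
COPY INTO dbo.my_table
FROM 'https://myaccount.blob.core.windows.net/staging/load_file.csv'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<sas-token>'),
    FIELDTERMINATOR = ',',
    FIRSTROW = 2
)
"""

# synapse_connection_string is assumed to be defined elsewhere
with pyodbc.connect(synapse_connection_string, autocommit=True) as cnxn:
    cnxn.execute(copy_stmt)
```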
@jorritsandbrink it may just work for Synapse if we implement it, i.e. via csv files + bcp, but it is not required. The reason for this ticket is the abysmal insert performance of SQL Server; it is WAY slower than Postgres. Nevertheless, the priority of this ticket is quite low (at least for now).
Chiming in here; one of the big holdbacks for shifting to dlt is the abysmal ODBC insert performance compared to bcp.
Some of our sources are tall and wide, so bcp helps a ton with performance; if this comes to be, I would be thrilled!
It does require more consideration of the nuances of the data (datetimes/floats/column ordering/etc.) than sqlalchemy, though.
Hi @rudolfix , for an mssql destination: apart from loading in parallel/batches and the current approach of arrayed inserts or using bcp for bulk loading, would it be possible to also use a single prepared/parameterized insert with the `fast_executemany` driver option? This would leave the handling of arrayed inserts to the driver and may lead to better bulk loading performance. Something like this:
```python
import pyodbc

# connection_string is assumed; #tmp is a temp table created earlier in the session
cnxn = pyodbc.connect(connection_string, autocommit=True)

crsr = cnxn.cursor()
data = crsr.execute("SELECT * FROM MillionRows").fetchall()
crsr.close()
print("data loaded")

# fast_executemany sends the parameter array in bulk instead of one round trip per row
crsr = cnxn.cursor()
ins = "INSERT INTO #tmp (ID, TextField, LongIntegerField, DoubleField, varchar_column) VALUES (?,?,?,?,?)"
crsr.fast_executemany = True
crsr.executemany(ins, data)
crsr.close()
```
I could create a SQLAlchemy engine like this: `engine = create_engine(destination_conn, fast_executemany=True)`. But how do I get this configuration into a dlt destination? I can't find a way to inject it into a created destination: `destination = dlt.destinations.mssql(credentials)`; `destination.spec` seems not to be the right way to pass SQLAlchemy engine parameters.
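For reference, the pure SQLAlchemy side of this is a documented engine argument of the mssql+pyodbc dialect; the open question is only how to route it through dlt. A minimal sketch, with the connection URL and table as placeholders:

```python
from sqlalchemy import create_engine

# Placeholder URL; fast_executemany is a documented mssql+pyodbc engine argument
engine = create_engine("mssql+pyodbc://user:password@my_dsn", fast_executemany=True)

with engine.begin() as conn:
    # executemany-style insert; the pyodbc driver batches the parameter array
    conn.exec_driver_sql(
        "INSERT INTO my_table (id, name) VALUES (?, ?)",
        [(1, "a"), (2, "b")],
    )
```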
The post here is somewhat old and the current pyodbc version doesn't require intercepting events anymore: https://medium.com/analytics-vidhya/speed-up-bulk-inserts-to-sql-db-using-pandas-and-python-61707ae41990
But you can see the possible performance gain compared to other approaches. Meanwhile, I also understand that using prepared statements alone is no improvement, as the number of network round trips stays the same. So fast_executemany will also batch inserts, but I guess it will use a real arrayed insert, maybe even bypassing transaction management (thus autocommit needs to be disabled) like real bulk loading would.