Performance issue with write operations
The write operation creates a connection per bundle. Bundles can be very small in streaming mode, causing a flood of short-lived connections; without connection pooling, this can become a performance bottleneck. Also, having a batch write option would significantly reduce the number of writes.
+1
I have a pull request to suggest; I can add it here in a little bit.
The idea is to emulate how we write in batches with the JdbcIO connector for Java.
We can call .write_record() without calling .commit() every time. We then move the .commit() into .finishBundle(), plus a little logic for committing every N records so that a single commit doesn't grow too large.
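Roughly, something like this (a sketch, not the connector's actual API; the insert statement and `connection_fn` factory are stand-ins for the existing per-record write logic):

```python
import apache_beam as beam

class BatchedWriteFn(beam.DoFn):
    """Sketch: buffer writes and commit every MAX_BATCH_SIZE records."""

    MAX_BATCH_SIZE = 500  # arbitrary cap so a single commit never grows too large

    def __init__(self, connection_fn):
        # connection_fn is a picklable factory returning a DB-API connection.
        self._connection_fn = connection_fn

    def start_bundle(self):
        self._connection = self._connection_fn()
        self._uncommitted = 0

    def process(self, record):
        # Stand-in for the connector's per-record insert; note: no commit here.
        with self._connection.cursor() as cursor:
            cursor.execute("INSERT INTO my_table VALUES (%s, %s)", record)
        self._uncommitted += 1
        if self._uncommitted >= self.MAX_BATCH_SIZE:
            self._connection.commit()
            self._uncommitted = 0

    def finish_bundle(self):
        if self._uncommitted:
            self._connection.commit()  # flush the tail of the bundle
        self._connection.close()
```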
Additionally, opening a new connection is an expensive operation on the server side. This stackexchange post talks more about why. Moving the connections to a static pool allows them to be reused, which is best practice for SQL servers.
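Something like a process-wide pool would do it (a sketch assuming psycopg2; the class name, DSN, and pool sizes are placeholders):

```python
import threading
from psycopg2 import pool

class SharedConnectionPool:
    """Sketch: one pool per worker process, shared by all DoFn instances."""

    _lock = threading.Lock()
    _pool = None

    @classmethod
    def get(cls, dsn, minconn=1, maxconn=10):
        with cls._lock:
            if cls._pool is None:
                cls._pool = pool.ThreadedConnectionPool(minconn, maxconn, dsn)
            return cls._pool

# Usage inside a DoFn: borrow in start_bundle, return in finish_bundle.
# connection = SharedConnectionPool.get("dbname=test").getconn()
# ... write and commit ...
# SharedConnectionPool.get("dbname=test").putconn(connection)
```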
An added advantage of writing in batches after a GBK is that all transactions for an affected record stay in one place. This drastically reduces thread contention in the SQL server, since threads (each handling a connection and a transaction) won't be waiting to access the same row. Lastly, it lets users take advantage of server-side optimizations, since everything is submitted as one transaction.
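As a concrete pipeline shape (hypothetical; `WriteGroupedFn`, the table, and the DSN are made up for illustration):

```python
import apache_beam as beam
import psycopg2

def make_connection():
    return psycopg2.connect("dbname=test")  # placeholder DSN

class WriteGroupedFn(beam.DoFn):
    """Sketch: after GroupByKey each element is (key, rows); writing the whole
    group in one transaction keeps every write for that key together."""

    def setup(self):
        self._connection = make_connection()

    def process(self, element):
        key, rows = element
        with self._connection.cursor() as cursor:
            for row in rows:
                cursor.execute("INSERT INTO my_table VALUES (%s, %s)", row)
        self._connection.commit()  # one commit covers all rows for this key

    def teardown(self):
        self._connection.close()

with beam.Pipeline() as p:
    (p
     | beam.Create([("k1", ("a", 1)), ("k1", ("b", 2)), ("k2", ("c", 3))])
     | beam.GroupByKey()
     | beam.ParDo(WriteGroupedFn()))
```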
Thanks @chishankar-work; I'll have a look at your PR soon.
What is the state of this improvement? I'm worried that all the single inserts will hammer my PostgreSQL database very hard.
It's a bit tricky, in my opinion, when I want a custom UPSERT and, at the same time, want to do it in batch mode.
@chishankar-work will your PR support this feature?
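For reference, the kind of thing I mean (a sketch assuming psycopg2 and PostgreSQL's `INSERT ... ON CONFLICT` syntax; the table, columns, and DSN are made up):

```python
import psycopg2
from psycopg2.extras import execute_values

UPSERT_SQL = """
    INSERT INTO my_table (id, value) VALUES %s
    ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value
"""

def upsert_batch(connection, rows):
    # execute_values expands the batch into one multi-row statement, so the
    # whole UPSERT hits PostgreSQL in a single round trip and commit.
    with connection.cursor() as cursor:
        execute_values(cursor, UPSERT_SQL, rows)
    connection.commit()

connection = psycopg2.connect("dbname=test")
upsert_batch(connection, [(1, "a"), (2, "b"), (3, "c")])
```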
@mohaseeb It seems that the original author dropped this feature. I'm happy to prepare a new PR.
Did this enhancement ever get merged in?