data-import
split import transaction
Currently each import block runs in a single transaction. This can cause issues with large datasets. My research showed that there is (practically) no hard limit on the number of statements per transaction, but the bottleneck is the transaction log (or write-ahead log), which can consume a lot of memory.
I propose that we split the whole import into a configurable number of transactions. That way the memory usage of the transaction log stays low. The performance loss should be acceptable as long as we keep the number of statements per transaction above a few thousand.
Importing 500'000 records from MSSQL to Postgres in a single transaction used approximately 200MB of RAM on my machine. If you extrapolate that to a few million records, the memory usage explodes.
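For illustration, here is a minimal sketch of the batching idea, not data-import's actual internals. It assumes plain DB-API style connections (pyodbc for the MSSQL source and psycopg2 for the Postgres target are stand-ins), and the table names, column names and `BATCH_SIZE` value are made up:

```python
# Illustrative sketch of chunked commits, not data-import's actual code.
# Drivers, table/column names and BATCH_SIZE are assumptions.
import pyodbc      # assumed source driver (MSSQL)
import psycopg2    # assumed target driver (Postgres)

BATCH_SIZE = 10_000  # configurable; a few thousand keeps commit overhead small

def import_in_batches(source_conn, target_conn, batch_size=BATCH_SIZE):
    src = source_conn.cursor()
    dst = target_conn.cursor()
    src.execute("SELECT id, name FROM legacy_records")  # assumed source table

    pending = 0
    for row in src:
        dst.execute("INSERT INTO records (id, name) VALUES (%s, %s)", tuple(row))
        pending += 1
        if pending >= batch_size:
            # Commit the batch so the transaction log never has to hold
            # more than batch_size statements.
            target_conn.commit()
            pending = 0

    # Commit the final, possibly smaller batch.
    target_conn.commit()
    # Note: if an insert fails, only the current uncommitted batch is rolled
    # back; earlier batches stay committed (the consistency concern below).
```

Committing every few thousand rows keeps the write-ahead log small; the trade-off is that a failure after some commits leaves earlier batches in the target, which is exactly the consistency question raised in the next comment.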
What do we do about data consistency? At the moment, when the import fails, it rolls back the current transaction, leaving the import script in a somewhat usable state. If we have multiple transactions, this could lead to half-migrated tables. I don't think it would be a big deal right now, but if we want to implement #5 this will be important. @stmichael what do you think?
That's true, data consistency is a problem. I just opened this ticket so we don't forget about this. As I said, there is no hard limit on the number of statements per transaction; that depends on the resources of your system.
I propose we leave this ticket open as an idea and postpone it until it is actually needed.
I'm fine with that.