data-import
split import transaction
Currently each import block runs in a single transaction. This can cause issues with large datasets. My research showed that there is (practically) no hard limit on the number of statements per transaction, but the bottleneck is the transaction log (or write-ahead log), which can consume a lot of memory.
I propose that we split the whole import into a configurable number of transactions. That way the memory usage of the transaction log stays low. The performance loss should be acceptable as long as we keep the number of statements per transaction above a few thousand.
Importing 500'000 records from MSSQL to Postgres in a single transaction used approximately 200MB of RAM on my machine. If you extrapolate that to a few million records, the memory usage explodes.
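For illustration, here is a minimal sketch of the batching idea, not data-import's actual internals. It assumes plain DB-API style connections (pyodbc for the MSSQL source and psycopg2 for the Postgres target are stand-ins), and the table names, column names and `BATCH_SIZE` value are made up:

```python
# Illustrative sketch of chunked commits, not data-import's actual code.
# Drivers, table/column names and BATCH_SIZE are assumptions.
import pyodbc      # assumed source driver (MSSQL)
import psycopg2    # assumed target driver (Postgres)

BATCH_SIZE = 10_000  # configurable; a few thousand keeps commit overhead small

def import_in_batches(source_conn, target_conn, batch_size=BATCH_SIZE):
    src = source_conn.cursor()
    dst = target_conn.cursor()
    src.execute("SELECT id, name FROM legacy_records")  # assumed source table

    pending = 0
    for row in src:
        dst.execute("INSERT INTO records (id, name) VALUES (%s, %s)", tuple(row))
        pending += 1
        if pending >= batch_size:
            # Commit the batch so the transaction log never has to hold
            # more than batch_size statements.
            target_conn.commit()
            pending = 0

    # Commit the final, possibly smaller batch.
    target_conn.commit()
    # Note: if an insert fails, only the current uncommitted batch is rolled
    # back; earlier batches stay committed (the consistency concern below).
```

Committing every few thousand rows keeps the write-ahead log small; the trade-off is that a failure after some commits leaves earlier batches in the target, which is exactly the consistency question raised in the next comment.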
What do we do about data consistency? At the moment, when the import fails, it rolls back the current transaction, leaving the import script in a somewhat usable state. If we have multiple transactions, this could lead to half-migrated tables. I don't think it would be a big deal right now, but if we want to implement #5 this will be important. @stmichael what do you think?
That's true, data consistency is a problem. I just opened this ticket so we don't forget about this. As I said, there is no hard limit on the number of statements per transaction; that depends on the resources of your system.
I propose we leave this ticket open as an idea and postpone it until it is actually needed.
I'm fine with that.