Document performance tuning for destination memory usage
Documentation description
From https://dlthub-community.slack.com/archives/C04DQA7JJN6/p1720800651873809:
I'm trying to load 100,000 rows of data into a DB (~168M). It works fine on my laptop (16G 2015 macbook pro). On a 2G DB Postgres server on our cloud provider, it exceeds memory usage and kills the DB. Any tips on how to reduce memory usage in Postgres 16 while loading using dlt? I read https://dlthub.com/docs/reference/performance, and it looks like there are a lot of knobs. Is there one that delivers the data to the server in smaller chunks?
Lowering normalize.data_writer.file_max_items allows the load to complete. My guess is that it changes the size of the chunks during the load phase.
I'd recommend changing the performance page to note this fact (if it is true): "If your destination is running out of memory, setting normalize.data_writer.file_max_items lower can reduce the memory required, because it uses smaller chunks in the load phase."
Also, documenting how to find the default values of all settings would be useful. I've set file_max_items to 5000, but what was it before I set it? I can file a separate issue if needed.
Are you a dlt user?
Yes, I'm already a dlt user.
Note: Dave on Slack suggested a "troubleshooting" section. FYI, here are the types of errors I saw when this was going wrong (with some counts from `sort | uniq` on the command line):
14 psycopg2.InterfaceError: connection already closed
10 psycopg2.OperationalError: SSL SYSCALL error: EOF detected
8 psycopg2.OperationalError: connection to server at "myserver" (ip), port 5432 failed: server closed the connection unexpectedly
4 psycopg2.OperationalError: connection to server at "myserver" (ip), port 5432 failed: FATAL: the database system is shutting down
4 psycopg2.OperationalError: SSL connection has been closed unexpectedly
2 psycopg2.OperationalError: connection to server at "myserver" (ip), port 5432 failed: FATAL: the database system is starting up
1 psycopg2.OperationalError: connection to server at "myserver" (ip), port 5432 failed: FATAL: the database system is starting up
1 <class 'psycopg2.OperationalError'>
Note to be added:
Note: Adjusting the `file_max_items` and `file_max_bytes` settings under the `[normalize.data_writer]` configuration can also affect the load behavior. By setting a lower value for `file_max_items`, you effectively reduce the size of each data chunk sent to the destination database. This can be particularly useful when dealing with memory constraints on the database server.
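To make the note concrete in the docs, a short snippet could show where these settings go. This is a minimal sketch, assuming dlt's environment-variable configuration provider (nested config keys joined with double underscores); the exact values are examples, not recommendations:

```python
# Sketch: lower the intermediary-file rotation thresholds so the load phase
# sends data to the destination in smaller chunks. These env vars mirror the
# [normalize.data_writer] section of config.toml.
import os

os.environ["NORMALIZE__DATA_WRITER__FILE_MAX_ITEMS"] = "5000"                  # rotate after 5,000 rows
os.environ["NORMALIZE__DATA_WRITER__FILE_MAX_BYTES"] = str(10 * 1024 * 1024)   # or after ~10 MB
```

Equivalently, the same keys can be placed under `[normalize.data_writer]` in `config.toml`.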
@AstrakhantsevaAA please review.
@dat-a-man actually, it works a bit differently
By setting a lower value for file_max_items, you effectively reduce the size of each data chunk sent to the destination database.
This statement is correct, but with a few nuances:
- `file_max_items` controls how many rows are written to an intermediary file.
- This parameter also enables file rotation; by default, dlt writes all data rows into one big intermediary file.
- That means if you set this parameter explicitly, dlt inserts your data in chunks (file by file); otherwise, dlt will try to insert all your data from one huge file.
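To illustrate the rotation behavior end to end, here is a hedged sketch of a toy pipeline; the resource, pipeline, and dataset names are made up for illustration, and Postgres credentials are assumed to come from `secrets.toml` or environment variables as usual:

```python
import os
import dlt

# Rotate intermediary files every 5,000 rows, so the normalize step produces
# several small files and the load phase inserts them one by one instead of
# loading everything from a single huge file (the value is illustrative).
os.environ["NORMALIZE__DATA_WRITER__FILE_MAX_ITEMS"] = "5000"

@dlt.resource(table_name="events")
def events():
    # Generate ~100,000 small rows, roughly matching the scenario above.
    for i in range(100_000):
        yield {"id": i, "payload": "x" * 100}

pipeline = dlt.pipeline(
    pipeline_name="memory_tuning_demo",
    destination="postgres",
    dataset_name="perf_test",
)

load_info = pipeline.run(events())
print(load_info)
```

With rotation enabled, each intermediary file becomes a separate insert into the destination, which is what keeps the memory needed per statement on the database server lower.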
Go through this Colab notebook to understand exactly how it works: https://colab.research.google.com/drive/1aC2V27rNko2dLkb2IP4UmiyhtD3MoqFX