
Document performance tuning for destination memory usage

Open boxydog opened this issue 1 year ago • 1 comment

Documentation description

From https://dlthub-community.slack.com/archives/C04DQA7JJN6/p1720800651873809:

I'm trying to load 100,000 rows of data into a DB (~168M). It works fine on my laptop (16G 2015 macbook pro). On a 2G DB Postgres server on our cloud provider, it exceeds memory usage and kills the DB. Any tips on how to reduce memory usage in Postgres 16 while loading using dlt? I read https://dlthub.com/docs/reference/performance, and it looks like there are a lot of knobs. Is there one that delivers the data to the server in smaller chunks?

Lowering normalize.data_writer.file_max_items allows the load to complete. My guess is that it changes the size of the chunks during the load phase.

I'd recommend changing the performance page to note this fact (if it is true): "If your destination is running out of memory, setting normalize.data_writer.file_max_items lower can reduce the memory required, because it uses smaller chunks in the load phase."

Also, documenting how to find the default values of all settings would be useful. I've set file_max_items to 5000, but what was it before I set it? I can file a separate issue if needed.
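For reference, a minimal sketch of the setting I used, assuming config.toml is the right place for it (5000 is just the value from my experiment, not a recommendation):

```toml
[normalize.data_writer]
file_max_items = 5000  # rotate intermediary files after 5000 rows
```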

Are you a dlt user?

Yes, I'm already a dlt user.

boxydog avatar Jul 12 '24 20:07 boxydog

Note: Dave on Slack suggested a "troubleshooting" section. FYI, here are the types of errors I saw when this was going wrong (with some counts from sort | uniq on the command line):

  14 psycopg2.InterfaceError: connection already closed
  10 psycopg2.OperationalError: SSL SYSCALL error: EOF detected
   8 psycopg2.OperationalError: connection to server at "myserver" (ip), port 5432 failed: server closed the connection unexpectedly
   4 psycopg2.OperationalError: connection to server at "myserver" (ip), port 5432 failed: FATAL:  the database system is shutting down
   4 psycopg2.OperationalError: SSL connection has been closed unexpectedly
   2 psycopg2.OperationalError: connection to server at "myserver" (ip), port 5432 failed: FATAL:  the database system is starting up
   1 psycopg2.OperationalError: connection to server at "myserver" (ip), port 5432 failed: FATAL:  the database system is starting up
   1 <class 'psycopg2.OperationalError'>

boxydog avatar Jul 12 '24 21:07 boxydog

Note to be added:

Note: Adjusting the file_max_items and file_max_bytes settings under the [normalize.data_writer] configuration can also affect load behavior. Setting a lower value for file_max_items reduces the size of each data chunk sent to the destination database, which is particularly useful when the database server is memory-constrained.
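A hedged sketch of how this could look in config.toml (the values below are illustrative, not recommendations):

```toml
[normalize.data_writer]
# Rotate the intermediary file after this many rows; smaller files mean
# the load phase sends smaller chunks to the destination.
file_max_items = 10000
# Optionally also rotate once a file reaches this size (in bytes).
file_max_bytes = 100000000
```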

dat-a-man avatar Aug 24 '24 09:08 dat-a-man

@AstrakhantsevaAA please review.

dat-a-man avatar Aug 24 '24 09:08 dat-a-man

@dat-a-man actually, it works a bit differently

By setting a lower value for file_max_items, you effectively reduce the size of each data chunk sent to the destination database.

This statement is correct, but with a few nuances:

  • file_max_items controls how many rows are written to each intermediary file.
  • Setting this parameter also enables file rotation; by default, dlt writes all data rows into one big intermediary file.
  • This means that if you set this parameter explicitly, dlt inserts your data in chunks (file by file); otherwise dlt tries to insert all your data from one huge file (see the config sketch below).
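In config.toml terms, the difference looks roughly like this (a sketch; the row count is a placeholder, and the env-var name assumes dlt's usual double-underscore naming convention):

```toml
# No file_max_items set: dlt writes all rows into one big intermediary
# file and tries to load it in a single pass.

# With an explicit limit, files are rotated and loaded chunk by chunk:
[normalize.data_writer]
file_max_items = 10000  # placeholder; tune to what the destination can absorb
# Equivalent environment variable: NORMALIZE__DATA_WRITER__FILE_MAX_ITEMS
```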

Go through this Colab notebook to see exactly how it works: https://colab.research.google.com/drive/1aC2V27rNko2dLkb2IP4UmiyhtD3MoqFX

AstrakhantsevaAA avatar Aug 25 '24 19:08 AstrakhantsevaAA