
Batched Python Model Still Loading all Data into Memory?

Open · kylejbrk opened this issue 8 months ago · 5 comments

I took the batching example from the current README and used my large, multi-gigabyte model to test the chunking:

import pyarrow as pa

def batcher(batch_reader: pa.RecordBatchReader):
    for batch in batch_reader:
        df = batch.to_pandas()
        # Do some operations on the DF...
        # ...then yield back a new batch
        yield pa.RecordBatch.from_pandas(df)

def model(dbt, session):
    big_model = dbt.ref("big_model")
    batch_reader = big_model.record_batch(1000)
    batch_iter = batcher(batch_reader)
    return pa.RecordBatchReader.from_batches(batch_reader.schema, batch_iter)

Then I watched my Docker container's usage to see whether memory increased as the code executed or stayed flat. The longer the code ran, the more memory it used, climbing from roughly 0 to around 12 GB; the whole reason I wanted to chunk was to keep this under 8 GB. So maybe I'm misunderstanding how this works. It does chunk the data, but pa.RecordBatchReader.from_batches(batch_reader.schema, batch_iter) seems to aggregate the chunks and hold them in memory instead of writing straight to disk?
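
As a diagnostic, one way to see whether the individual batches (rather than the whole table) are being materialized is to log resident memory from inside the batcher. This is only a sketch, not part of dbt-duckdb: it assumes the optional psutil package is installed, and the print statement is just illustrative instrumentation.

import os
import psutil
import pyarrow as pa

def batcher(batch_reader: pa.RecordBatchReader):
    proc = psutil.Process(os.getpid())  # current Python process
    for i, batch in enumerate(batch_reader):
        df = batch.to_pandas()
        # ...transform df as before...
        # Log resident memory after each chunk; a steady climb suggests the
        # batches (or the downstream write) are accumulating in memory rather
        # than being flushed incrementally.
        print(f"batch {i}: rss={proc.memory_info().rss / 1e9:.2f} GB")
        yield pa.RecordBatch.from_pandas(df)

If the RSS grows per batch, the accumulation is happening during iteration; if it jumps in one large step at the end, the final write is more likely the culprit.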

kylejbrk · Feb 06 '25 15:02