`MERGE_ADJACENT_FILES` causes a segfault
What happens?
Thanks for the ducklake project! I'm incredibly excited and bullish about it, and have been trying to use it actively in my stack.
I've been running a ducklake since v0.2, and have accumulated an enormous number of extremely small files (597,544 files of 1-100 rows each). My setup is:
- Catalog in postgres
- Storage in GCS (sometimes I go through the gs:// interface, sometimes via a volume mount)
I was unable to run the merge for a while as I was waiting on #369, which is how I accumulated so many files.
When I try to run the merge now, memory usage climbs to about 90GB, and after about 9 hours of runtime I get a SEGFAULT:
Runner segmentation fault (SIGSEGV), exit code: 139.
I turned on logging but did not get any logs during the merge. I'm running this on Modal, connecting to GCS via their cloud bucket mounts. Here are the stats I see before failure.
I'm not totally sure how to create a reproducible example here, which I apologize for. I opened a thread in Discord and it was suggested I bring it to GitHub.
Thank you for any help/support you have!
To Reproduce
This is how I ran the merge
import duckdb
DUCKLAKE_NAME = "ducklake_catalog"
conn = duckdb.connect()
conn.execute("INSTALL ducklake;")
conn.execute("INSTALL postgres;")
conn.execute("UPDATE EXTENSIONS")
conn.execute(f"""
CREATE SECRET gcs_secret (
TYPE gcs,
KEY_ID '{env.GOOGLE_ACCESS_KEY_ID}',
SECRET '{env.GOOGLE_ACCESS_KEY_SECRET}'
)
""")
data_path_root = f"/{gcs_volume_mount_dir}"
cmd = f"""
ATTACH 'ducklake:postgres:host={HOST} port={PORT} dbname=postgres user={USER} password={PASSWORD} sslmode={SSLMODE}'
AS {DUCKLAKE_NAME}
(DATA_PATH '{data_path_root}/{DUCKLAKE_NAME}', METADATA_SCHEMA {DUCKLAKE_NAME}, OVERRIDE_DATA_PATH true);
"""
conn.execute("CALL enable_logging();")
conn.execute("CALL enable_logging(level='info');")
conn.execute("CALL merge_adjacent_files()")
OS:
Debian GNU/Linux 12 (bookworm)
DuckDB Version:
1.4.0
DuckLake Version:
45788f0
DuckDB Client:
python
Hardware:
No response
Full Name:
Ben Epstein
Affiliation:
Grotto, Inc
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
No - I cannot share the data sets because they are confidential
Did you include all code required to reproduce the issue?
- [ ] Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- [x] Yes, I have
It seems to me that DuckLake is not merging the many small files in a chunked (partitioned) approach. I would expect DuckLake to fetch no more data than it can merge into a new Parquet file without exceeding the target data file size. If memory consumption reaches 90 GB, we're well beyond the target data file size.
Have you tried setting target_file_size to something small (i.e. the size of 100 files combined)? It might work if you run the merge multiple times, slowly increasing the size, so it doesn't try to combine too many files at once. https://ducklake.select/docs/stable/specification/tables/ducklake_metadata
You should also be able to see the new files getting written in the bucket as it works. It writes them first, then commits the metadata changes at the end.
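A minimal sketch of that multi-pass idea, assuming the same ATTACH statement as in the repro above, and assuming target_file_size can be changed through the catalog's set_option call (exact call shapes may differ by DuckLake version); the value is passed as an integer byte count:

```python
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL ducklake;")
conn.execute("INSTALL postgres;")

# Placeholder: the same GCS secret and ATTACH ... AS ducklake_catalog statement
# as in the repro script above.
ATTACH_SQL = "..."
conn.execute(ATTACH_SQL)

# Start with a small target (roughly 100 of the tiny files combined) and double it
# each pass, so no single pass tries to combine too many files at once.
target_bytes = 16 * 1024 * 1024          # 16 MiB, passed as an integer byte count
MAX_TARGET_BYTES = 512 * 1024 * 1024     # DuckLake's documented 512MB default

while target_bytes <= MAX_TARGET_BYTES:
    conn.execute(f"CALL ducklake_catalog.set_option('target_file_size', {target_bytes})")
    conn.execute("CALL ducklake_catalog.merge_adjacent_files()")
    target_bytes *= 2
```

Since each pass commits its own metadata changes at the end, the work done by earlier passes should survive even if a later pass fails.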
I am also facing this issue. ducklake_merge_adjacent_files runs for ~30 min with ~1.2 GB of RAM usage, then memory starts growing to 10 GB+ and I get an OOM.
Hi @aclowes, my apologies, I wasn't getting notifications on this open issue for some reason.
I will try that and see if progress is being made. I ran it ~6 times before opening this issue, but I didn't think to check for progress.
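A rough way to check for progress between runs (a sketch assuming the GCS secret from the repro is loaded; `<bucket>` is a placeholder for the real bucket name in DATA_PATH) is to count the Parquet files under the data path with DuckDB's glob table function, since newly merged files land in the bucket before the metadata commit at the end:

```python
# Count Parquet files under the DuckLake data path; the count growing between
# merge runs indicates new (merged) files are being written.
n_files = conn.execute(
    "SELECT count(*) FROM glob('gs://<bucket>/ducklake_catalog/**/*.parquet')"
).fetchone()[0]
print(f"parquet files under the data path: {n_files}")
```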
By the way @aclowes, it might be useful to update the docs here - it says the default for target_file_size is 512MB, but the value seems to require an integer number of bytes (otherwise you get an error like InvalidInputException: Invalid Input Error: Failed to cast value: Could not convert string '10MB' to UINT64).
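For anyone else hitting that cast error, the two forms side by side (a sketch assuming the option is set via the catalog's set_option call; the same byte-count requirement applies wherever the value is supplied):

```python
# Fails with the cast error above: the value is cast to UINT64, not parsed as a size string.
# conn.execute("CALL ducklake_catalog.set_option('target_file_size', '10MB')")

# Works: pass the size as a plain integer number of bytes (10 MiB here).
conn.execute(f"CALL ducklake_catalog.set_option('target_file_size', {10 * 1024 * 1024})")
```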
> Have you tried setting target_file_size to something small (i.e. the size of 100 files combined)? It might work if you run the merge multiple times, slowly increasing the size, so it doesn't try to combine too many files at once. https://ducklake.select/docs/stable/specification/tables/ducklake_metadata
> You should also be able to see the new files getting written in the bucket as it works. It writes them first, then commits the metadata changes at the end.
I ended up getting this error:
Invalid Input Error: Attempting to fetch from an unsuccessful query result
Error: TransactionContext Error: Failed to commit: Failed to execute query "ROLLBACK":
But no other logs. Is there a debug setting I can turn on?
EDIT: Found this in stdout
python3(54834,0x16df9b000) malloc: Failed to allocate segment from range group - out of space
Error in watcher: {"exception_type":"Invalid","exception_message":"Failed to query most recent snapshot for DuckLake: Failed to prepare COPY \"COPY (SELECT \"snapshot_id\", \"schema_version\", \"next_catalog_id\", \"next_file_id\" FROM (SELECT snapshot_id, schema_version, next_catalog_id, next_file_id\n\t\t FROM \"ducklake_catalog\".ducklake_snapshot WHERE snapshot_id = (\n\t\t SELECT MAX(snapshot_id) FROM \"ducklake_catalog\".ducklake_snapshot\n\t\t )) AS __unnamed_subquery ) TO STDOUT (FORMAT \"binary\");\":
Hi @Ben-Epstein and others, starting with DuckDB v1.4.3, MERGE_ADJACENT_FILES will have an incremental option for scenarios where many files must be merged.
See: https://github.com/duckdb/ducklake/pull/592