`MERGE_ADJACENT_FILES` causes a segfault
What happens?
Thanks for the ducklake project! I'm incredibly excited and bullish about it, and have been trying to use it actively in my stack.
I've been running a ducklake since v0.2, and have accumulated an enormous number of extremely small files (597,544 files of 1-100 rows each). My setup is:
- Catalog in postgres
- Storage in GCS (sometimes I go through the gs:// interface, sometimes via a volume mount)
I was unable to run the merge for a while as I was waiting on #369, which is how I accumulated so many files.
When I try to run the merge now, memory usage climbs to about 90GB, and after about 9 hours of runtime I get a SEGFAULT:
Runner segmentation fault (SIGSEGV), exit code: 139.
I turned on logging but did not get any logs during the merge. I'm running this on Modal, connecting to GCS via their cloud bucket mounts. Here are the stats I see before failure.
I'm not totally sure how to create a reproducible example here, which I apologize for. I opened a thread in Discord and it was suggested I bring it to GitHub.
Thank you for any help/support you have!
To Reproduce
This is how I ran the merge
import duckdb
DUCKLAKE_NAME = "ducklake_catalog"
conn = duckdb.connect()
conn.execute("INSTALL ducklake;")
conn.execute("INSTALL postgres;")
conn.execute("UPDATE EXTENSIONS")
conn.execute(f"""
CREATE SECRET gcs_secret (
TYPE gcs,
KEY_ID '{env.GOOGLE_ACCESS_KEY_ID}',
SECRET '{env.GOOGLE_ACCESS_KEY_SECRET}'
)
""")
data_path_root = f"/{gcs_volume_mount_dir}"
cmd = f"""
ATTACH 'ducklake:postgres:host={HOST} port={PORT} dbname=postgres user={USER} password={PASSWORD} sslmode={SSLMODE}'
AS {DUCKLAKE_NAME}
(DATA_PATH '{data_path_root}/{DUCKLAKE_NAME}', METADATA_SCHEMA {DUCKLAKE_NAME}, OVERRIDE_DATA_PATH true);
"""
conn.execute("CALL enable_logging();")
conn.execute("CALL enable_logging(level='info');")
conn.execute("CALL merge_adjacent_files()")
OS:
Debian GNU/Linux 12 (bookworm)
DuckDB Version:
1.4.0
DuckLake Version:
45788f0
DuckDB Client:
python
Hardware:
No response
Full Name:
Ben Epstein
Affiliation:
Grotto, Inc
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
No - I cannot share the data sets because they are confidential
Did you include all code required to reproduce the issue?
- [ ] Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- [x] Yes, I have
It seems to me that DuckLake is not merging the many small files in a chunked (partitioned) approach. I would expect DuckLake to fetch no more data than it can merge into a new Parquet file without exceeding the target data file size. If memory consumption reaches 90 GB, we're well beyond the target data file size.
Have you tried setting target_file_size to something small (i.e. the size of 100 files combined)? It might work if you run the merge multiple times, slowly increasing the size, so it doesn't try to combine too many files at once. https://ducklake.select/docs/stable/specification/tables/ducklake_metadata
You should also be able to see the new files getting written in the bucket as it works. It writes them first, then commits the metadata changes at the end.
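A minimal sketch of that multi-pass idea, assuming the same ATTACH statement as in the repro above, and assuming target_file_size can be changed through the catalog's set_option call (exact call shapes may differ by DuckLake version); the value is passed as an integer byte count:

```python
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL ducklake;")
conn.execute("INSTALL postgres;")

# Placeholder: the same GCS secret and ATTACH ... AS ducklake_catalog statement
# as in the repro script above.
ATTACH_SQL = "..."
conn.execute(ATTACH_SQL)

# Start with a small target (roughly 100 of the tiny files combined) and double it
# each pass, so no single pass tries to combine too many files at once.
target_bytes = 16 * 1024 * 1024          # 16 MiB, passed as an integer byte count
MAX_TARGET_BYTES = 512 * 1024 * 1024     # DuckLake's documented 512MB default

while target_bytes <= MAX_TARGET_BYTES:
    conn.execute(f"CALL ducklake_catalog.set_option('target_file_size', {target_bytes})")
    conn.execute("CALL ducklake_catalog.merge_adjacent_files()")
    target_bytes *= 2
```

Since each pass commits its own metadata changes at the end, the work done by earlier passes should survive even if a later pass fails.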
I am also facing this issue. ducklake_merge_adjacent_files runs for ~30 min with ~1.2 GB of RAM usage, then memory starts growing to 10 GB+ and I get an OOM.
Hi @aclowes, my apologies, I wasn't getting notifications on this open issue for some reason.
I will try that and see if progress is being made. I ran it ~6 times before opening this issue, but I didn't think to check for progress.
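A rough way to check for progress between runs (a sketch assuming the GCS secret from the repro is loaded; `<bucket>` is a placeholder for the real bucket name in DATA_PATH) is to count the Parquet files under the data path with DuckDB's glob table function, since newly merged files land in the bucket before the metadata commit at the end:

```python
# Count Parquet files under the DuckLake data path; the count growing between
# merge runs indicates new (merged) files are being written.
n_files = conn.execute(
    "SELECT count(*) FROM glob('gs://<bucket>/ducklake_catalog/**/*.parquet')"
).fetchone()[0]
print(f"parquet files under the data path: {n_files}")
```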
By the way @aclowes, it might be useful to update the docs here - it says the default for target_file_size is 512MB, but the value seems to require an integer number of bytes (otherwise you get an error like InvalidInputException: Invalid Input Error: Failed to cast value: Could not convert string '10MB' to UINT64).
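For anyone else hitting that cast error, the two forms side by side (a sketch assuming the option is set via the catalog's set_option call; the same byte-count requirement applies wherever the value is supplied):

```python
# Fails with the cast error above: the value is cast to UINT64, not parsed as a size string.
# conn.execute("CALL ducklake_catalog.set_option('target_file_size', '10MB')")

# Works: pass the size as a plain integer number of bytes (10 MiB here).
conn.execute(f"CALL ducklake_catalog.set_option('target_file_size', {10 * 1024 * 1024})")
```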
> Have you tried setting target_file_size to something small (i.e. the size of 100 files combined)? It might work if you run the merge multiple times, slowly increasing the size, so it doesn't try to combine too many files at once. https://ducklake.select/docs/stable/specification/tables/ducklake_metadata
> You should also be able to see the new files getting written in the bucket as it works. It writes them first, then commits the metadata changes at the end.
I ended up getting this error:
Invalid Input Error: Attempting to fetch from an unsuccessful query result
Error: TransactionContext Error: Failed to commit: Failed to execute query "ROLLBACK":
But no other logs. Is there a debug setting I can turn on?
EDIT: Found this in stdout
python3(54834,0x16df9b000) malloc: Failed to allocate segment from range group - out of space
Error in watcher: {"exception_type":"Invalid","exception_message":"Failed to query most recent snapshot for DuckLake: Failed to prepare COPY \"COPY (SELECT \"snapshot_id\", \"schema_version\", \"next_catalog_id\", \"next_file_id\" FROM (SELECT snapshot_id, schema_version, next_catalog_id, next_file_id\n\t\t FROM \"ducklake_catalog\".ducklake_snapshot WHERE snapshot_id = (\n\t\t SELECT MAX(snapshot_id) FROM \"ducklake_catalog\".ducklake_snapshot\n\t\t )) AS __unnamed_subquery ) TO STDOUT (FORMAT \"binary\");\":
Hi @Ben-Epstein and others, starting with DuckDB v1.4.3, MERGE_ADJACENT_FILES will have an incremental option for scenarios where many files must be merged.
See: https://github.com/duckdb/ducklake/pull/592