merge_adjacent_files not compacting a table with lots of single row files
What happens?
Some of the DuckLake tables in our application do not appear to get compacted when we call ducklake_merge_adjacent_files.
One of the tables we have been testing with contains about 2838 rows. Looking through the ducklake_data_files table and filtering by that table's table_id, we see 2555 data files linked to the table, most of them containing only 1 row; the notable exception is the initial file, which contained 157 rows. There are no entries in ducklake_deleted_files associated with the table, and none of the existing files have any lightweight snapshot entries in the partial_file_info column. We also do not have a custom target_file_size set. Yet the table refuses to compact, and as a result queries against it experience significant slowdowns.
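For reference, the per-file breakdown above came from a query along these lines (a sketch: the metadata catalog and schema names match our setup, and the table_id is a placeholder; adjust both for your environment):
SELECT data_file_id, begin_snapshot, record_count, partial_file_info, path
FROM __ducklake_metadata_my_ducklake.my_ducklake.ducklake_data_file
WHERE table_id = 42          -- placeholder: the table_id of the affected table
  AND end_snapshot IS NULL   -- only files that are still live
ORDER BY begin_snapshot;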
I have attempted to reproduce the issue but have not managed to so far. I created tables with approximately the same number of rows and a similar distribution of rows across files, but while I have sometimes seen a table compact only "partially" (in the sense that a handful of single-row files survive the compaction), I have not been able to reproduce a case where no files get compacted at all.
I have attached the rows of the ducklake_data_files table that pertain to the table that we investigated.
To Reproduce
Unfortunately I do not have a fully reproducible example at this point, but I am still trying to get one and will update here if I succeed.
OS:
macOS 15.5
DuckDB Version:
1.4.1
DuckLake Version:
f134ad8
DuckDB Client:
Python
Hardware:
No response
Full Name:
Oliver Hsu
Affiliation:
Ascend.io
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
No - Other reason (please specify in the issue body)
Did you include all code required to reproduce the issue?
- [ ] Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- [x] Yes, I have
Thanks for opening this issue in the DuckLake issue tracker! To resolve this issue, our team needs a reproducible example. This includes:
- A source code snippet which reproduces the issue.
- The snippet should be self-contained, i.e., it should contain all imports and should use relative paths instead of hard-coded paths (please avoid /Users/JohnDoe/...).
- A lot of issues can be reproduced with plain SQL code executed in the DuckDB command line client. If you can provide such an example, it greatly simplifies the reproduction process and likely results in a faster fix.
- If the script needs additional data, please share the data as a CSV, JSON, or Parquet file. Unfortunately, we cannot fix issues that can only be reproduced with a confidential data set. Support contracts allow sharing confidential data with the core DuckDB team under NDA.
For more detailed guidelines on how to create reproducible examples, please visit Stack Overflow's “Minimal, Reproducible Example” page.
Potentially related to https://github.com/duckdb/duckdb/issues/19552. I'm going to tag this as needing a reproducer, but let's see if this is fixed in 1.4.2.
Hello, I believe I may have found the source of this issue. Calling the compaction routine has no effect on our table because all of its data files are associated with different schema versions (their begin_snapshot values map to different schema versions in the ducklake_schema_versions table). Per the code in ducklake_compaction_functions.cpp, having different schema versions renders two files "incompatible" for the purposes of compaction, i.e., they will not be grouped together as potential compaction candidates: since the schema version is a field of the DuckLakeCompactionGroup object used as the key in the candidates map, no two of these files ever share a group.
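To check this on an affected table, a query along these lines shows each live data file mapping to a different schema version (a sketch: it assumes the per-snapshot schema_version is available from the ducklake_snapshot metadata table, and uses a placeholder table_id):
SELECT df.data_file_id, df.begin_snapshot, s.schema_version
FROM __ducklake_metadata_my_ducklake.my_ducklake.ducklake_data_file df
JOIN __ducklake_metadata_my_ducklake.my_ducklake.ducklake_snapshot s
  ON s.snapshot_id = df.begin_snapshot
WHERE df.table_id = 42        -- placeholder table_id
  AND df.end_snapshot IS NULL
ORDER BY df.begin_snapshot;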
I was also able to recreate this behavior by creating a table and making X small insertions into it (thus creating X data files); compacting the table afterwards worked as expected and produced a single data file holding all the data. Then, in a second run, I added a CREATE TABLE statement (creating other tables with random names, unrelated to the table being populated) in between each insertion; this resulted in the compaction making no changes to the data files at all.
Can you confirm that this is the intended behavior? If it is, do you have any recommendations for keeping a table compacted in an environment where table additions/modifications/deletions are expected to happen frequently in between data insertions? Should we run the compaction command on the table every time an insertion is made? (I'm aware of the recent addition of inlining support for PostgreSQL; that is one potential solution we're currently evaluating.)
Could you provide the reproducer then? I understand what you are saying, and if we decide to open up the possibility of merging files with different schemas we will need a nice test case, so a repro would help!
Here's a script that demonstrates the issue. With the trigger_schema_version_change_between_inserts flag at the top set to False, the script completes successfully; if it is set to True, the final assertion that the data file count == 1 fails because the table does not get compacted.
import duckdb
from pathlib import Path
# ============================================================================
# CONFIGURATION - MODIFY THESE VALUES FOR YOUR SETUP
# ============================================================================
postgres_config = {
    "host": "localhost",
    "port": 5432,
    "database": "ascend-core",
    "user": "ascend-core",
    "password": "ascend-core",
}
data_path = "~/.ascend/ducklake_local"
workspace_name = "my_ducklake"
metadata_schema = "my_ducklake"
trigger_schema_version_change_between_inserts = False
# ============================================================================
# SETUP STORAGE PATH
# ============================================================================
storage_path = Path(data_path).expanduser().absolute()
storage_path.mkdir(parents=True, exist_ok=True)
# ============================================================================
# CREATE DUCKDB CONNECTION
# ============================================================================
conn = duckdb.connect(":memory:")
# Set DuckDB configuration
conn.execute("""
SET threads = 10;
PRAGMA enable_object_cache = false;
SET preserve_insertion_order = false;
SET allocator_background_threads = true;
""")
# ============================================================================
# INSTALL AND LOAD EXTENSIONS
# ============================================================================
conn.execute("FORCE INSTALL ducklake FROM core_nightly;")
conn.execute("INSTALL postgres;")
conn.execute("LOAD ducklake;")
conn.execute("LOAD postgres;")
# ============================================================================
# CREATE POSTGRES SECRET FOR METADATA
# ============================================================================
conn.execute(f"""
CREATE SECRET IF NOT EXISTS ascend_postgres_metadata (
TYPE postgres,
HOST '{postgres_config["host"]}',
PORT {postgres_config["port"]},
DATABASE '{postgres_config["database"]}',
USER '{postgres_config["user"]}',
PASSWORD '{postgres_config["password"]}'
);
""")
# ============================================================================
# CREATE DUCKLAKE SECRET
# ============================================================================
conn.execute(f"""
CREATE SECRET IF NOT EXISTS ascend_ducklake_main (
TYPE ducklake,
METADATA_PATH '',
METADATA_SCHEMA '{metadata_schema}',
DATA_PATH '{storage_path}/',
METADATA_PARAMETERS map {{
'TYPE': 'postgres',
'SECRET': 'ascend_postgres_metadata'
}}
);
""")
# ============================================================================
# ATTACH DUCKLAKE CATALOG
# ============================================================================
conn.execute(f'ATTACH IF NOT EXISTS \'ducklake:ascend_ducklake_main\' AS {workspace_name};')
# Use the Ducklake catalog as default
conn.execute(f'USE {workspace_name};')
# ============================================================================
# CREATE TABLE WITH 10 BATCHES OF DATA
# ============================================================================
# Create a simple test table
conn.execute("""
CREATE OR REPLACE TABLE example_table AS
SELECT
1 AS id
""")
for i in range(2, 11):
    if trigger_schema_version_change_between_inserts:
        conn.execute("CREATE OR REPLACE TABLE some_other_table AS SELECT 'foo' as foo")
    conn.execute(f"""
        INSERT INTO example_table (id)
        VALUES ({i})
    """)
result = conn.execute("SELECT COUNT(*) AS row_count FROM example_table")
row_count = result.fetchone()[0]
assert row_count == 10, f"Table should contain 10 rows but has {row_count}"
# Verify there are 10 data files for the table
metadata_catalog = f"__ducklake_metadata_{workspace_name}.{metadata_schema}"
file_count = conn.execute(f"""
SELECT COUNT(*) AS file_count
FROM {metadata_catalog}.ducklake_data_file df
INNER JOIN {metadata_catalog}.ducklake_table t ON df.table_id = t.table_id
WHERE t.table_name = 'example_table'
AND df.end_snapshot IS NULL
AND t.end_snapshot IS NULL
""").fetchone()[0]
assert file_count == 10, f"Before compaction, table should have 10 data files but has {file_count}"
# ============================================================================
# COMPACT THE TABLE
# ============================================================================
conn.execute(f"CALL ducklake_merge_adjacent_files('{workspace_name}','example_table');")
new_file_count = conn.execute(f"""
SELECT COUNT(*) AS file_count
FROM {metadata_catalog}.ducklake_data_file df
INNER JOIN {metadata_catalog}.ducklake_table t ON df.table_id = t.table_id
WHERE t.table_name = 'example_table'
AND df.end_snapshot IS NULL
AND t.end_snapshot IS NULL
""").fetchone()[0]
assert new_file_count == 1, f"After compaction, table should have 1 data file but has {new_file_count}"
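For convenience, here is a rough plain-SQL distillation of the same reproduction for the DuckDB command line client. It is a sketch that uses a local DuckDB file as the metadata catalog instead of Postgres (so the metadata schema is main rather than my_ducklake); I have not re-verified it against the Postgres-backed setup above:
INSTALL ducklake;
LOAD ducklake;
ATTACH 'ducklake:metadata.ducklake' AS my_ducklake (DATA_PATH 'ducklake_data/');
USE my_ducklake;
CREATE OR REPLACE TABLE example_table AS SELECT 1 AS id;
-- Each unrelated CREATE TABLE below bumps the schema version between the single-row inserts
CREATE OR REPLACE TABLE some_other_table AS SELECT 'foo' AS foo;
INSERT INTO example_table VALUES (2);
CREATE OR REPLACE TABLE some_other_table AS SELECT 'foo' AS foo;
INSERT INTO example_table VALUES (3);
CALL ducklake_merge_adjacent_files('my_ducklake', 'example_table');
-- Without the schema-version bumps this returns 1; with them, all 3 single-row files remain
SELECT COUNT(*) AS file_count
FROM __ducklake_metadata_my_ducklake.main.ducklake_data_file df
JOIN __ducklake_metadata_my_ducklake.main.ducklake_table t USING (table_id)
WHERE t.table_name = 'example_table'
  AND df.end_snapshot IS NULL
  AND t.end_snapshot IS NULL;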