
`ducklake_cleanup_old_files()` does not delete files when using external PostgreSQL catalog

Open gacolitti opened this issue 4 weeks ago • 3 comments

What happens?

Summary

When using DuckLake with an external PostgreSQL catalog and an explicit `DATA_PATH`, `ducklake_cleanup_old_files()` returns 0 and deletes nothing, even though files are correctly scheduled for deletion and the data path is correctly stored in `ducklake_metadata`.

Environment

  • DuckDB version: v1.4.2
  • DuckLake extension version: 77f2512 (from core_nightly)
  • OS: macOS 15.1 (arm64)
  • External catalog: PostgreSQL 14

To Reproduce

-- Step 1: Create a DuckLake catalog with external Postgres and explicit DATA_PATH
-- (Run in a fresh DuckDB session)

INSTALL ducklake FROM core_nightly;
LOAD ducklake;

-- Attach with external Postgres catalog and explicit DATA_PATH
ATTACH 'ducklake:postgres:dbname=ducklake_test host=localhost' AS test_lake 
  (DATA_PATH '/tmp/ducklake_test_data');

USE test_lake;

-- Step 2: Create a table and insert some data
CREATE TABLE test_table (id INTEGER, value VARCHAR);
INSERT INTO test_table VALUES (1, 'a'), (2, 'b'), (3, 'c');

-- Step 3: Perform operations that will schedule files for deletion
-- (e.g., DELETE or UPDATE operations, or merge_adjacent_files)
DELETE FROM test_table WHERE id = 1;
INSERT INTO test_table VALUES (4, 'd'), (5, 'e');

-- Step 4: Expire snapshots to schedule files for deletion
CALL ducklake_expire_snapshots('test_lake', older_than => now() - INTERVAL '0 seconds');

-- Step 5: Verify data_path is correctly stored
SELECT * FROM __ducklake_metadata_test_lake.ducklake_metadata;
-- Returns: data_path = '/tmp/ducklake_test_data/'

-- Step 6: Verify files are scheduled for deletion
SELECT COUNT(*) FROM __ducklake_metadata_test_lake.ducklake_files_scheduled_for_deletion;
-- Returns: N files scheduled

-- Step 7: Verify the scheduled files exist on disk
SELECT path FROM __ducklake_metadata_test_lake.ducklake_files_scheduled_for_deletion LIMIT 1;
-- Returns relative path like: main/test_table/ducklake-xxxxx.parquet
-- File exists at: /tmp/ducklake_test_data/main/test_table/ducklake-xxxxx.parquet ✓

-- Step 8: Attempt cleanup
CALL ducklake_cleanup_old_files('test_lake', dry_run => false, cleanup_all => true);
-- Returns: 0 (no files deleted)

-- Step 9: Verify files are STILL scheduled (not cleaned up)
SELECT COUNT(*) FROM __ducklake_metadata_test_lake.ducklake_files_scheduled_for_deletion;
-- Still returns: N files scheduled (unchanged)

OS:

macOS 15.1

DuckDB Version:

1.4.2

DuckLake Version:

77f2512 (from core_nightly)

DuckDB Client:

R

Hardware:

No response

Full Name:

Giovanni Colitti

Affiliation:

N/A

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • [x] Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • [x] Yes, I have

gacolitti avatar Nov 26 '25 15:11 gacolitti

If I run this example, there are no files scheduled for deletion. This example creates three files:

  • 1 for the first insert
  • 1 delete file associated with the first insert's file (for the DELETE)
  • 1 file for the second insert

If you expire old snapshots, all of these files are still needed to reconstruct the current state, so they should not be deleted. If you call rewrite data files and then compact, and there are still no files scheduled for deletion, then that would be a bug, I think.

Did you maybe mean something else with this repro?
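
For reference, a minimal sketch of the sequence described above, reusing the catalog from the original repro (`ducklake_rewrite_data_files` is the same rewrite entry point that appears later in this thread):

-- Rewrite the table's data files so the pre-rewrite files become obsolete
CALL ducklake_rewrite_data_files('test_lake', 'test_table');

-- Expire the snapshots that still referenced the obsolete files
CALL ducklake_expire_snapshots('test_lake', older_than => now());

-- The rewritten-away files should now show up here
SELECT COUNT(*)
FROM __ducklake_metadata_test_lake.ducklake_files_scheduled_for_deletion;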

guillesd avatar Dec 01 '25 14:12 guillesd

Thanks for opening this issue in the DuckLake issue tracker! To resolve this issue, our team needs a reproducible example. This includes:

  • A source code snippet which reproduces the issue.
  • The snippet should be self-contained, i.e., it should contain all imports and should use relative paths instead of hard coded paths (please avoid /Users/JohnDoe/...).
  • A lot of issues can be reproduced with plain SQL code executed in the DuckDB command line client. If you can provide such an example, it greatly simplifies the reproduction process and likely results in a faster fix.
  • If the script needs additional data, please share the data as a CSV, JSON, or Parquet file. Unfortunately, we cannot fix issues that can only be reproduced with a confidential data set. Support contracts allow sharing confidential data with the core DuckDB team under NDA.

For more detailed guidelines on how to create reproducible examples, please visit Stack Overflow's “Minimal, Reproducible Example” page.

duckdblabs-bot avatar Dec 01 '25 14:12 duckdblabs-bot

Yeah, thanks @guillesd. Sorry, I think the example above doesn't show it. But I still think there's an issue here.

Here is R code that shows what I'm talking about. It runs DuckLake against a Postgres-backed catalog with an explicit DATA_PATH, performs a few inserts/deletes plus rewrite + merge to create obsolete files, expires snapshots so that two files are scheduled for deletion, and then shows that ducklake_cleanup_old_files() leaves both files (and their metadata entries) untouched.

# Minimal reproducible example: ducklake_cleanup_old_files() bug with external Postgres catalog
# Bug: Files scheduled for deletion are not deleted when using Postgres catalog + DATA_PATH

# --- Setup ---
test_db_name <- "ducklake_cleanup_test"
test_data_path <- "/tmp/ducklake_cleanup_test_data"

# Clean slate
if (dir.exists(test_data_path)) unlink(test_data_path, recursive = TRUE)
dir.create(test_data_path, recursive = TRUE)

# Create fresh Postgres database
pg_con <- DBI::dbConnect(RPostgres::Postgres(), dbname = "postgres", host = "localhost")
DBI::dbExecute(pg_con, sprintf(
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = '%s'", test_db_name
))
#> [1] 0
try(DBI::dbExecute(pg_con, sprintf("DROP DATABASE IF EXISTS %s", test_db_name)), silent = TRUE)
#> [1] 0
DBI::dbExecute(pg_con, sprintf("CREATE DATABASE %s", test_db_name))
#> [1] 0
DBI::dbDisconnect(pg_con)

# --- Connect DuckLake with Postgres catalog ---
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")
DBI::dbExecute(con, "LOAD ducklake;")
#> [1] 0
DBI::dbExecute(con, sprintf(
  "ATTACH 'ducklake:postgres:dbname=%s host=localhost' AS lake (DATA_PATH '%s');",
  test_db_name, test_data_path
))
#> [1] 0

# --- Create data and make files obsolete ---
DBI::dbExecute(con, "CREATE TABLE lake.t (x INT);")
#> [1] 0
DBI::dbExecute(con, "INSERT INTO lake.t VALUES (1), (2), (3);") # file 1
#> [1] 3
DBI::dbExecute(con, "INSERT INTO lake.t VALUES (4), (5);") # file 2
#> [1] 2
DBI::dbExecute(con, "DELETE FROM lake.t WHERE x <= 2;") # delete file
#> [1] 2
DBI::dbExecute(con, "INSERT INTO lake.t VALUES (6), (7);") # file 3
#> [1] 2
DBI::dbExecute(con, "CALL ducklake_rewrite_data_files('lake', 't');") # consolidates -> makes old files obsolete
#> [1] 0
DBI::dbExecute(con, "CALL ducklake_merge_adjacent_files('lake');") # merges small files
#> [1] 0
DBI::dbExecute(con, "CALL ducklake_expire_snapshots('lake', older_than => now());")
#> [1] 0

# --- Check scheduled files ---
scheduled <- DBI::dbGetQuery(
  con,
  "SELECT * FROM __ducklake_metadata_lake.ducklake_files_scheduled_for_deletion"
)
cat("Files scheduled for deletion:", nrow(scheduled), "\n")
#> Files scheduled for deletion: 2
print(scheduled)
#>   data_file_id                                                         path
#> 1            1 main/t/ducklake-019adaa6-6a6c-75ff-9e2d-367a90fef0c1.parquet
#> 2            3 main/t/ducklake-019adaa6-6a7b-793c-9a3b-055486ebf149.parquet
#>   path_is_relative      schedule_start
#> 1             TRUE 2025-12-01 16:02:10
#> 2             TRUE 2025-12-01 16:02:10

# --- Verify files exist on disk ---
data_path <- DBI::dbGetQuery(
  con,
  "SELECT value FROM __ducklake_metadata_lake.ducklake_metadata WHERE key = 'data_path'"
)$value
cat("\nFiles exist on disk:\n")
#> 
#> Files exist on disk:
for (p in scheduled$path) {
  full <- file.path(data_path, p)
  cat("  ", p, "->", file.exists(full), "\n")
}
#>    main/t/ducklake-019adaa6-6a6c-75ff-9e2d-367a90fef0c1.parquet -> TRUE 
#>    main/t/ducklake-019adaa6-6a7b-793c-9a3b-055486ebf149.parquet -> TRUE

# --- Attempt cleanup ---
cat("\nDry run shows files to delete:\n")
#> 
#> Dry run shows files to delete:
print(DBI::dbGetQuery(con, "CALL ducklake_cleanup_old_files('lake', dry_run => true, cleanup_all => true);"))
#>                                                                                           path
#> 1 /tmp/ducklake_cleanup_test_data/main/t/ducklake-019adaa6-6a6c-75ff-9e2d-367a90fef0c1.parquet
#> 2 /tmp/ducklake_cleanup_test_data/main/t/ducklake-019adaa6-6a7b-793c-9a3b-055486ebf149.parquet

cat(
  "\nActual cleanup result:",
  DBI::dbExecute(con, "CALL ducklake_cleanup_old_files('lake', dry_run => false, cleanup_all => true);"),
  "\n"
)
#> 
#> Actual cleanup result: 0

# --- Verify bug ---
still_scheduled <- DBI::dbGetQuery(
  con,
  "SELECT COUNT(*) FROM __ducklake_metadata_lake.ducklake_files_scheduled_for_deletion"
)[[1]]
cat("\nFiles still scheduled after cleanup:", still_scheduled, "\n")
#> 
#> Files still scheduled after cleanup: 2
cat("Files still on disk:", length(list.files(test_data_path, recursive = TRUE)), "\n")
#> Files still on disk: 5

if (still_scheduled > 0) {
  cat("\n*** BUG: cleanup_old_files() did not delete scheduled files ***\n")
}
#> 
#> *** BUG: cleanup_old_files() did not delete scheduled files ***

DBI::dbDisconnect(con, shutdown = TRUE)

Created on 2025-12-01 with reprex v2.1.1

Here are just the SQL calls:

-- 1. Attach DuckLake catalog backed by Postgres + DATA_PATH
ATTACH 'ducklake:postgres:dbname=ducklake_cleanup_test host=localhost'
  AS lake (DATA_PATH '/tmp/ducklake_cleanup_test_data');
USE lake;

-- 2. Table + data churn to create obsolete files
CREATE TABLE t (x INT);

INSERT INTO t VALUES (1), (2), (3);        -- file 1
INSERT INTO t VALUES (4), (5);             -- file 2
DELETE FROM t WHERE x <= 2;                -- delete file
INSERT INTO t VALUES (6), (7);             -- file 3

CALL ducklake_rewrite_data_files('lake', 't');   -- consolidate
CALL ducklake_merge_adjacent_files('lake');      -- optional but mirrors repro
CALL ducklake_expire_snapshots('lake', older_than => now());

-- 3. Confirm files are scheduled for deletion
SELECT *
FROM __ducklake_metadata_lake.ducklake_files_scheduled_for_deletion;

-- 4. (Optional) confirm DATA_PATH and on-disk files
SELECT value
FROM __ducklake_metadata_lake.ducklake_metadata
WHERE key = 'data_path';

-- 5. Show cleanup dry-run identifies files
CALL ducklake_cleanup_old_files('lake', dry_run => true, cleanup_all => true);

-- 6. Actual cleanup (returns 0; bug)
CALL ducklake_cleanup_old_files('lake', dry_run => false, cleanup_all => true);

-- 7. Verify bug: files still scheduled / on disk
SELECT COUNT(*) AS files_still_scheduled
FROM __ducklake_metadata_lake.ducklake_files_scheduled_for_deletion;

-- Manually inspect `/tmp/ducklake_cleanup_test_data` to see the files remain.

gacolitti avatar Dec 01 '25 16:12 gacolitti

Thanks @gacolitti, this definitely also happens with DuckDB as the catalog. I don't think it's related, but the line CALL ducklake_expire_snapshots('lake', older_than => now()); doesn't do anything, since DuckLake ignores it to protect the last snapshot (at least that's what happens for me).
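
A quick way to see that behavior is to list the snapshots before and after the call (a sketch; this assumes the `ducklake_snapshot` catalog table is reachable under the same `__ducklake_metadata_lake` schema queried above):

-- Snapshots before expiry
SELECT snapshot_id, snapshot_time
FROM __ducklake_metadata_lake.ducklake_snapshot
ORDER BY snapshot_id;

CALL ducklake_expire_snapshots('lake', older_than => now());

-- The newest snapshot is always retained, so it should still be listed here
SELECT snapshot_id, snapshot_time
FROM __ducklake_metadata_lake.ducklake_snapshot
ORDER BY snapshot_id;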

guillesd avatar Dec 03 '25 08:12 guillesd

Thanks for the report, this issue should be fixed by: https://github.com/duckdb/ducklake/pull/596

pdet avatar Dec 04 '25 12:12 pdet