Ducklake insertion race condition error: Failed to copy data: ERROR: duplicate key value violates unique constraint "ducklake_data_file_pkey"
What happens?
I consistently hit the following error when I have multiple processes inserting into DuckLake:
TransactionContext Error: Failed to commit: Failed to commit DuckLake transaction.
Exceeded the maximum retry count of 10 set by the ducklake_max_retry_count setting.
. Consider increasing the value with: e.g., "SET ducklake_max_retry_count = 100;"
Failed to write data file information to DuckLake: Failed to copy data: ERROR: duplicate key value violates unique constraint "ducklake_data_file_pkey"
DETAIL: Key (data_file_id)=(1553) already exists.
CONTEXT: COPY ducklake_data_file, line 1
My current setup
- Postgres as the catalog database
- a local directory as the data path (the same issue occurs when using S3)
- multiple instances of the same insert code/process running
The error is more common when there is higher latency to the Postgres endpoint; when Postgres is on localhost, the issue does not appear.
INSTALL ducklake;
LOAD ducklake;
INSTALL postgres;
LOAD postgres;
ATTACH 'ducklake:postgres:dbname=...' AS warehouse (
DATA_PATH '.data/'
);
USE warehouse;
SQL_INSERT_DF = "INSERT INTO warehouse.{table} SELECT * FROM df;"

def _bulk_insert_as_df(self, table: str, items: list):
    if not items:
        return
    df = pd.DataFrame([item.model_dump() for item in items])
    cursor = self.db.cursor()
    cursor.execute(SQL_INSERT_DF.format(table=table))
    cursor.close()
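(For context, self.db in the snippet above is a plain DuckDB connection on which the ATTACH/USE statements were executed; roughly along the lines of the sketch below. The helper name is illustrative, not the actual code; only the ATTACH/USE statements mirror the setup shown above.)

```python
import duckdb

# Illustrative sketch of how the self.db connection is set up; the function
# name and signature are placeholders, the statements mirror the setup above.
def make_warehouse_connection(pg_dsn: str, data_path: str = ".data/") -> duckdb.DuckDBPyConnection:
    con = duckdb.connect()
    for stmt in (
        "INSTALL ducklake",
        "LOAD ducklake",
        "INSTALL postgres",
        "LOAD postgres",
        f"ATTACH 'ducklake:postgres:{pg_dsn}' AS warehouse (DATA_PATH '{data_path}')",
        "USE warehouse",
    ):
        con.execute(stmt)
    return con
```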
To Reproduce
-- SETUP
INSTALL ducklake;
LOAD ducklake;
INSTALL postgres;
LOAD postgres;
ATTACH 'ducklake:postgres:dbname=catalogdb' AS warehouse (
DATA_PATH '.data/'
);
USE warehouse;
CREATE TABLE IF NOT EXISTS bean_cores (
url VARCHAR NOT NULL,
kind VARCHAR NOT NULL,
title VARCHAR,
title_length UINT16,
summary TEXT,
summary_length UINT16,
content TEXT,
content_length UINT16,
restricted_content BOOLEAN, -- 0 for False, 1 for True
author VARCHAR,
source VARCHAR NOT NULL,
image_url VARCHAR,
created TIMESTAMP NOT NULL, -- ISO format datetime string
collected TIMESTAMP NOT NULL -- ISO format datetime string
);
CREATE TABLE IF NOT EXISTS bean_embeddings (
url VARCHAR NOT NULL, -- Foreign key to Bean.url
embedding FLOAT[] NOT NULL
);
CREATE TABLE IF NOT EXISTS bean_gists (
url VARCHAR NOT NULL, -- Foreign key to Bean.url
gist TEXT NOT NULL,
regions VARCHAR[],
entities VARCHAR[]
);
SQL_INSERT_DF = "INSERT INTO warehouse.{table} SELECT * FROM df;"

def _bulk_insert_as_df(self, table: str, items: list):
    if not items:
        return
    df = pd.DataFrame([item.model_dump() for item in items])
    cursor = self.db.cursor()
    cursor.execute(SQL_INSERT_DF.format(table=table))
    cursor.close()
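(To make the concurrency part concrete, a minimal multiprocessing driver for the insert path above could look like the sketch below. The connection string, worker count, batch sizes, and DataFrame contents are placeholders shaped like the bean_gists table; each process attaches independently and inserts batches in a loop.)

```python
import multiprocessing as mp

import duckdb
import pandas as pd

PG_DSN = "dbname=catalogdb"  # placeholder: point this at the remote Postgres catalog

def writer(worker_id: int, batches: int = 50) -> None:
    con = duckdb.connect()
    for stmt in (
        "INSTALL ducklake", "LOAD ducklake",
        "INSTALL postgres", "LOAD postgres",
        f"ATTACH 'ducklake:postgres:{PG_DSN}' AS warehouse (DATA_PATH '.data/')",
        "USE warehouse",
    ):
        con.execute(stmt)
    for i in range(batches):
        # Arbitrary test rows shaped like the bean_gists table above.
        df = pd.DataFrame({
            "url": [f"https://example.com/{worker_id}/{i}/{j}" for j in range(100)],
            "gist": ["some text"] * 100,
            "regions": [["us"]] * 100,
            "entities": [["duckdb"]] * 100,
        })
        con.execute("INSERT INTO warehouse.bean_gists SELECT * FROM df")

if __name__ == "__main__":
    procs = [mp.Process(target=writer, args=(w,)) for w in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Given the latency observation above, the conflict shows up mainly when the Postgres catalog is remote, so running this against a localhost catalog may not reproduce it.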
OS:
Ubuntu 22.04
DuckDB Version:
1.4.0
DuckLake Version:
0.2
DuckDB Client:
Python
Hardware:
4 vCPU, 8 GB RAM
Full Name:
Soumit S Rahman
Affiliation:
cafecito.tech
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- [x] Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- [x] Yes, I have
Hi @soumitsalman, can you check whether you are indeed using DuckLake 0.2? DuckLake 0.3 should be the correct version for DuckDB 1.4 (I would expect 0.2 to be incompatible).
On another note, have you tried setting ducklake_max_retry_count to a higher number, as recommended in the trace?
My bad, the version is indeed 0.3.
Raising ducklake_max_retry_count does not solve the problem, and it is not going to. The issue appears to be a race condition in how DuckLake names the Parquet files it writes on insert. With multiple processes writing, unless the Postgres instance is on the same machine or in the same region of the same cloud provider (essentially a question of network latency), increasing ducklake_max_retry_count does not help, because DuckLake retries with the same file name, so the conflict persists. I have tried adding a retry around the function in my own code when the exception is raised: DuckLake treats each retry as a new insert and therefore generates a new file name, which reduces the number of net failures in my code but not how many times DuckLake itself fails. My suggestion is that the fix could live in DuckLake's insertion logic: if the same file name is already taken (or a transaction error like this occurs), switch to a different file name.
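(For reference, the application-side retry described above boils down to something like the rough sketch below; the function name, retry count, and backoff are illustrative, not the exact code.)

```python
import time

import duckdb
import pandas as pd

def insert_df_with_retry(con, table: str, df: pd.DataFrame, retries: int = 5) -> None:
    # Register the DataFrame under a fixed name so the SQL does not rely on
    # DuckDB finding a local variable called df in the caller's scope.
    con.register("df_to_insert", df)
    try:
        for attempt in range(retries):
            try:
                # Each attempt is a fresh transaction, so DuckLake treats it as
                # a new insert and writes a new data file.
                con.execute(f"INSERT INTO warehouse.{table} SELECT * FROM df_to_insert")
                return
            except duckdb.Error:
                if attempt == retries - 1:
                    raise
                time.sleep(0.5 * (attempt + 1))  # crude linear backoff
    finally:
        con.unregister("df_to_insert")
```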
It is actually not the file name but the data_file_id; here is an explanation of how it is generated: https://ducklake.select/docs/stable/specification/tables/ducklake_data_file. This key always needs to be taken from next_file_id in order to keep snapshots consistent. If you just select a random key, you lose this consistency.
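(If it helps with debugging, the allocation described in the spec can be observed directly in the Postgres catalog. A rough sketch, assuming the catalog tables live in the default public schema and follow the linked specification, with next_file_id recorded on the snapshot table:)

```python
import duckdb

# Debugging sketch only: peek at the DuckLake catalog tables in Postgres and
# compare the highest committed data_file_id with the next_file_id recorded
# on the latest snapshot. Table/column names follow the DuckLake spec linked
# above; the Postgres schema ("public" here) may differ in your deployment.
con = duckdb.connect()
con.execute("INSTALL postgres")
con.execute("LOAD postgres")
con.execute("ATTACH 'dbname=catalogdb' AS pgcat (TYPE postgres, READ_ONLY)")

row = con.execute("""
    SELECT
        (SELECT max(data_file_id) FROM pgcat.public.ducklake_data_file) AS max_data_file_id,
        (SELECT next_file_id
         FROM pgcat.public.ducklake_snapshot
         ORDER BY snapshot_id DESC
         LIMIT 1) AS latest_next_file_id
""").fetchone()
print(row)
```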
I'm happy to dig deeper if you can produce a more complete reproducible example (including the multiprocessing part), something I can easily run in my environment with minor tweaks (e.g., credentials for the remote Postgres server).
I am seeing exactly the same problem with a very similar setup. We also patched it with a retry on our side, but it has an impact on performance. I would add that our concurrent writes go to different tables, yet we observe the problem nevertheless.
Failed to commit: Failed to commit DuckLake transaction.
Exceeded the maximum retry count of 10 set by the ducklake_max_retry_count setting.
. Consider increasing the value with: e.g., "SET ducklake_max_retry_count = 100;"
Failed to write data file information to DuckLake: Failed to copy data: ERROR: duplicate key value violates unique constraint "ducklake_data_file_pkey"
DETAIL: Key (data_file_id)=(3330741) already exists.