md5_number returns diff results between ducklake / duckdb
What happens?
When generating md5 hashes using md5_number, I am seeing duplicates when inserting into duckalke. If I run the same md5_number function outside of the insert it produces different results (not duplicated).
Additionally when experimenting I noticed this doesn't occur if:
- I insert directly to duckdb and not ducklake
- I change the
pkeycolumn definition fromuhugeinttovarchar
To Reproduce
Reproduction
attach if not exists 'ducklake:duckdb_database.ducklake' as lakehouse;
drop table if exists lakehouse.test;
create table if not exists lakehouse.test(
ticker varchar not null,
correction int not null,
sequence_number ubigint not null,
sip_timestamp ubigint not null,
pkey uhugeint not null,
);
insert into lakehouse.test
select
ticker,
correction,
sequence_number,
sip_timestamp,
md5_number(ticker || sequence_number || sip_timestamp || correction) as pkey
from read_csv_auto('/mnt/lake/sample.csv');
Diff MD5 values
.mode markdown
select *, md5_number(ticker || sequence_number || sip_timestamp || correction) as new_pkey
from lakehouse.test;
| ticker | correction | sequence_number | sip_timestamp | pkey | new_pkey |
|---|---|---|---|---|---|
| REMC | 0 | 1437375 | 1105647211000000000 | 144520619632052850271484061619402571776 | 144520619632052854388488937575290486394 |
| TASR | 0 | 294801 | 1105628688000000000 | 144520619632052850271484061619402571776 | 144520619632052853151872378296537267878 |
OS:
Ubuntu 24.04.3 LTS
DuckDB Version:
v1.4.0-dev3506 (Development Version) a034ddc280
DuckLake Version:
b389c95
DuckDB Client:
CLI, Python
Hardware:
No response
Full Name:
Alan Spires
Affiliation:
Hobby Project
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a source build
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- [x] Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- [x] Yes, I have
I will take a look today! Thanks @alanspires
Hi @alanspires I did reproduce this. I do wonder if this is related to the fact that DuckLake seems to support only up to u64, and uhugeint is u128. I will discuss this internally
Parquet files do not have (u)hugeint representation. By default duckdb stores them as doubles, which can cause precision loss. If you want to keep the exact value you must use a string.
D COPY (select md5_number(ticker || sequence_number || sip_timestamp || correction) as new_pkey
from 'sample.csv') to 'p.parquet';
D FROM 'p.parquet'
;
┌────────────────────────┐
│ new_pkey │
│ double │
├────────────────────────┤
│ 1.4452061963205285e+38 │
│ 1.4452061963205285e+38 │
└────────────────────────┘