ducklake icon indicating copy to clipboard operation
ducklake copied to clipboard

md5_number returns diff results between ducklake / duckdb

Open alanspires opened this issue 3 months ago • 2 comments

What happens?

When generating md5 hashes using md5_number, I am seeing duplicates when inserting into duckalke. If I run the same md5_number function outside of the insert it produces different results (not duplicated).

Additionally when experimenting I noticed this doesn't occur if:

  • I insert directly to duckdb and not ducklake
  • I change the pkey column definition from uhugeint to varchar

To Reproduce

Reproduction

sample.csv

attach if not exists 'ducklake:duckdb_database.ducklake' as lakehouse;
drop table if exists lakehouse.test;

create table if not exists lakehouse.test(
    ticker varchar not null,
    correction int not null,
    sequence_number ubigint not null,
    sip_timestamp ubigint not null,
    pkey uhugeint not null,
);

insert into lakehouse.test
select
    ticker,
    correction,
    sequence_number,
    sip_timestamp,
    md5_number(ticker || sequence_number || sip_timestamp || correction) as pkey
from read_csv_auto('/mnt/lake/sample.csv');

Diff MD5 values

.mode markdown
select *, md5_number(ticker || sequence_number || sip_timestamp || correction) as new_pkey
from lakehouse.test;
ticker correction sequence_number sip_timestamp pkey new_pkey
REMC 0 1437375 1105647211000000000 144520619632052850271484061619402571776 144520619632052854388488937575290486394
TASR 0 294801 1105628688000000000 144520619632052850271484061619402571776 144520619632052853151872378296537267878

OS:

Ubuntu 24.04.3 LTS

DuckDB Version:

v1.4.0-dev3506 (Development Version) a034ddc280

DuckLake Version:

b389c95

DuckDB Client:

CLI, Python

Hardware:

No response

Full Name:

Alan Spires

Affiliation:

Hobby Project

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a source build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • [x] Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • [x] Yes, I have

alanspires avatar Aug 31 '25 20:08 alanspires

I will take a look today! Thanks @alanspires

guillesd avatar Sep 02 '25 07:09 guillesd

Hi @alanspires I did reproduce this. I do wonder if this is related to the fact that DuckLake seems to support only up to u64, and uhugeint is u128. I will discuss this internally

guillesd avatar Sep 03 '25 15:09 guillesd

Parquet files do not have (u)hugeint representation. By default duckdb stores them as doubles, which can cause precision loss. If you want to keep the exact value you must use a string.

D COPY (select md5_number(ticker || sequence_number || sip_timestamp || correction) as new_pkey
  from 'sample.csv') to 'p.parquet';
D FROM 'p.parquet'
  ;
┌────────────────────────┐
│        new_pkey        │
│         double         │
├────────────────────────┤
│ 1.4452061963205285e+38 │
│ 1.4452061963205285e+38 │
└────────────────────────┘

pdet avatar Dec 11 '25 00:12 pdet