
"malloc: Heap corruption detected" after running merge_adjacent_files() on PG+S3 partitioned ducklake

Open kiwialec opened this issue 7 months ago • 6 comments

I've managed to reproduce this a couple of times, but only with my data:

  1. Create a new DuckLake (PG catalog, S3 object store, partitioned on 2 fields)
  2. Copy in ~40GB of data from an Iceberg lake (INSERT INTO ... FROM s3tables WHERE ..)
  3. Everything works fine - at this point I can query etc. and get the expected results
  4. Run CALL main.merge_adjacent_files() - it appears to succeed but does nothing
  5. All subsequent queries to the DuckLake cause the process to crash with the malloc error below
duckdb(36187,0x16b99f000) malloc: Heap corruption detected, free list is damaged at 0x6000037c53e0
*** Incorrect guard value: 105553178690160
duckdb(36187,0x16b99f000) malloc: *** set a breakpoint in malloc_error_break to debug
zsh: abort      duckdb
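
Roughly what I ran, as a sketch - the host, bucket, table, and partition fields here are placeholders, not my real names, and my actual filter is omitted:

    INSTALL ducklake; LOAD ducklake;
    INSTALL postgres; INSTALL httpfs;

    -- 1. New DuckLake: Postgres catalog + S3 data path (placeholder values)
    ATTACH 'ducklake:postgres:dbname=lake host=pg.example.com' AS main
        (DATA_PATH 's3://my-bucket/ducklake/');

    -- Create the table empty, then partition it on two fields
    -- (the Iceberg lake is already attached here as s3tables_catalog)
    CREATE TABLE main.events AS FROM s3tables_catalog.events LIMIT 0;
    ALTER TABLE main.events SET PARTITIONED BY (field_a, field_b);

    -- 2. Copy ~40GB over from the Iceberg lake
    INSERT INTO main.events FROM s3tables_catalog.events WHERE ...;  -- real filter omitted

    -- 4. Merge adjacent files
    CALL main.merge_adjacent_files();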

This happens whether I run the queries from my local Mac or a remote Ubuntu machine.

Confusingly, I exported the contents of the Postgres catalog before and after, and none of the data appears different. Looking through S3, I can't see any modified files.

Happy to run this a couple of times if you let me know how to get useful debugging info out of DuckDB.

kiwialec · May 29 '25 07:05

Thanks for the report!

Does this behavior only happen when using Postgres/S3, or does it also happen locally when using DuckDB + local storage?

Does the behavior happen after reconnecting as well? Or does calling merge_adjacent_files only influence the running process, and the behavior is fine again after reconnecting?
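
For a purely local test, something along these lines should work (paths are placeholders), using a DuckDB file as the catalog and a local directory for storage:

    -- Local-only setup: DuckDB-file catalog + local data directory
    ATTACH 'ducklake:metadata.ducklake' AS local_lake
        (DATA_PATH 'local_data/');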

Mytherin · May 29 '25 13:05

The behaviour happens when S3 is used as the storage - it did not appear when I used the SSD as storage (using both duckdb and postgres as the catalog).

When it happens, it spoils the ducklake completely - after restarting the process, it will crash any time the ducklake is queried.

kiwialec · May 29 '25 17:05

Could you try querying the Parquet files directly? Perhaps there's a particular Parquet file that is causing issues here.
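
For example, something along these lines, with the path adjusted to your data path:

    -- Scan all Parquet files under the DuckLake data path, bypassing the catalog
    SELECT count(*) FROM read_parquet('s3://my-bucket/ducklake/**/*.parquet');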

Mytherin · May 29 '25 18:05

> The behaviour happens when S3 is used as the storage - it did not appear when I used the SSD as storage (using both duckdb and postgres as the catalog).
>
> When it happens, it spoils the ducklake completely - after restarting the process, it will crash any time the ducklake is queried.

I'm running into a similar issue. It appears to work fine when DATA_PATH is local storage on my Mac, but it crashes frequently with malloc errors when using S3 storage.

z-ai-lab · Jun 08 '25 00:06

> Could you try querying the Parquet files directly? Perhaps there's a particular Parquet file that is causing issues here.

I tried several queries across various columns of all the Parquet files (select avg(x) from parquet_scan('s3://.../ducklake/**');) and all the queries I tried worked. However, retrying the process described in the first post did not reproduce the malloc error on the day I tried (ducklake extension v673f44d) - but I haven't had a chance to do more testing since then.

kiwialec · Jun 08 '25 08:06

Thanks for checking. An issue we fixed upstream recently was related to auto-loading of secrets: https://github.com/duckdb/duckdb/pull/17650. That could cause crashes when starting DuckLake connected to S3 and immediately issuing a query before the secrets are loaded. A workaround is to explicitly instantiate the secrets by calling FROM duckdb_secrets(). Perhaps that is also what is going on here?
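
That is, right after attaching and before the first query, something like:

    -- Force secrets to be instantiated up-front, before any DuckLake query
    FROM duckdb_secrets();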

Mytherin · Jun 08 '25 11:06