Columns from hive partitioning should be used to skip reading external parquet files
What happens?
Hey!
DuckLake is reading remote files unnecessarily when all requested information is available in the hive-partitioned columns.
In this situation the remote files should not be accessed at all, because all of the needed information is already contained in the data_file paths returned by ducklake_list_files().
This severely affects performance when there are thousands of Parquet files.
It would be great if the hive partitioning columns could be used to skip reading most of the files when their contents are not needed at all.
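To illustrate the point: the partition values such a query needs can be recovered from the file paths alone. A minimal sketch in plain Python (the helper name is hypothetical, not DuckLake API) of the kind of parsing this would take, using the two paths from the reproduction below:

```python
from urllib.parse import unquote

def hive_partition_values(path: str) -> dict[str, str]:
    """Extract hive-style key=value directory segments from a file path."""
    values = {}
    for segment in path.split("/"):
        if "=" in segment and not segment.endswith(".parquet"):
            key, _, value = segment.partition("=")
            values[key] = unquote(value)
    return values

files = [
    "https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2014-41/subset=warc/part-00103-e1115d40-d2cc-4445-873c-2b206f427726.c000.gz.parquet",
    "https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-47/subset=warc/part-00299-f8733a6e-b0cb-407b-b1b6-f5eed82915da.c000.gz.parquet",
]

# A query like SELECT DISTINCT(crawl) could be answered from these alone:
distinct_crawls = sorted({hive_partition_values(f)["crawl"] for f in files})
print(distinct_crawls)  # ['CC-MAIN-2014-41', 'CC-MAIN-2025-47']
```

No HTTP request is needed to produce that answer, which is the heart of this request.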
To Reproduce
ATTACH 'ducklake:commoncrawl.ducklake' AS commoncrawl (DATA_PATH 'tmp_always_empty');
SET VARIABLE common_crawl_old_file = 'https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2014-41/subset=warc/part-00103-e1115d40-d2cc-4445-873c-2b206f427726.c000.gz.parquet';
SET VARIABLE common_crawl_new_file = 'https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-47/subset=warc/part-00299-f8733a6e-b0cb-407b-b1b6-f5eed82915da.c000.gz.parquet';
CREATE TABLE IF NOT EXISTS commoncrawl.old_files AS
FROM read_parquet(getvariable('common_crawl_old_file'))
WITH NO DATA;
CREATE TABLE IF NOT EXISTS commoncrawl.new_files AS
FROM read_parquet(getvariable('common_crawl_new_file'))
WITH NO DATA;
CALL ducklake_add_data_files(
'commoncrawl',
'old_files',
getvariable('common_crawl_old_file')
);
CALL ducklake_add_data_files(
'commoncrawl',
'new_files',
getvariable('common_crawl_new_file')
);
-- Create unified view which supports both new and old timestamp formats
CREATE VIEW commoncrawl.view AS (
FROM commoncrawl.new_files
UNION ALL BY NAME
SELECT * REPLACE (fetch_time::TIMESTAMPTZ AS fetch_time)
FROM commoncrawl.old_files
);
Then I closed the DuckDB session with Ctrl+C and opened the same DuckLake file again (so that it doesn't use the cached Parquet files).
-- Then, after logging out and back in
CALL enable_logging('HTTP');
-- Check which files are stored in ducklake
SELECT data_file FROM ducklake_list_files('commoncrawl','old_files')
UNION ALL
SELECT data_file FROM ducklake_list_files('commoncrawl','new_files');
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ data_file │
│ varchar │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2014-41/subset=warc/part-00103-e1115d40-d2cc-4445-873c-2b206f427726.c000.gz.parquet │
│ https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-47/subset=warc/part-00299-f8733a6e-b0cb-407b-b1b6-f5eed82915da.c000.gz.parquet │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
-- Request only information which is present in the file URLs
SELECT DISTINCT(crawl) FROM commoncrawl.view;
-- No HTTP requests should be needed, but DuckDB still reads the remote files
SELECT request FROM duckdb_logs_parsed('HTTP');
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ request │
│ struct("type" varchar, url varchar, start_time timestamp with time zone, duration_ms bigint, headers map(varchar, varchar)) │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {'type': GET, 'url': 'https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-47/subset=warc/part-00299-f8733a6e-b0cb-407b-b1b6-f5eed82915da.c000.gz.parquet', 'start_time': '2025-11-24 10:… │
│ {'type': GET, 'url': 'https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2014-41/subset=warc/part-00103-e1115d40-d2cc-4445-873c-2b206f427726.c000.gz.parquet', 'start_time': '2025-11-24 10:… │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
OS:
macOS
DuckDB Version:
1.4.2
DuckLake Version:
77f2512
DuckDB Client:
CLI
Hardware:
No response
Full Name:
Onni Hakala
Affiliation:
Freelancer
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have not tested with any build
Did you include all relevant data sets for reproducing the issue?
No - Other reason (please specify in the issue body)
Did you include all code required to reproduce the issue?
- [ ] Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- [ ] Yes, I have
I can't reproduce this. Are you sure those GET requests are from this query? SELECT DISTINCT(crawl) FROM commoncrawl.view;?
Thanks for opening this issue in the DuckLake issue tracker! To resolve this issue, our team needs a reproducible example. This includes:
- A source code snippet which reproduces the issue.
- The snippet should be self-contained, i.e., it should contain all imports and should use relative paths instead of hard-coded paths (please avoid /Users/JohnDoe/...).
- A lot of issues can be reproduced with plain SQL code executed in the DuckDB command line client. If you can provide such an example, it greatly simplifies the reproduction process and likely results in a faster fix.
- If the script needs additional data, please share the data as a CSV, JSON, or Parquet file. Unfortunately, we cannot fix issues that can only be reproduced with a confidential data set. Support contracts allow sharing confidential data with the core DuckDB team under NDA.
For more detailed guidelines on how to create reproducible examples, please visit Stack Overflow's “Minimal, Reproducible Example” page.
I can't reproduce this. Are you sure those GET requests are from this query?
SELECT DISTINCT(crawl) FROM commoncrawl.view;?
Hey and thanks for looking into this 😊!
Did you create the frozen DuckLake first, then close the DuckDB session and start a new session?
The point is to simulate a frozen DuckLake which someone else downloads and uses for the first time.
I can definitely reproduce this on my Mac, and the requests seem to come exactly from the SELECT DISTINCT(crawl):
$ ls -lah commoncrawl.ducklake
Permissions Size User Date Modified Name
.rw-r--r-- 5.3M onnimonni 24 Nov 21:12 commoncrawl.ducklake
$ duckdb
DuckDB v1.4.2 (Andium) 68d7555f68
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D ATTACH 'ducklake:commoncrawl.ducklake' AS commoncrawl;
D CALL enable_logging('HTTP');
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ 0 rows │
└─────────┘
D SELECT request FROM duckdb_logs_parsed('HTTP');
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ request │
│ struct("type" varchar, url varchar, start_time timestamp with time zone, duration_ms bigint, headers map(varchar, varchar)) │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 0 rows │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D SELECT data_file FROM ducklake_list_files('commoncrawl','old_files')
UNION ALL
SELECT data_file FROM ducklake_list_files('commoncrawl','new_files');
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ data_file │
│ varchar │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2014-41/subset=warc/part-00103-e1115d40-d2cc-4445-873c-2b206f427726.c000.gz.parquet │
│ https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-47/subset=warc/part-00299-f8733a6e-b0cb-407b-b1b6-f5eed82915da.c000.gz.parquet │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D SELECT request FROM duckdb_logs_parsed('HTTP');
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ request │
│ struct("type" varchar, url varchar, start_time timestamp with time zone, duration_ms bigint, headers map(varchar, varchar)) │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 0 rows │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D SELECT DISTINCT(crawl) FROM commoncrawl.view;
┌─────────────────┐
│ crawl │
│ varchar │
├─────────────────┤
│ CC-MAIN-2014-41 │
│ CC-MAIN-2025-47 │
└─────────────────┘
D SELECT request FROM duckdb_logs_parsed('HTTP');
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ request │
│ struct("type" varchar, url varchar, start_time timestamp with time zone, duration_ms bigint, headers map(varchar, varchar)) │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {'type': GET, 'url': 'https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-47/subset=warc/part-00299-f8733a6e-b0cb-407b-b1b6-f5eed82915da.c000.gz.parquet', 'start_time': '2025-11-24 21:… │
│ {'type': GET, 'url': 'https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2014-41/subset=warc/part-00103-e1115d40-d2cc-4445-873c-2b206f427726.c000.gz.parquet', 'start_time': '2025-11-24 21:… │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
You haven't actually specified that the table is partitioned after creation (via ALTER TABLE), so when you add the files no partition metadata is saved.
I might be missing something, but in this example the table isn't altered; it's accessed through a VIEW. Is that why no partition metadata is saved?
I'm running exactly the same commands as above to create it and then attaching to the same commoncrawl.ducklake again.
What I meant is that you are not partitioning the table https://ducklake.select/docs/stable/duckdb/advanced_features/partitioning
However, this doesn't matter, because right now we don't seem to support adding partition value information from ducklake_add_data_files. https://ducklake.select/docs/stable/specification/tables/ducklake_file_partition_value.
I do understand your issue, but we are mixing things:
- Without stating that a table is partitioned, DuckLake will not infer this from a random data path that you pass to ducklake_add_data_files.
- Making more or fewer storage requests is somewhat a different issue, and one that we are looking into in general.
- What could be a nice enhancement (or even a bug fix) would be to add partition values to ducklake_file_partition_value whenever we do ducklake_add_data_files. But this will only happen if the table is first set to be partitioned (ALTER TABLE <t> SET PARTITIONED BY).
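For anyone following along, declaring the partitioning per the linked docs would look like this (column names taken from this example; note that, as stated above, ducklake_add_data_files does not currently populate partition values even then):

```sql
-- Declare the partitioning before adding files, so DuckLake can
-- record partition values in ducklake_file_partition_value.
ALTER TABLE commoncrawl.old_files SET PARTITIONED BY (crawl);
ALTER TABLE commoncrawl.new_files SET PARTITIONED BY (crawl);
```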
Thanks for the explanation!
Would it be possible to infer the partitions already at this step?
CREATE TABLE IF NOT EXISTS commoncrawl.old_files AS
FROM read_parquet(getvariable('common_crawl_old_file'))
WITH NO DATA;
read_parquet already seems to know the difference between hive-partitioned columns and regular columns anyway.
At least to me as a user it would make sense, but maybe this has some downstream issues in DuckDB overall which I'm not aware of.
Actually, even better would be for DuckDB to notice that there are hive partitions when using:
CREATE TABLE IF NOT EXISTS commoncrawl.old_files AS
FROM read_parquet(getvariable('common_crawl_old_file'))
WITH NO DATA;
and fail.
It would then be great if it asked the user to use WITH PARTITIONS or WITH NO PARTITIONS.
Maybe not exactly these names, but something like this to state the intention explicitly 😊.
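The suggested check could be as simple as scanning the source path for hive-style segments before creating the table. A hypothetical sketch (the WITH PARTITIONS / WITH NO PARTITIONS syntax is only the proposal above, not existing DuckDB syntax):

```python
import re

# Hive-style partition segments look like "key=value" between slashes.
HIVE_SEGMENT = re.compile(r"/([^/=]+)=([^/]+)(?=/)")

def has_hive_partitions(path: str) -> bool:
    """Return True if the path contains hive-style key=value directories."""
    return bool(HIVE_SEGMENT.search(path))

path = ("https://data.commoncrawl.org/cc-index/table/cc-main/warc/"
        "crawl=CC-MAIN-2014-41/subset=warc/part-00103.c000.gz.parquet")

if has_hive_partitions(path):
    # A hypothetical error message for the proposed behavior:
    print("Source path contains hive partitions: "
          "specify WITH PARTITIONS or WITH NO PARTITIONS")
```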
What could be a nice enhancement (or even a bug fix) would be to add partition values to ducklake_file_partition_value whenever we do ducklake_add_data_files. But this will only happen if the table is first set to be partitioned (ALTER TABLE <t> SET PARTITIONED BY).
^ @guillesd this is functionality that would really benefit one of my use cases.
Yes, I agree @chaegordon! Although this issue is a bit misleading, I think the point should be --> partition metadata should be populated on ducklake_add_data_files
I think the point should be --> partition metadata should be populated on ducklake_add_data_files
Yeah, this is a better explanation. I will alter the title too 👍