
Can't use `OVERRIDE_DATA_PATH` with frozen ducklakes

onnimonni opened this issue 1 month ago

What happens?

I was checking if one can change the data access path in frozen Ducklakes.

@guillesd mentioned on Discord that one could override the DATA_PATH of a frozen ducklake: https://discord.com/channels/909674491309850675/1376911883486888048/1442477672914747472

Common Crawl data can be accessed either through CloudFront (unauthenticated and rate limited) or through S3 (authenticated and without rate limits).
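For example (using the 2025-43 index file from the reproduction below; the S3 variant assumes valid AWS credentials are configured), the same Parquet file is reachable under both prefixes:

-- Same object, two access paths: CloudFront (unauthenticated) vs. S3 (authenticated)
FROM read_parquet('https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-43/subset=warc/part-00299-08d3d8a4-29d7-4627-90c5-75c34ee698ca.c000.gz.parquet') LIMIT 1;
FROM read_parquet('s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-43/subset=warc/part-00299-08d3d8a4-29d7-4627-90c5-75c34ee698ca.c000.gz.parquet') LIMIT 1;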

I created a reproduction script which shows that even though the DATA_PATH is overridden to S3, the files in the frozen ducklake are not accessed through the S3 interface.

Maybe it's not supposed to work with frozen ducklakes yet?

Remember to export your credentials as environment variables (export AWS_ACCESS_KEY_ID='XXXXXX' AWS_SECRET_ACCESS_KEY='YYYYYY') or use your AWS provider config to authenticate with S3 for the test.
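Alternatively (a sketch assuming the DuckDB aws extension is available and your AWS config/SSO profile can access the bucket; the secret name is arbitrary), the credentials can be resolved through the AWS SDK credential chain instead of explicit environment variables:

-- Let DuckDB pick up credentials from the standard AWS credential chain
-- (shared config, SSO, instance profile, ...) instead of explicit keys
CREATE SECRET IF NOT EXISTS commoncrawl_s3_chain (
    TYPE S3,
    PROVIDER credential_chain,
    REGION 'us-east-1'
);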

To Reproduce

SET autoinstall_known_extensions=1;
SET autoload_known_extensions=1;

-- Start with local directory
SET VARIABLE data_path = 'tmp_always_empty';

-- We can't create this with an https:// path, otherwise we get
-- Not implemented Error:
-- HTTPFileSystem: DirectoryExists is not implemented!
-- We can't write to the remote, but starting with a local path allows us to use OVERRIDE_DATA_PATH later to change to S3
ATTACH 'ducklake:commoncrawl.ducklake' AS commoncrawl (
    -- By using 'cc-main' as the schema, the production URLs would match perfectly
    -- FIXME: but somehow we can't re-attach if we use that
    --METADATA_SCHEMA 'cc-main', 
    DATA_PATH getvariable('data_path')
);

-- Override to https path for reading
SET VARIABLE data_path = 'https://data.commoncrawl.org/cc-index/table';

SET VARIABLE common_crawl_new_file = getvariable('data_path') || '/cc-main/warc/crawl=CC-MAIN-2025-43/subset=warc/part-00299-08d3d8a4-29d7-4627-90c5-75c34ee698ca.c000.gz.parquet';

CREATE TABLE IF NOT EXISTS commoncrawl.warc AS
    FROM read_parquet(getvariable('common_crawl_new_file'))
    WITH NO DATA;
-- Use the same kind of partitioning as used in production
ALTER TABLE commoncrawl.warc SET PARTITIONED BY (crawl, subset);

CALL ducklake_add_data_files(
    'commoncrawl',
    'warc',
    getvariable('common_crawl_new_file')
);

-- Between CC-MAIN-2013-20 and CC-MAIN-2021-43, 'fetch_time' does not have a timezone
SET VARIABLE common_crawl_old_file = getvariable('data_path') || '/cc-main/warc/crawl=CC-MAIN-2021-43/subset=warc/part-00299-16202947-a809-4711-8221-79ab0a79d5b1.c000.gz.parquet';

CREATE TABLE IF NOT EXISTS commoncrawl.legacy_timestamp_warc AS
    FROM read_parquet(getvariable('common_crawl_old_file'))
    WITH NO DATA;
-- Use the same kind of partitioning as used in production
ALTER TABLE commoncrawl.legacy_timestamp_warc SET PARTITIONED BY (crawl, subset);

CALL ducklake_add_data_files(
    'commoncrawl',
    'legacy_timestamp_warc',
    getvariable('common_crawl_old_file')
);

-- Create a unified view which supports both new and old timestamp formats
CREATE VIEW commoncrawl.view AS (
    FROM commoncrawl.warc
    UNION ALL BY NAME
    SELECT * REPLACE (fetch_time::TIMESTAMPTZ AS fetch_time)
    FROM commoncrawl.legacy_timestamp_warc
);

-- Show everything is working
USE commoncrawl;
SHOW TABLES;

-- Stop using the database
USE memory;
DETACH commoncrawl;

-- https://github.com/duckdb/ducklake/issues/576
-- If this worked, we could have a single table for all columnar crawls
-- ALTER TABLE commoncrawl.legacy_timestamp_warc ALTER COLUMN fetch_time TYPE TIMESTAMPTZ;

-- Re-attach with direct https access 
ATTACH 'ducklake:commoncrawl.ducklake' AS commoncrawl (
    DATA_PATH 'https://data.commoncrawl.org/cc-index/table',
    --METADATA_SCHEMA 'cc-main',
    OVERRIDE_DATA_PATH TRUE
);

USE commoncrawl;
SHOW TABLES;

USE memory;
DETACH commoncrawl;

-- Then test S3 access with credentials from the environment
CREATE SECRET IF NOT EXISTS commoncrawl_s3 (
    TYPE S3,
    KEY_ID getenv('AWS_ACCESS_KEY_ID'),
    SECRET getenv('AWS_SECRET_ACCESS_KEY'),
    REGION 'us-east-1'
);

ATTACH 'ducklake:commoncrawl.ducklake' AS commoncrawl (
    DATA_PATH 's3://commoncrawl/cc-index/table',
    --METADATA_SCHEMA 'cc-main',
    OVERRIDE_DATA_PATH TRUE
);

USE commoncrawl;
SHOW TABLES;

CALL enable_logging('HTTP');

-- Ask only for information which is present in the file names
SELECT DISTINCT(crawl) FROM commoncrawl.view;

-- NOTE: This should show S3 requests and not data.commoncrawl.org requests
FROM duckdb_logs_parsed('HTTP');

OS:

MacOS

DuckDB Version:

1.4.2

DuckLake Version:

f134ad8

DuckDB Client:

CLI

Hardware:

No response

Full Name:

Onni Hakala

Affiliation:

Freelancer

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have not tested with any build

Did you include all relevant data sets for reproducing the issue?

No - Other reason (please specify in the issue body)

Did you include all code required to reproduce the issue?

  • [ ] Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • [ ] Yes, I have

onnimonni, Nov 24 '25

So the problem here is that with add_data_files, because we want to let the user specify a file wherever they want, we store absolute paths instead of relative paths, i.e.:

FROM __ducklake_metadata_commoncrawl.ducklake_data_file select path_is_relative, path;
┌──────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ path_is_relative │                                                                      path                                                                       │
│     boolean      │                                                                     varchar                                                                     │
├──────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ false            │ https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-43/subset=warc/part-00299-08d3d8a4-29d7-4627-90c5-75c34ee698ca.c0…  │
│ false            │ https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2021-43/subset=warc/part-00299-16202947-a809-4711-8221-79ab0a79d5b1.c0…  │
└──────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

I am unsure whether this is a bug... it seems to me that this is expected behaviour; otherwise, added data files would always need to be in the default path to be "usable". I think this is just one more example of where we need better consistency, but I am unsure whether the behaviour of this function will change.
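For what it's worth, since the absolute paths live in the ducklake_data_file table shown above, one possible workaround is to rewrite the stored prefix directly in the metadata catalog. This is an untested sketch that assumes editing the metadata catalog by hand is acceptable for your use case (it bypasses DuckLake's own bookkeeping); the entries stay absolute, just pointed at S3 instead of CloudFront:

-- Untested workaround sketch: swap the CloudFront prefix for the S3 prefix
-- in the registered absolute paths by editing the metadata catalog directly
UPDATE __ducklake_metadata_commoncrawl.ducklake_data_file
SET path = replace(
        path,
        'https://data.commoncrawl.org/cc-index/table/',
        's3://commoncrawl/cc-index/table/'
    )
WHERE path_is_relative = false;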

@pdet what do you think?

guillesd, Nov 24 '25

This is expected behavior.

Paths are only stored as relative if they are under the same path as the table being created. If not, we store absolute paths.

This looks more like a feature request, so I'll convert it to a discussion.
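A quick way to check which of the registered files will follow OVERRIDE_DATA_PATH (relative entries) and which will keep their original location (absolute entries) is to group on the flag in the metadata catalog, e.g. for the reproduction above:

-- Relative entries resolve against the (overridden) data path; absolute entries do not
SELECT path_is_relative, count(*) AS files
FROM __ducklake_metadata_commoncrawl.ducklake_data_file
GROUP BY path_is_relative;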

pdet, Dec 05 '25