
Contrary to its name, `CALL ducklake_add_data_files(...)` only supports a single Parquet file

Open · onnimonni opened this issue 1 month ago · 3 comments

What happens?

Hey!

I'm trying to add all 88,492 Parquet files from Common Crawl to a frozen DuckLake to test it out.

But I noticed that ducklake_add_data_files doesn't currently support arrays of Parquet files, and I get the following error:

No function matches the given name and argument types 'ducklake_add_data_files(VARCHAR, VARCHAR, VARCHAR[])'. You might need to add explicit type casts.
        Candidate functions:
        ducklake_add_data_files(VARCHAR, VARCHAR, VARCHAR, schema : VARCHAR, hive_partitioning : BOOLEAN, ignore_extra_columns : BOOLEAN, allow_missing : BOOLEAN)

Would you be willing to add support for passing an array of Parquet files to a single CALL ducklake_add_data_files?

To Reproduce

INSTALL ducklake;
LOAD ducklake;
INSTALL httpfs;
LOAD httpfs;

SET VARIABLE crawl_ids = (
    SELECT ARRAY_AGG(id)
    FROM read_json('https://index.commoncrawl.org/collinfo.json')
    WHERE id NOT IN (
        -- Columnar index not available for these old crawls
        'CC-MAIN-2012',
        'CC-MAIN-2009-2010',
        'CC-MAIN-2008-2009'
    )
);

SET VARIABLE parquet_files = (
    SELECT ARRAY_AGG('https://data.commoncrawl.org/' || column0)
    FROM read_csv(
        list_transform(
            getvariable('crawl_ids'),
            n -> format(
                'https://data.commoncrawl.org/crawl-data/{}/cc-index-table.paths.gz',
                n
            )
        ),
        header=false
    )
);

ATTACH 'ducklake:commoncrawl.ducklake' AS commoncrawl (DATA_PATH 'tmp_always_empty');

-- Create an empty table with the same schema as the first parquet file
-- (list_min picks the lexicographically smallest URL)
CREATE OR REPLACE TABLE commoncrawl.archives AS
    FROM read_parquet(list_min(getvariable('parquet_files')))
    WITH NO DATA;

-- Passing the whole list fails with the binder error above; replacing it
-- with list_min(getvariable('parquet_files')) (a single VARCHAR) works:
CALL ducklake_add_data_files(
    'commoncrawl',
    'archives',
    getvariable('parquet_files'),
    allow_missing => true
);

OS:

MacOS

DuckDB Version:

1.4.2

DuckLake Version:

f134ad8

DuckDB Client:

CLI

Hardware:

No response

Full Name:

Onni Hakala

Affiliation:

Freelancer

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have not tested with any build

Did you include all relevant data sets for reproducing the issue?

No - Other reason (please specify in the issue body)

Did you include all code required to reproduce the issue?

  • [ ] Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • [ ] Yes, I have

onnimonni avatar Nov 22 '25 20:11 onnimonni

This could potentially be nice, I agree.

guillesd avatar Nov 24 '25 13:11 guillesd

I was able to solve this with Python.

I guess idempotency is going to be pretty important here, since many public datasets are rate limited.
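
For illustration, here is a minimal sketch of that loop-based approach. The local checkpoint file, the connection setup, and the naive URL interpolation are assumptions for the sketch rather than the exact script; it also assumes the commoncrawl.archives table was already created as in the SQL above.

import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake; INSTALL httpfs; LOAD httpfs;")
con.execute(
    "ATTACH 'ducklake:commoncrawl.ducklake' AS commoncrawl "
    "(DATA_PATH 'tmp_always_empty')"
)

# Build the file list with the same query as in the SQL repro above.
parquet_files = [
    row[0]
    for row in con.execute("""
        SELECT 'https://data.commoncrawl.org/' || column0
        FROM read_csv(
            list_transform(
                (SELECT list(id)
                 FROM read_json('https://index.commoncrawl.org/collinfo.json')
                 WHERE id NOT IN ('CC-MAIN-2012', 'CC-MAIN-2009-2010', 'CC-MAIN-2008-2009')),
                n -> format('https://data.commoncrawl.org/crawl-data/{}/cc-index-table.paths.gz', n)
            ),
            header=false
        )
    """).fetchall()
]

# A local checkpoint file (an assumption for this sketch, not a DuckLake
# feature) lets a re-run after a rate-limit failure skip files that were
# already added.
done_path = "added_files.txt"
try:
    with open(done_path) as f:
        done = {line.strip() for line in f}
except FileNotFoundError:
    done = set()

with open(done_path, "a") as log:
    for url in parquet_files:
        if url in done:
            continue
        # The URLs contain no quote characters, so plain interpolation is safe.
        con.execute(
            f"CALL ducklake_add_data_files('commoncrawl', 'archives', "
            f"'{url}', allow_missing => true)"
        )
        log.write(url + "\n")
        log.flush()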

It would just be great to have a certain number of parallel requests going on simultaneously instead of waiting on each file one by one.

My server has been adding the 88,492 files from the example above for the last 20 hours, and it still has 8,504 files to go.

onnimonni avatar Nov 24 '25 19:11 onnimonni

Yes, I think a good idea indeed would be to pass a list of files (of arbitrary length) and let DuckDB do the IO. As a side note, you can make your calls concurrent; that should not be a problem and would have sped up the process, I believe.
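
As a hedged sketch of that, reusing parquet_files from the sketch above: the worker count, the cursor-per-thread pattern, and the retry-on-conflict handling are assumptions here, not something verified against DuckLake's commit behavior.

import time
from concurrent.futures import ThreadPoolExecutor

import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake; INSTALL httpfs; LOAD httpfs;")
con.execute(
    "ATTACH 'ducklake:commoncrawl.ducklake' AS commoncrawl "
    "(DATA_PATH 'tmp_always_empty')"
)

def add_file(url: str, retries: int = 5) -> None:
    # Each thread gets its own cursor; a single DuckDB connection should
    # not run queries from multiple threads at once.
    cur = con.cursor()
    for attempt in range(retries):
        try:
            cur.execute(
                f"CALL ducklake_add_data_files('commoncrawl', 'archives', "
                f"'{url}', allow_missing => true)"
            )
            return
        except duckdb.Error:
            # Concurrent commits to the same table may conflict (an
            # assumption about the failure mode); back off and retry.
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)

# parquet_files as built in the earlier sketch.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(add_file, parquet_files))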

guillesd avatar Nov 25 '25 09:11 guillesd