Contrary to its name, `CALL ducklake_add_data_files(...)` only supports a single Parquet file
What happens?
Hey!
I'm trying to add all 88492 Parquet files in Common Crawl to a frozen DuckLake to test it out.
But I noticed that `ducklake_add_data_files` doesn't support arrays of Parquet files as of now, and I get the following error:
```
No function matches the given name and argument types 'ducklake_add_data_files(VARCHAR, VARCHAR, VARCHAR[])'. You might need to add explicit type casts.
Candidate functions:
    ducklake_add_data_files(VARCHAR, VARCHAR, VARCHAR, schema : VARCHAR, hive_partitioning : BOOLEAN, ignore_extra_columns : BOOLEAN, allow_missing : BOOLEAN)
```
Would you be willing to add support for arrays of Parquet files in a single `CALL ducklake_add_data_files`?
To Reproduce
```sql
INSTALL ducklake;
LOAD ducklake;
INSTALL httpfs;
LOAD httpfs;

SET VARIABLE crawl_ids = (
    SELECT ARRAY_AGG(id)
    FROM read_json('https://index.commoncrawl.org/collinfo.json')
    WHERE id NOT IN (
        -- Columnar index not available for these old crawls
        'CC-MAIN-2012',
        'CC-MAIN-2009-2010',
        'CC-MAIN-2008-2009'
    )
);

SET VARIABLE parquet_files = (
    SELECT ARRAY_AGG('https://data.commoncrawl.org/' || column0)
    FROM read_csv(
        list_transform(
            getvariable('crawl_ids'),
            n -> format(
                'https://data.commoncrawl.org/crawl-data/{}/cc-index-table.paths.gz',
                n
            )
        ),
        header = false
    )
);

ATTACH 'ducklake:commoncrawl.ducklake' AS commoncrawl (DATA_PATH 'tmp_always_empty');

-- Create an empty table with the same schema as the most recent parquet file
CREATE OR REPLACE TABLE commoncrawl.archives AS
    FROM read_parquet(list_min(getvariable('parquet_files')))
    WITH NO DATA;

CALL ducklake_add_data_files(
    'commoncrawl',
    'archives',
    list_min(getvariable('parquet_files')),
    allow_missing => true
);
```
OS:
MacOS
DuckDB Version:
1.4.2
DuckLake Version:
f134ad8
DuckDB Client:
CLI
Hardware:
No response
Full Name:
Onni Hakala
Affiliation:
Freelancer
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have not tested with any build
Did you include all relevant data sets for reproducing the issue?
No - Other reason (please specify in the issue body)
Did you include all code required to reproduce the issue?
- [ ] Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- [ ] Yes, I have
This could potentially be nice, I agree.
I was able to solve this with Python (roughly the per-file loop sketched below).
I guess idempotency is going to be pretty important here, since many public datasets are rate-limited.
It would just be great to have a certain number of parallel requests going on simultaneously instead of waiting for each single file.
My server has been adding the 88492 files from the example above for the last 20 hours, and it still has 8504 files to go.
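For illustration, a minimal sketch of such a per-file fallback loop, assuming the `duckdb` Python package and the same catalog and table names as in the reproduction above; the `progress.txt` bookkeeping file used for crude resume/idempotency is a made-up convention here, not a DuckLake feature:

```python
# Per-file fallback: register each Parquet URL with its own
# ducklake_add_data_files call, keeping a local log so the job can
# be resumed after rate limiting or a crash.
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake; INSTALL httpfs; LOAD httpfs;")
con.execute(
    "ATTACH 'ducklake:commoncrawl.ducklake' AS commoncrawl "
    "(DATA_PATH 'tmp_always_empty')"
)

# URLs built the same way as the parquet_files SQL variable above.
parquet_files: list[str] = []

try:
    done = set(open("progress.txt").read().split())
except FileNotFoundError:
    done = set()

with open("progress.txt", "a") as log:
    for url in parquet_files:
        if url in done:
            continue  # crude idempotency: skip files already registered
        escaped = url.replace("'", "''")  # URLs should not contain quotes anyway
        con.execute(
            f"CALL ducklake_add_data_files('commoncrawl', 'archives', "
            f"'{escaped}', allow_missing => true)"
        )
        log.write(url + "\n")
        log.flush()
```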
Yes, I think a good idea would indeed be to pass a list of files (of arbitrary length) and let DuckDB do the IO. As a side note, you can make your calls concurrent; that should not be a problem and would have sped up the process, I believe.
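A hedged sketch of that concurrent variant, assuming one `duckdb` cursor per worker thread and the same catalog and table names as above; the worker count is arbitrary, and whether concurrent writers are safe against a given DuckLake catalog backend is taken from the comment above rather than verified here:

```python
# Concurrent variant: each worker thread gets its own cursor on the
# same attached DuckLake and registers one file per call.
from concurrent.futures import ThreadPoolExecutor
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake; INSTALL httpfs; LOAD httpfs;")
con.execute(
    "ATTACH 'ducklake:commoncrawl.ducklake' AS commoncrawl "
    "(DATA_PATH 'tmp_always_empty')"
)

parquet_files: list[str] = []  # URLs built as in the SQL reproduction above

def add_file(url: str) -> str:
    cur = con.cursor()  # separate connection for this thread
    escaped = url.replace("'", "''")
    cur.execute(
        f"CALL ducklake_add_data_files('commoncrawl', 'archives', "
        f"'{escaped}', allow_missing => true)"
    )
    cur.close()
    return url

with ThreadPoolExecutor(max_workers=8) as pool:
    for url in pool.map(add_file, parquet_files):
        print("added", url)
```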