duckplyr
Read in multiple CSVs when file paths aren't amenable to glob syntax
I routinely work with multiple large CSVs with a mess of file paths that aren't amenable to glob syntax. When working with duckdb I can supply these as, say, SELECT * FROM read_csv(['file_1.csv', 'file_2.csv']) and that works. I can't figure out how to do the equivalent in duckplyr.
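For reference, a minimal sketch (not from the original post) of that DuckDB SQL approach from R, using the DBI and duckdb packages; the file names are placeholders:

library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
# DuckDB's read_csv accepts a list of file paths in SQL
res <- dbGetQuery(con, "SELECT * FROM read_csv(['file_1.csv', 'file_2.csv'])")
dbDisconnect(con, shutdown = TRUE)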
I've tried:
file_paths <- c("file_1.csv", "file_2.csv") OR
file_paths <- list("file_1.csv", "file_2.csv")
duckplyr_df_from_csv(file_paths) %>% do_something
It doesn't error, but it only reads in the first file.
Is this possible? If so, how? If not, I think there should at least be a warning when a list or vector of multiple file paths is passed.
Thanks. Code like file_paths %>% map(duckplyr_df_from_csv) %>% bind_rows() has worked for me in practice, but I agree that this should be streamlined. Would you like to contribute a PR?
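A minimal sketch of that pattern, with the package loads filled in (the file paths are placeholders):

library(duckplyr)
library(purrr)
library(dplyr)

file_paths <- c("file_1.csv", "file_2.csv")
# Read each file with duckplyr, then stack the results
combined <- file_paths %>%
  map(duckplyr_df_from_csv) %>%
  bind_rows()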
I hadn't thought to use map, thanks for the tip.
I'm sorry, I don't have the experience or knowledge of how to do a PR :(
bind_rows() reads the data into memory; %>% reduce(union_all) is better, but it will also read into memory in duckplyr 0.4.0 (it works better in duckplyr 0.3.0): https://github.com/tidyverse/dplyr/pull/7049.
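A sketch of the reduce(union_all) variant, assuming file_paths as in the earlier sketch; as noted above, it may still materialize in memory depending on the duckplyr version:

library(duckplyr)
library(purrr)
library(dplyr)

# Stack the per-file relations with union_all instead of bind_rows
combined <- file_paths %>%
  map(duckplyr_df_from_csv) %>%
  reduce(union_all)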
What should work is duckplyr_df_from_csv("file_*.csv"), but I hear this is not an option here, and I'm seeing mixed results too: https://github.com/duckdb/duckdb/issues/12903.
Action item: Implement bind_rows() to use reduce(union_all) under the hood.
The action items here are a subset of those in https://github.com/duckdblabs/duckplyr/issues/181#issuecomment-2215275517, let's move the discussion there.