duckplyr
Read in multiple CSVs when file paths aren't amenable to glob syntax
I routinely work with multiple large CSVs with a mess of file paths that aren't amenable to glob syntax. When working with duckdb I can supply these as, say, SELECT * FROM read_csv(['file_1.csv', 'file_2.csv']) and that works. I can't figure out how to do the equivalent in duckplyr.
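For reference, a minimal sketch (not from the original post) of that DuckDB SQL approach from R, using the DBI and duckdb packages; the file names are placeholders:

library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
# DuckDB's read_csv accepts a list of file paths in SQL
res <- dbGetQuery(con, "SELECT * FROM read_csv(['file_1.csv', 'file_2.csv'])")
dbDisconnect(con, shutdown = TRUE)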
I've tried:
file_paths <- c("file_1.csv", "file_2.csv") OR
file_paths <- list("file_1.csv", "file_2.csv")
duckplyr_df_from_csv(file_paths) %>% do_something
It doesn't error, but it only reads in the first file.
Is this possible? If so, how? If not, I think there should at least be a warning when a list or vector of multiple file paths is passed.
Thanks. Code like file_paths %>% map(duckplyr_df_from_csv) %>% bind_rows() has worked for me in practice, but I agree that this should be streamlined. Would you like to contribute a PR?
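A minimal sketch of that pattern, with the package loads filled in (the file paths are placeholders):

library(duckplyr)
library(purrr)
library(dplyr)

file_paths <- c("file_1.csv", "file_2.csv")
# Read each file with duckplyr, then stack the results
combined <- file_paths %>%
  map(duckplyr_df_from_csv) %>%
  bind_rows()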
I hadn't thought to use map, thanks for the tip.
I'm sorry, I don't have the experience or knowledge of how to do a PR :(
bind_rows() reads the data into memory; %>% reduce(union_all) is better, but it will also read into memory in duckplyr 0.4.0 (it works better in duckplyr 0.3.0): https://github.com/tidyverse/dplyr/pull/7049.
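A sketch of the reduce(union_all) variant, assuming file_paths as in the earlier sketch; as noted above, it may still materialize in memory depending on the duckplyr version:

library(duckplyr)
library(purrr)
library(dplyr)

# Stack the per-file relations with union_all instead of bind_rows
combined <- file_paths %>%
  map(duckplyr_df_from_csv) %>%
  reduce(union_all)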
What should work is duckplyr_df_from_csv("file_*.csv"), but I hear this is not an option here, and I'm seeing mixed results too: https://github.com/duckdb/duckdb/issues/12903.
Action item: Implement bind_rows() to use reduce(union_all) under the hood.
The action items here are a subset of those in https://github.com/duckdblabs/duckplyr/issues/181#issuecomment-2215275517, let's move the discussion there.