Add a `duckdb_to_parquet` function using low-level arrow functions
Just to mention that I wrote an enhanced interface to SQL, available at https://github.com/jllipatz/SQL. It's still WIP...
Originally posted by @jllipatz in https://github.com/ddotta/parquetize/issues/27#issuecomment-1508256414
_Originally posted by @jllipatz in https://github.com/ddotta/parquetize/pull/27#issuecomment-1500516472_
Hello,
The `dbSendQuery`/`dbFetch` pair doesn't work well with duckdb: the whole query result is materialized before `dbFetch` is reached, overfilling the RAM. Here is a solution that works without consuming a lot of RAM. It also runs much faster than simply including a COPY TO parquet in the SQL query. Perhaps it could become the starting point of a new function in {parquetize}, if somebody adds the partitioning options that exist for the other functions.
```r
library(DBI)
library(duckdb)
library(arrow)

SQL2parquet <- function(query, path, chunk_size = 1e6) {
  con <- dbConnect(duckdb::duckdb())
  # Stream the result as Arrow record batches instead of fetching everything at once
  reader <- duckdb_fetch_record_batch(
    dbSendQuery(con, query, arrow = TRUE),
    chunk_size = chunk_size
  )
  file <- FileOutputStream$create(path)
  batch <- reader$read_next_batch()
  if (!is.null(batch)) {
    s <- batch$schema
    writer <- ParquetFileWriter$create(
      s, file,
      properties = ParquetWriterProperties$create(names(s))
    )
    i <- 0
    # Write each batch to the parquet file and release it before fetching the next one
    while (!is.null(batch)) {
      i <- i + 1
      message(sprintf("%d, %d rows", i, nrow(batch)))
      writer$WriteTable(arrow_table(batch), chunk_size = chunk_size)
      batch <- NULL; gc()
      batch <- reader$read_next_batch()
    }
    writer$Close()
  }
  file$close()
  dbDisconnect(con, shutdown = TRUE)
}
```
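For reference, a minimal usage sketch (the query, file names and column source are hypothetical), alongside the single `COPY ... TO` statement the comment above compares against:

```r
# Hypothetical example: stream a large CSV into a parquet file, 1e6 rows at a time
SQL2parquet(
  "SELECT * FROM read_csv_auto('big_file.csv')",
  path = "big_file.parquet",
  chunk_size = 1e6
)

# The one-statement alternative mentioned above, done entirely inside duckdb
# con <- dbConnect(duckdb::duckdb())
# dbExecute(con, "COPY (SELECT * FROM read_csv_auto('big_file.csv'))
#                 TO 'big_file.parquet' (FORMAT PARQUET)")
# dbDisconnect(con, shutdown = TRUE)
```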
_Originally posted by @nbc in https://github.com/ddotta/parquetize/pull/27#issuecomment-1500516472_
Hi @jllipatz, thanks, I'm very interested. I think preparing a parquet file with duckdb could be a good use case, but I don't feel comfortable enough with arrow's internals to start working on this for the moment. I need to explore more.
I agree with @nbc. @jllipatz, I find your idea really interesting and promising 🚀 It's just that mastering all these low-level arrow features represents a fairly high entry cost 😢