Add a `duckdb_to_parquet` function using low-level arrow functions
Just to mention that I wrote an enhanced interface to SQL, available at https://github.com/jllipatz/SQL. It's still WIP...
Originally posted by @jllipatz in https://github.com/ddotta/parquetize/issues/27#issuecomment-1508256414
_Originally posted by @jllipatz in https://github.com/ddotta/parquetize/pull/27#issuecomment-1500516472_
Hello,
The `dbSendQuery`/`dbFetch` pair doesn't work well with duckdb: the whole query result is materialized before `dbFetch` is reached, overfilling the RAM. Here is a solution that works without consuming a lot of RAM. It also runs much faster than simply including a COPY TO parquet in the SQL query. Perhaps it could become the starting point of a new function in {parquetize}, if somebody adds the partitioning options that exist for the other functions.
```r
library(DBI)
library(duckdb)
library(arrow)

SQL2parquet <- function(query, path, chunk_size = 1e6) {
  con <- dbConnect(duckdb::duckdb())
  # Stream the result as Arrow record batches instead of fetching everything at once
  reader <- duckdb_fetch_record_batch(
    dbSendQuery(con, query, arrow = TRUE),
    chunk_size = chunk_size
  )
  file <- FileOutputStream$create(path)
  batch <- reader$read_next_batch()
  if (!is.null(batch)) {
    s <- batch$schema
    writer <- ParquetFileWriter$create(
      s, file,
      properties = ParquetWriterProperties$create(names(s))
    )
    i <- 0
    # Write each batch to the parquet file and release it before fetching the next one
    while (!is.null(batch)) {
      i <- i + 1
      message(sprintf("%d, %d rows", i, nrow(batch)))
      writer$WriteTable(arrow_table(batch), chunk_size = chunk_size)
      batch <- NULL; gc()
      batch <- reader$read_next_batch()
    }
    writer$Close()
  }
  file$close()
  dbDisconnect(con, shutdown = TRUE)
}
```
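For reference, a minimal usage sketch (the query, file names and column source are hypothetical), alongside the single `COPY ... TO` statement the comment above compares against:

```r
# Hypothetical example: stream a large CSV into a parquet file, 1e6 rows at a time
SQL2parquet(
  "SELECT * FROM read_csv_auto('big_file.csv')",
  path = "big_file.parquet",
  chunk_size = 1e6
)

# The one-statement alternative mentioned above, done entirely inside duckdb
# con <- dbConnect(duckdb::duckdb())
# dbExecute(con, "COPY (SELECT * FROM read_csv_auto('big_file.csv'))
#                 TO 'big_file.parquet' (FORMAT PARQUET)")
# dbDisconnect(con, shutdown = TRUE)
```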
_Originally posted by @nbc in https://github.com/ddotta/parquetize/pull/27#issuecomment-1500516472_
Hi @jllipatz, thanks, I'm very interested. I think preparing a parquet file with duckdb could be a good use case, but I don't feel comfortable enough with arrow's internals to start working on this for the moment. I need to explore more.
I agree with @nbc. @jllipatz, I find your idea really interesting and promising 🚀 It's just that mastering all these low-level arrow features represents a fairly high entry cost 😢