duckdb-r

Parquet large file write performance degradation

arthurgailes opened this issue 6 months ago · 2 comments

Hello,

duckdb and arrow write parquet files at roughly the same speed until the data reaches about 10 GB, at which point duckdb becomes roughly an order of magnitude slower.

That is a very large dataset, but the issue also affects partitioned writes and COPY x.parquet TO y.parquet workflows.
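For reference, the parquet-to-parquet COPY workflow mentioned above looks like the following minimal sketch; the file names are illustrative, and a tiny source file is generated first so the snippet is self-contained (the slowdown only shows up at much larger sizes):

# Sketch of a parquet-to-parquet COPY round trip with duckdb
library(duckdb)

con <- dbConnect(duckdb::duckdb())
src <- tempfile(fileext = ".parquet")
dst <- tempfile(fileext = ".parquet")

# Write a small source parquet file
dbExecute(con, sprintf(
  "COPY (SELECT range AS id FROM range(1000)) TO '%s' (FORMAT 'parquet')", src))

# Rewrite it file-to-file via COPY ... TO
dbExecute(con, sprintf(
  "COPY (SELECT * FROM read_parquet('%s')) TO '%s' (FORMAT 'parquet')", src, dst))

n_rows <- dbGetQuery(con, sprintf(
  "SELECT count(*) AS n FROM read_parquet('%s')", dst))$n

dbDisconnect(con, shutdown = TRUE)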

library(duckdb)
#> Loading required package: DBI
#> Warning: package 'DBI' was built under R version 4.3.3
library(arrow)
#> Warning: package 'arrow' was built under R version 4.3.3
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.3.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(microbenchmark)
library(glue)
#> Warning: package 'glue' was built under R version 4.3.3

# Generate a large DataFrame
set.seed(123)
n <- 1e9  # Number of rows
df <- data.frame(
  id = 1:n,
  value = rnorm(n),
  category1 = sample(LETTERS, n, replace = TRUE),
  category2 = sample(letters, n, replace = TRUE),
  description = replicate(n, paste(sample(c(letters, LETTERS), 10, replace = TRUE), collapse = ""))
)

df_size <- object.size(df)
df_size_gb <- df_size / (1024^3)  # bytes to GB
cat("DataFrame size: ", df_size_gb, "GB\n")
#> DataFrame size:  93.13226 GB

# Connect to DuckDB
con <- dbConnect(duckdb::duckdb())

# Copy DataFrame to DuckDB
dbWriteTable(con, "df", df, overwrite = TRUE)

# Create a temporary file
temp_file <- tempfile(fileext = ".parquet")

# Measure the time to write the DataFrame to Parquet using SQL
write_time <- microbenchmark(
  duck = dbExecute(con, glue("COPY (SELECT * FROM df) TO '{temp_file}' (FORMAT 'parquet')")),
  arrow = write_parquet(df, temp_file),
  times = 1
)

print(write_time)
#> Unit: seconds
#>   expr       min        lq      mean    median        uq       max neval
#>   duck 2096.6953 2096.6953 2096.6953 2096.6953 2096.6953 2096.6953     1
#>  arrow  298.3979  298.3979  298.3979  298.3979  298.3979  298.3979     1
cat("Parquet file written to: ", temp_file, "\n")
#> Parquet file written to:  Z:/RTemp\Rtmpo7Y8u1\file63602fc514b3.parquet

# Clean up
dbDisconnect(con, shutdown = TRUE)

Created on 2024-08-03 with reprex v2.1.1
