duckdb-r
duckdb-r copied to clipboard
Parquet large file write performance degredation
Hello,
duckdb
and arrow seem to write parquet files at roughly the same speed until the data gets to about 10+ GB, at which point duckdb is about an order of magnitude slower.
This is very large, but the issue also impacts partitioned writes and COPY x.parquet TO y.parquet
workflows.
library(duckdb)
#> Loading required package: DBI
#> Warning: package 'DBI' was built under R version 4.3.3
library(arrow)
#> Warning: package 'arrow' was built under R version 4.3.3
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.3.2
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(microbenchmark)
library(glue)
#> Warning: package 'glue' was built under R version 4.3.3
# Generate a large DataFrame
set.seed(123)
n <- 1e9 # Number of rows
df <- data.frame(
id = 1:n,
value = rnorm(n),
category1 = sample(LETTERS, n, replace = TRUE),
category2 = sample(letters, n, replace = TRUE),
description = replicate(n, paste(sample(c(letters, LETTERS), 10, replace = TRUE), collapse = ""))
)
df_size <- object.size(df)
df_size_mb <- df_size / (1024^3)
cat("DataFrame size: ", df_size_mb, "GB\n")
#> DataFrame size: 93.13226 GB
# Connect to DuckDB
con <- dbConnect(duckdb::duckdb())
# Copy DataFrame to DuckDB
dbWriteTable(con, "df", df, overwrite = TRUE)
# Create a temporary file
temp_file <- tempfile(fileext = ".parquet")
# Measure the time to write the DataFrame to Parquet using SQL
write_time <- microbenchmark(
duck = dbExecute(con, glue("COPY (SELECT * FROM df) TO '{temp_file}' (FORMAT 'parquet')")),
arrow = write_parquet(df, temp_file),
times = 1
)
print(write_time)
#> Unit: seconds
#> expr min lq mean median uq max neval
#> duck 2096.6953 2096.6953 2096.6953 2096.6953 2096.6953 2096.6953 1
#> arrow 298.3979 298.3979 298.3979 298.3979 298.3979 298.3979 1
cat("Parquet file written to: ", temp_file, "\n")
#> Parquet file written to: Z:/RTemp\Rtmpo7Y8u1\file63602fc514b3.parquet
# Clean up
dbDisconnect(con, shutdown = TRUE)
Created on 2024-08-03 with reprex v2.1.1