fst
benchmark fst sqlite and feather
I came across this blog post; SQLite is much faster in their case. I think it is because the which() operation is very slow. What do you think?
https://kbroman.org/blog/2017/04/30/sqlite-feather-and-fst/
Hi @ssh352, thanks for your question!
I'm a big fan of @kbroman and his blogs, although he might have taken a few shortcuts in this one :-)
In general, comparing different serialization solutions is tricky, as measurements depend on many factors:
- the (type of) data used
- compressibility of the data
- disk speed and IOPS
- number of threads used
- data dimensions (ratio rows / columns)
- data size
- CPU speed and (power) settings
- the 'orientation' of the serializer (row- or column-wise)
- whether the serializer supports random-access reads
- ...
For example, the post compares different compression settings, but the dataset used consists of incompressible column vectors (fully random numbers), so compression can only hurt performance (also at the default fst settings).
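To illustrate that point (just a minimal sketch, separate from the benchmark below; the column contents and file names are arbitrary): writing an essentially incompressible column and a highly repetitive column at different fst compression settings shows that only the repetitive data actually shrinks on disk, while the random data mainly pays the CPU cost of compressing.
library(fst)
n <- 1e7
# fully random integers: essentially incompressible
random_col <- data.frame(x = sample.int(1e9, n, replace = TRUE))
# highly repetitive integers: very compressible
repetitive_col <- data.frame(x = rep(1:100, length.out = n))
write_fst(random_col, "random_0.fst", compress = 0)
write_fst(random_col, "random_100.fst", compress = 100)
write_fst(repetitive_col, "repetitive_0.fst", compress = 0)
write_fst(repetitive_col, "repetitive_100.fst", compress = 100)
# compare the on-disk sizes of the four files
file.size(c("random_0.fst", "random_100.fst", "repetitive_0.fst", "repetitive_100.fst"))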
Also, the dataset is a matrix with many columns and a relatively small number of rows, while fst is optimized for very large row counts and a low column-to-row ratio. The post uses a dataset with only 1e5 rows; for such small tables, column-wise multi-threading is relatively more expensive than for larger tables.
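As a rough illustration of that effect (again just a sketch, separate from the benchmark below; the table dimensions are arbitrary), threads_fst() can be used to compare a single-threaded and a multi-threaded write of a small, wide table:
library(fst)
library(data.table)
# a small table with relatively many columns (wide and short)
small_dt <- as.data.table(matrix(runif(1e5 * 50), nrow = 1e5))
threads_fst(1)                          # force single-threaded operation
start <- Sys.time()
write_fst(small_dt, "small_single.fst")
t_single <- difftime(Sys.time(), start, units = "secs")
threads_fst(parallel::detectCores())    # back to using all logical cores
start <- Sys.time()
write_fst(small_dt, "small_multi.fst")
t_multi <- difftime(Sys.time(), start, units = "secs")
c(single_thread = as.numeric(t_single), multi_thread = as.numeric(t_multi))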
The microbenchmark measurement at the end of the post uses 100 cycles. This shouldn't be done for serialization benchmarks, because modern SSDs use large caches that avoid reading the actual data from cycle 2 onwards (it is simply retrieved from cache memory). Each read should be a single cycle on a unique file (dataset).
Also, system.time() doesn't play well with multi-threaded solutions and can report the combined time spent across all threads instead of the actual wall-clock time.
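A way around both issues could look like this (a minimal sketch with an arbitrary small example table; for truly cold reads the files would also have to be larger than the available cache memory, or the OS cache would have to be dropped): read each file exactly once and measure wall-clock time directly.
library(fst)
library(data.table)
example_dt <- data.table(x = runif(1e6))     # small example table
files <- sprintf("example_copy_%d.fst", 1:5)
for (f in files) write_fst(example_dt, f)    # one unique file per measurement
timings <- sapply(files, function(f) {
  start <- Sys.time()
  invisible(read_fst(f))                     # each file is read only once
  as.numeric(difftime(Sys.time(), start, units = "secs"))
})
timings                                      # wall-clock seconds per read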
As you can see, there are a lot of parameters that can be tuned which can have a large effect on the measurements.
Below are the benchmarks for reading and writing a full dataset from/to disk on my system (a 6-core/12-thread i7-8750H with two NVMe SSDs in a RAID 0 configuration). This run demonstrates that a different selection of data can completely alter the benchmark results.
library(fst)
library(RSQLite)
library(arrow)
library(feather)
library(data.table)
nr_of_rows <- 1e8
nr_of_columns <- 5
# generate the dataset: helper for a random integer column
create_column <- function() { sample(1:100, nr_of_rows, replace = TRUE) }
# use first column for row number
dt <- data.table(X = 1:nr_of_rows)
# add the remaining columns, each filled with random values
sapply(1:(nr_of_columns - 1), function(x) {
  dt[, c(as.character(x)) := create_column()]
})
# remove output files from previous runs
file.remove(list.files(pattern = "dt."))
gc()
# open the SQLite database used in the benchmarks
sqldb <- dbConnect(SQLite(), dbname = "dt.sqlite")
# warm-up write so the first timed measurement doesn't include one-time startup costs
saveRDS("wakeup", "wakeup.rds")
# full table writing
microbenchmark::microbenchmark(
  feather = write_feather(dt, "dt.feather"),
  fst50 = write_fst(dt, "dt.fst_50"),
  fst80 = write_fst(dt, "dt.fst_80", compress = 80),
  arrow = write_arrow(dt, "dt.arrow"),
  sqlite = dbWriteTable(sqldb, "dt", dt, row.names = FALSE, overwrite = TRUE, append = FALSE, field.types = NULL),
  rds_uncompressed = saveRDS(dt, "dt.rds_uncompressed", compress = FALSE),
  rds_compressed = saveRDS(dt, "dt.rds_compressed", compress = TRUE),
  times = 1
)
#> Unit: milliseconds
#> expr min lq mean median
#> feather 2062.4467 2062.4467 2062.4467 2062.4467
#> fst50 251.1507 251.1507 251.1507 251.1507
#> fst80 2193.1777 2193.1777 2193.1777 2193.1777
#> arrow 2249.4808 2249.4808 2249.4808 2249.4808
#> sqlite 60753.6885 60753.6885 60753.6885 60753.6885
#> rds_uncompressed 2481.2388 2481.2388 2481.2388 2481.2388
#> rds_compressed 194580.9560 194580.9560 194580.9560 194580.9560
#> uq max neval
#> 2062.4467 2062.4467 1
#> 251.1507 251.1507 1
#> 2193.1777 2193.1777 1
#> 2249.4808 2249.4808 1
#> 60753.6885 60753.6885 1
#> 2481.2388 2481.2388 1
#> 194580.9560 194580.9560 1
# full table reading
microbenchmark::microbenchmark(
  feather = read_feather("dt.feather"),
  fst_50 = read_fst("dt.fst_50"),
  fst_80 = read_fst("dt.fst_80"),
  arrow = read_arrow("dt.arrow"),
  sqlite = dbGetQuery(sqldb, "select * from dt"),
  rds_uncompressed = readRDS("dt.rds_uncompressed"),
  rds_compressed = readRDS("dt.rds_compressed"),
  times = 1
)
#> Unit: milliseconds
#> expr min lq mean median uq
#> feather 9962.4628 9962.4628 9962.4628 9962.4628 9962.4628
#> fst_50 988.2682 988.2682 988.2682 988.2682 988.2682
#> fst_80 717.6037 717.6037 717.6037 717.6037 717.6037
#> arrow 1292.8968 1292.8968 1292.8968 1292.8968 1292.8968
#> sqlite 77834.6914 77834.6914 77834.6914 77834.6914 77834.6914
#> rds_uncompressed 2520.9512 2520.9512 2520.9512 2520.9512 2520.9512
#> rds_compressed 8229.7116 8229.7116 8229.7116 8229.7116 8229.7116
#> max neval
#> 9962.4628 1
#> 988.2682 1
#> 717.6037 1
#> 1292.8968 1
#> 77834.6914 1
#> 2520.9512 1
#> 8229.7116 1
# reading a subset of rows
from_row <- 1e6
to_row <- 2e6
query <- paste0("select * from dt where X > ", from_row, " and X < ", to_row)
microbenchmark::microbenchmark(
  feather = {
    feat <- feather("dt.feather")
    feat[from_row:to_row, ]
  },
  fst_50 = read_fst("dt.fst_50", from = from_row, to = to_row),
  fst_80 = read_fst("dt.fst_80", from = from_row, to = to_row),
  sqlite = dbGetQuery(sqldb, query),
  times = 1
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> feather 5264.0108 5264.0108 5264.0108 5264.0108 5264.0108 5264.0108 1
#> fst_50 11.6011 11.6011 11.6011 11.6011 11.6011 11.6011 1
#> fst_80 12.6246 12.6246 12.6246 12.6246 12.6246 12.6246 1
#> sqlite 6614.4568 6614.4568 6614.4568 6614.4568 6614.4568 6614.4568 1
# reading 1 column
microbenchmark::microbenchmark(
  feather = {
    feat <- feather("dt.feather")
    feat[, "3"]
  },
  fst_50 = read_fst("dt.fst_50", "3"),
  fst_80 = read_fst("dt.fst_80", "3"),
  # note: SQLite reads the unquoted 3 as the literal value 3 for every row,
  # not as the column named "3" (which would need to be quoted in the query)
  sqlite = dbGetQuery(sqldb, "select 3 from dt"),
  times = 1
)
#> Unit: milliseconds
#> expr min lq mean median uq max
#> feather 257.7663 257.7663 257.7663 257.7663 257.7663 257.7663
#> fst_50 190.5177 190.5177 190.5177 190.5177 190.5177 190.5177
#> fst_80 197.0219 197.0219 197.0219 197.0219 197.0219 197.0219
#> sqlite 18375.9254 18375.9254 18375.9254 18375.9254 18375.9254 18375.9254
#> neval
#> 1
#> 1
#> 1
#> 1
# reading 1 row
row <- sample(1:nrow(dt), 1)
microbenchmark::microbenchmark(
  feather = {
    feat <- feather("dt.feather")
    feat[row, ]
  },
  fst_50 = read_fst("dt.fst_50", from = row, to = row),
  fst_80 = read_fst("dt.fst_80", from = row, to = row),
  sqlite = dbGetQuery(sqldb, paste("select * from dt where X =", row)),
  times = 1
)
#> Unit: microseconds
#> expr min lq mean median uq
#> feather 1317844.402 1317844.402 1317844.402 1317844.402 1317844.402
#> fst_50 598.201 598.201 598.201 598.201 598.201
#> fst_80 487.201 487.201 487.201 487.201 487.201
#> sqlite 4317569.702 4317569.702 4317569.702 4317569.702 4317569.702
#> max neval
#> 1317844.402 1
#> 598.201 1
#> 487.201 1
#> 4317569.702 1
dbDisconnect(sqldb)
In all these benchmarks, fst comes out best. The biggest difference between these benchmarks and those in @kbroman's post is that this dataset has many more rows than columns, which is much easier to process for column-oriented serializers like arrow or fst.
With the syntheticbench package I will try to facilitate fair and complete benchmarking of serializers over a wide range of column types and other features.
Thanks for your interesting question!
Thank you,
I ran kbroman's benchmark. Performance for sqlite and fst is comparable, while in his results sqlite is much faster. His post is from 2017, so the package versions may have some effect.
For your benchmark on sqlite, if you create an index as kbroman did, it will speed up sqlite.
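For reference, adding that index could look something like this (a minimal sketch; the database, table and column names follow the benchmark above):
library(DBI)
library(RSQLite)
sqldb <- dbConnect(SQLite(), dbname = "dt.sqlite")
dbExecute(sqldb, "CREATE INDEX IF NOT EXISTS idx_dt_X ON dt (X)")  # index the X column
# the same range query as above can now use the index instead of a full table scan
res <- dbGetQuery(sqldb, "select * from dt where X > 1000000 and X < 2000000")
dbDisconnect(sqldb)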