
benchmark fst sqlite and feather


I came across this blog post; sqlite is much faster in their case. I think it is because the which() operation is very slow. What do you think?

https://kbroman.org/blog/2017/04/30/sqlite-feather-and-fst/

ssh352 · Sep 25 '19 06:09

Hi @ssh352, thanks for your question!

I'm a big fan of @kbroman and his blog, although he might have taken a few shortcuts in this one :-)

In general, comparing different serialization solutions is tricky as measurements depend on a lot of factors:

  • the (type of) data used
  • compressibility of the data
  • disk speed and IOPS
  • number of threads used
  • data dimensions (the ratio of rows to columns)
  • data size
  • CPU speed and (power) settings
  • the 'orientation' of the serializer (row- or column-wise)
  • whether the serializer supports random access reads
  • ...

For example, the post compares different compression settings, but the dataset used consists of incompressible column vectors (fully random numbers), so compression can only worsen performance (also at fst's default settings).
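
As a quick check of this effect (a sketch; the file names and the 1e6-row size are just for illustration), you can write the same random data with and without compression and compare file sizes:

library(fst)

random_df <- data.frame(x = rnorm(1e6))  # fully random doubles, hard to compress

write_fst(random_df, "random_0.fst", compress = 0)
write_fst(random_df, "random_80.fst", compress = 80)

# a ratio close to 1 means the data barely compresses, so the CPU
# spent on compression is pure overhead
file.size("random_0.fst") / file.size("random_80.fst")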

Also, the dataset is a matrix with many columns and a relatively small number of rows, while fst is optimized for very large row counts and low column/row ratios. The post uses a dataset with only 1e5 rows; for such small tables, column-wise multi-threading has relatively more overhead than for larger tables.

The microbenchmark measurement done at the end uses 100 cycles. This shouldn't be done for serialization benchmarks, as modern SSDs use large caches that avoid reading the actual data from cycle 2 onwards (it's simply retrieved from cache memory). Each read should be a single cycle on a unique file (dataset).
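
A sketch of that setup, reusing dt from the benchmark below (the file names are hypothetical): write one copy of the data per cycle and read each copy exactly once:

n_cycles <- 10
files <- sprintf("dt_copy_%02d.fst", seq_len(n_cycles))

# one file per measurement cycle
for (f in files) write_fst(dt, f)

# each file is read exactly once; note that freshly written files may
# still sit in the OS cache, so in a real benchmark the copies should
# be written well in advance (or the caches dropped first)
timings <- sapply(files, function(f) system.time(read_fst(f))["elapsed"])
median(timings)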

Also, system.time() doesn't play well with multi-threaded solutions: its user component sums the CPU time of all threads, which is easily mistaken for the actual (wall-clock) elapsed time.
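
A small illustration of that pitfall (a sketch; the thread count of 8 is an assumption about the machine):

library(fst)
threads_fst(8)  # let fst use 8 threads

t <- system.time(read_fst("dt.fst_50"))
t["user.self"]  # CPU time summed over all threads; can exceed the wall-clock time
t["elapsed"]    # actual wall-clock time, which is the number to report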

As you can see, there are a lot of parameters that can be tuned which can have a large effect on the measurements.

Below are the benchmarks for reading and writing a full dataset from/to disk on my system (a 6-core/12-thread i7-8750H with two NVMe SSDs in RAID 0). This run demonstrates that a different selection of data can completely alter the benchmark results.

library(fst)
library(RSQLite)
library(arrow)
library(feather)
library(data.table)

nr_of_rows <- 1e8
nr_of_columns <- 5

# generator for a single data column: random integers from 1 to 100
create_column <- function() { sample(1:100, nr_of_rows, replace = TRUE) }

# use first column for row number
dt <- data.table(X = 1:nr_of_rows)
# add the remaining columns by reference
sapply(1:(nr_of_columns - 1), function(x) {
  dt[, c(as.character(x)) := create_column()]
})

# remove leftover benchmark files from earlier runs
file.remove(list.files(pattern = "^dt\\."))
gc()

sqldb <- dbConnect(SQLite(), dbname = "dt.sqlite")

# small write to wake up the disk before the timed runs
saveRDS("wakeup", "wakeup.rds")

# full table writing
microbenchmark::microbenchmark(
  feather = write_feather(dt, "dt.feather"),
  fst50 = write_fst(dt, "dt.fst_50"),
  fst80 = write_fst(dt, "dt.fst_80", compress = 80),
  arrow = write_arrow(dt, "dt.arrow"),
  sqlite = dbWriteTable(sqldb, "dt", dt, row.names = FALSE, overwrite = TRUE, append = FALSE, field.types = NULL),
  rds_uncompressed = saveRDS(dt, "dt.rds_uncompressed", compress = FALSE),
  rds_compressed = saveRDS(dt, "dt.rds_compressed", compress = TRUE),
  times = 1
)
#> Unit: milliseconds
#>              expr         min          lq        mean      median
#>           feather   2062.4467   2062.4467   2062.4467   2062.4467
#>             fst50    251.1507    251.1507    251.1507    251.1507
#>             fst80   2193.1777   2193.1777   2193.1777   2193.1777
#>             arrow   2249.4808   2249.4808   2249.4808   2249.4808
#>            sqlite  60753.6885  60753.6885  60753.6885  60753.6885
#>  rds_uncompressed   2481.2388   2481.2388   2481.2388   2481.2388
#>    rds_compressed 194580.9560 194580.9560 194580.9560 194580.9560
#>           uq         max neval
#>    2062.4467   2062.4467     1
#>     251.1507    251.1507     1
#>    2193.1777   2193.1777     1
#>    2249.4808   2249.4808     1
#>   60753.6885  60753.6885     1
#>    2481.2388   2481.2388     1
#>  194580.9560 194580.9560     1

# full table reading
microbenchmark::microbenchmark(
  feather = read_feather("dt.feather"),
  fst_50 = read_fst("dt.fst_50"),
  fst_80 = read_fst("dt.fst_80"),
  arrow = read_arrow("dt.arrow"),
  sqlite = dbGetQuery(sqldb, "select * from dt"),
  rds_uncompressed = readRDS("dt.rds_uncompressed"),
  rds_compressed = readRDS("dt.rds_compressed"),
  times = 1
)
#> Unit: milliseconds
#>              expr        min         lq       mean     median         uq
#>           feather  9962.4628  9962.4628  9962.4628  9962.4628  9962.4628
#>            fst_50   988.2682   988.2682   988.2682   988.2682   988.2682
#>            fst_80   717.6037   717.6037   717.6037   717.6037   717.6037
#>             arrow  1292.8968  1292.8968  1292.8968  1292.8968  1292.8968
#>            sqlite 77834.6914 77834.6914 77834.6914 77834.6914 77834.6914
#>  rds_uncompressed  2520.9512  2520.9512  2520.9512  2520.9512  2520.9512
#>    rds_compressed  8229.7116  8229.7116  8229.7116  8229.7116  8229.7116
#>         max neval
#>   9962.4628     1
#>    988.2682     1
#>    717.6037     1
#>   1292.8968     1
#>  77834.6914     1
#>   2520.9512     1
#>   8229.7116     1

# reading a subset of rows
from_row <- 1e6
to_row <- 2e6
query <- paste0("select * from dt where X > ", from_row, " and X < ", to_row)

microbenchmark::microbenchmark(
  feather = {feat <- feather("dt.feather")
    feat[from_row:to_row,] },
  fst_50 = read_fst("dt.fst_50", from = from_row, to = to_row),
  fst_80 = read_fst("dt.fst_80", from = from_row, to = to_row),
  sqlite =  dbGetQuery(sqldb, query),
  times = 1
)
#> Unit: milliseconds
#>     expr       min        lq      mean    median        uq       max neval
#>  feather 5264.0108 5264.0108 5264.0108 5264.0108 5264.0108 5264.0108     1
#>   fst_50   11.6011   11.6011   11.6011   11.6011   11.6011   11.6011     1
#>   fst_80   12.6246   12.6246   12.6246   12.6246   12.6246   12.6246     1
#>   sqlite 6614.4568 6614.4568 6614.4568 6614.4568 6614.4568 6614.4568     1

# reading 1 column
microbenchmark::microbenchmark(
  feather = {feat <- feather("dt.feather")
    feat[, "3"] },
  fst_50 = read_fst("dt.fst_50", "3"),
  fst_80 = read_fst("dt.fst_80", "3"),
  sqlite =  dbGetQuery(sqldb, 'select "3" from dt'),
  times = 1
)
#> Unit: milliseconds
#>     expr        min         lq       mean     median         uq        max
#>  feather   257.7663   257.7663   257.7663   257.7663   257.7663   257.7663
#>   fst_50   190.5177   190.5177   190.5177   190.5177   190.5177   190.5177
#>   fst_80   197.0219   197.0219   197.0219   197.0219   197.0219   197.0219
#>   sqlite 18375.9254 18375.9254 18375.9254 18375.9254 18375.9254 18375.9254
#>  neval
#>      1
#>      1
#>      1
#>      1

# reading 1 row
row <- sample(1:nrow(dt), 1)
microbenchmark::microbenchmark(
  feather = {feat <- feather("dt.feather")
    feat[row, ] },
  fst_50 = read_fst("dt.fst_50", from = row, to = row),
  fst_80 = read_fst("dt.fst_80", from = row, to = row),
  sqlite =  dbGetQuery(sqldb, paste("select * from dt where X =", row)),
  times = 1
)
#> Unit: microseconds
#>     expr         min          lq        mean      median          uq
#>  feather 1317844.402 1317844.402 1317844.402 1317844.402 1317844.402
#>   fst_50     598.201     598.201     598.201     598.201     598.201
#>   fst_80     487.201     487.201     487.201     487.201     487.201
#>   sqlite 4317569.702 4317569.702 4317569.702 4317569.702 4317569.702
#>          max neval
#>  1317844.402     1
#>      598.201     1
#>      487.201     1
#>  4317569.702     1

dbDisconnect(sqldb)

In all these benchmarks, fst comes out best. The biggest difference between these benchmarks and those in @kbroman's post is that this dataset has far more rows than columns, which is much easier to process for column-oriented serializers like arrow or fst.

With the syntheticbench package I will try to facilitate fair and complete benchmarking of serializers across a wide range of column types and other parameters.

Thanks for your interesting question!

MarcusKlik · Sep 26 '19 12:09

Thank you.

I ran kbroman's benchmark. Performance for sqlite and fst is comparable, while in his results sqlite is much faster. His post is from 2017, so the package versions may have some effect.
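
For reference, a simple way to record the versions in play when comparing against an older post:

packageVersion("fst")
packageVersion("RSQLite")
packageVersion("feather")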

For your benchmark on sqlite, creating an index on the filter column, as kbroman did, will speed up sqlite (see the sketch below).
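
A minimal sketch of such an index, using the X column from the benchmark above (run once before the timed queries):

# index the filter column so range queries can avoid a full table scan
dbExecute(sqldb, "CREATE INDEX idx_x ON dt (X)")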

ssh352 · Sep 27 '19 03:09