
add Python module to read/write from Python pandas

BenoitLondon opened this issue 8 years ago • 4 comments

Hi, nice package! It would be a great competitor to the feather package if it were compatible with Python pandas DataFrames. Any plans to make it available in Python?

Cheers, Benoit

PS: my own benchmarks
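
For reproducibility: these calls assume roughly the following setup (package names inferred from the functions used; filename is the one defined further down in this thread).

library(microbenchmark)  # microbenchmark()
library(feather)         # read_feather(), write_feather()
library(data.table)      # fread(), fwrite(), data.table()
library(fst)             # read.fst(), write.fst()

filename <- "../data/test.feather"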

> r_bench <- microbenchmark(
+     read_f = {dt1 <- read_feather(path = filename)},
+     read_dt = {dt1 <- fread(file = gsub(".feather", ".csv", filename), showProgress = FALSE)},
+     read_fst = {dt2 <- read.fst(path = gsub(".feather", ".fst", filename))},
+     read_fstc = {dt2 <- read.fst(path = gsub(".feather", ".fstc", filename))},
+     read_rds = {dt2 <- readRDS(file = gsub(".feather", ".rds", filename))},
+     read_rdsc = {dt2 <- readRDS(file = gsub(".feather", ".rdsc", filename))},
+     times = 3)
> 
> print(r_bench)
Unit: milliseconds
      expr       min        lq      mean    median        uq       max neval
    read_f  73.49535  74.38310  74.80852  75.27085  75.46511  75.65938     3
   read_dt 409.07989 410.28315 411.33413 411.48641 412.46125 413.43609     3
  read_fst  67.21488  69.68649  74.13367  72.15810  77.59306  83.02803     3
 read_fstc 113.58359 113.87905 114.01423 114.17451 114.22955 114.28458     3
  read_rds 363.55270 366.95543 370.44090 370.35816 373.88500 377.41183     3
 read_rdsc 571.20738 571.27464 575.87312 571.34189 578.20598 585.07008     3

> w_bench <- microbenchmark(
+     write_f = {write_feather(x = dt, path = filename)},
+     write_dt = {fwrite(dt, file = gsub(".feather", ".csv", filename))},
+     write_fst = {write.fst(x = dt, path = gsub(".feather", ".fst", filename))},
+     write_fstc = {write.fst(x = dt, path = gsub(".feather", ".fstc", filename), compress = 100)},
+     write_rds = {saveRDS(object = dt, file = gsub(".feather", ".rds", filename), compress = FALSE)},
+     write_rdsc = {saveRDS(object = dt, file = gsub(".feather", ".rdsc", filename), compress = TRUE)},
+     times = 3)
> 
> print(w_bench)
Unit: milliseconds
       expr        min         lq       mean     median         uq        max neval
    write_f   77.57399   81.01968   84.72863   84.46536   88.30596   92.14655     3
   write_dt   65.89461   69.54576  538.90557   73.19692  775.41105 1477.62517     3
  write_fst   73.60318   75.90385  626.80981   78.20452  903.41312 1728.62172     3
 write_fstc  202.33712  211.38273  220.21007  220.42834  229.14654  237.86473     3
  write_rds  329.07046 3128.41469 4061.86755 5927.75891 5928.26610 5928.77328     3
 write_rdsc 2436.99475 2443.04194 2447.12685 2449.08913 2452.19291 2455.29668     3

BenoitLondon · Mar 08 '17 12:03

Hi @BenoitLondon, thanks for submitting your issue and your benchmarks! I would be very interested in the exact data set that you used for them.

Regarding the Python request: the idea is definitely to make fst available for pandas data structures. But additional features will be developed on the R platform first, before porting to Python (such as multi-threaded compression, parallel sorting, row streaming, appending data to existing fst files, and the use of SSE2 instructions). However, in coming versions I will start to refactor more of the core functionality of the fst package into an independent C++ module, so that migrating to Python will take less effort.

In its current state, the fst package gets some of its speed gains (for compression) from direct bit-mapping of R's (sometimes peculiar) memory structure into a compressed format. These bit-mappers will have to be rewritten into a format suitable for Python, which will take some effort and time. But your feature request is definitely on the list!

MarcusKlik · Mar 08 '17 20:03

Hi, thanks for the quick answer! The advantage over feather is that files are compressed while staying in the same speed ballpark.

For these benchmarks I used a dummy data.table:

filename <- "../data/test.feather"
dt <- data.table(
  a = sample(letters, 1e6, replace = TRUE),
  b = round(100 * runif(1e6)),
  c = 1:1e6,
  d = rnorm(1e6),
  e = 1,
  f = c(NA, "adsasas"))

I also tested with a real file I use, which is much bigger (574607 obs. of 78 variables) but probably sparser, and got these results, where (compressed) fst is the fastest to read and write.

Here the csv is 182 MB, the fst 276 MB, and the fstc 32.3 MB.
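
As a quick (illustrative) ratio calculation from those sizes:

# file sizes reported above, in MB
sizes <- c(csv = 182, fst = 276, fstc = 32.3)
round(sizes / sizes["fstc"], 1)  # csv is ~5.6x, plain fst ~8.5x the compressed size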

> r_bench <- microbenchmark(
+     read_f = {dt1 <- read_feather(path = filename)},
+     read_dt = {dt1 <- fread(file = gsub(".feather", ".csv", filename), showProgress = FALSE)},
+     read_fst = {dt2 <- read.fst(path = gsub(".feather", ".fst", filename))},
+     read_fstc = {dt2 <- read.fst(path = gsub(".feather", ".fstc", filename))},
+     read_rds = {dt2 <- readRDS(file = gsub(".feather", ".rds", filename))},
+     read_rdsc = {dt2 <- readRDS(file = gsub(".feather", ".rdsc", filename))},
+     times = 3)
> print(r_bench, signif = 5)
Unit: milliseconds
      expr     min      lq      mean  median      uq     max neval
    read_f  778.32  788.25  793.8522  798.19  801.62  805.05     3
   read_dt 2154.20 2248.70 2307.9869 2343.30 2384.90 2426.50     3
  read_fst  424.97  583.92  645.9801  742.87  756.49  770.11     3
 read_fstc  717.11  732.85  833.5493  748.59  891.77 1035.00     3
  read_rds 1911.60 1920.70 1928.6099 1929.80 1937.10 1944.50     3
 read_rdsc 3259.70 3322.20 3360.1304 3384.80 3410.40 3435.90     3


> w_bench <- microbenchmark(
+     write_f = {write_feather(x = dt, path = filename)},
+     write_dt = {fwrite(dt, file = gsub(".feather", ".csv", filename))},
+     write_fst = {write.fst(x = dt, path = gsub(".feather", ".fst", filename))},
+     write_fstc = {write.fst(x = dt, path = gsub(".feather", ".fstc", filename), compress = 100)},
+     write_rds = {saveRDS(object = dt, file = gsub(".feather", ".rds", filename), compress = FALSE)},
+     write_rdsc = {saveRDS(object = dt, file = gsub(".feather", ".rdsc", filename), compress = TRUE)},
+     times = 3)
> 
> print(w_bench, signif = 3)
Unit: seconds
       expr  min   lq     mean median   uq  max neval
    write_f 2.37 2.37 2.454158   2.38 2.50 2.62     3
   write_dt 1.18 1.43 1.515637   1.68 1.68 1.69     3
  write_fst 2.31 2.35 2.391432   2.38 2.43 2.48     3
 write_fstc 1.32 1.34 1.353537   1.36 1.37 1.38     3
  write_rds 3.84 4.99 5.594808   6.14 6.47 6.81     3
 write_rdsc 8.42 8.44 8.466584   8.46 8.49 8.52     3

BenoitLondon · Mar 09 '17 17:03

Hi @BenoitLondon, thanks a lot for that! It's very interesting to see the large variation in benchmark results between systems. I ran your benchmark and noticed that the timings are completely dominated by the character column (a) in your data.table. Character columns are by far the slowest to process because, in R, each string in a character vector has its own memory address. Columns of the other basic types occupy a contiguous chunk of memory, which is much faster to access (from C++). Feather also has this problem with character columns. Notice, however, how things change when you turn your data.table into a data.frame (and use a data set 10 times as large):

dt <- data.frame(
  a = sample(letters, 1e7, replace = TRUE),
  b = round(100 * runif(1e7)),
  c = 1:1e7,
  d = rnorm(1e7),
  e = 1,
  f = c(NA, "adsasas"))

microbenchmark(
  write_f = write_feather(dt, "0.feather"),
  write_dt = fwrite(dt, "0.csv"),
  write.fst = write.fst(dt, "0.fst"),
  write.fst50 = write.fst(dt, "50.fst", compress = 50),
  write.fst100 = write.fst(dt, "100.fst", compress = 100),
  write_rds = saveRDS(dt, "0.rds", compress = FALSE),
  write_rdsc = saveRDS(dt, "100.rds", compress = TRUE),
  times = 1)

gives:

Unit: milliseconds
         expr        min         lq       mean     median         uq        max neval
      write_f   659.3132   659.3132   659.3132   659.3132   659.3132   659.3132     1
     write_dt   894.2941   894.2941   894.2941   894.2941   894.2941   894.2941     1
    write.fst   519.5184   519.5184   519.5184   519.5184   519.5184   519.5184     1
  write.fst50   339.5828   339.5828   339.5828   339.5828   339.5828   339.5828     1
 write.fst100  1792.4256  1792.4256  1792.4256  1792.4256  1792.4256  1792.4256     1
    write_rds   848.7571   848.7571   848.7571   848.7571   848.7571   848.7571     1
   write_rdsc 24482.1443 24482.1443 24482.1443 24482.1443 24482.1443 24482.1443     1

and

microbenchmark(
  read_f = read_feather("0.feather"),
  read_dt = fread("0.csv"),
  read.fst = read.fst("0.fst"),
  read.fst50 = read.fst("50.fst"),
  read.fst100 = read.fst("100.fst"),
  read_rds = readRDS("0.rds"),
  read_rdsc = readRDS("100.rds"),
  times = 1)

gives me:

Unit: milliseconds
        expr        min         lq       mean     median         uq        max neval
      read_f   353.0693   353.0693   353.0693   353.0693   353.0693   353.0693     1
     read_dt 13366.1288 13366.1288 13366.1288 13366.1288 13366.1288 13366.1288     1
    read.fst   238.9952   238.9952   238.9952   238.9952   238.9952   238.9952     1
  read.fst50   434.6473   434.6473   434.6473   434.6473   434.6473   434.6473     1
 read.fst100   722.9549   722.9549   722.9549   722.9549   722.9549   722.9549     1
    read_rds   684.6390   684.6390   684.6390   684.6390   684.6390   684.6390     1
   read_rdsc  1938.6857  1938.6857  1938.6857  1938.6857  1938.6857  1938.6857     1

The maximum read speed measured is 1e-9 * object.size(dt) / 0.239 = 1.51 GB/s. And for writing, compression is so fast that the compress = 50 variant is actually faster than the uncompressed variant.
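
In runnable form, that back-of-the-envelope throughput calculation is simply (using the read.fst timing from the table above):

# bytes occupied by dt, divided by the ~0.239 s read.fst time, in GB/s
1e-9 * as.numeric(object.size(dt)) / 0.239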

The trick here is that with a data.frame, column a is now a factor instead of a character column (data.frame converts character columns to factors by default, via stringsAsFactors).
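
A minimal sketch of applying the same trick by hand, assuming a data.table like the one from the earlier benchmark (the output file name is illustrative):

# Convert the character columns to factors before writing, so each column
# serializes as integer codes plus a small level table instead of
# per-element strings.
dt$a <- as.factor(dt$a)
dt$f <- as.factor(dt$f)
write.fst(dt, "factored.fst", compress = 50)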

MarcusKlik · Mar 09 '17 20:03

I will address the 'slow character column' problem in a future release by applying a fast factorization method to character columns (using the boost library). When that is done, performance will be much closer to the factor-column results shown above!

MarcusKlik · Mar 09 '17 20:03