fst
Add a Python module to read/write from Python pandas
Hi, nice package! It would be a great competitor to the feather package if it were compatible with Python pandas data frames. Any plans to make it available in Python?
Cheers, Benoit
PS: my own benchmarks
> r_bench <- microbenchmark(
+ read_f = {dt1 <- read_feather(path = filename)},
+ read_dt = {dt1 <- fread(file = gsub(".feather", ".csv", filename), showProgress = FALSE)},
+ read_fst = {dt2 <- read.fst(path = gsub(".feather", ".fst", filename))},
+ read_fstc = {dt2 <- read.fst(path = gsub(".feather", ".fstc", filename))},
+ read_rds = {dt2 <- readRDS(file = gsub(".feather", ".rds", filename))},
+ read_rdsc = {dt2 <- readRDS(file = gsub(".feather", ".rdsc", filename))},
+ times = 3)
>
> print(r_bench)
Unit: milliseconds
expr min lq mean median uq max neval
read_f 73.49535 74.38310 74.80852 75.27085 75.46511 75.65938 3
read_dt 409.07989 410.28315 411.33413 411.48641 412.46125 413.43609 3
read_fst 67.21488 69.68649 74.13367 72.15810 77.59306 83.02803 3
read_fstc 113.58359 113.87905 114.01423 114.17451 114.22955 114.28458 3
read_rds 363.55270 366.95543 370.44090 370.35816 373.88500 377.41183 3
read_rdsc 571.20738 571.27464 575.87312 571.34189 578.20598 585.07008 3
> w_bench <- microbenchmark(
+ write_f = {write_feather(x = dt, path = filename)},
+ write_dt = {fwrite(dt, file = gsub(".feather", ".csv", filename))},
+ write_fst = {write.fst(x = dt, path = gsub(".feather", ".fst", filename))},
+ write_fstc = {write.fst(x = dt, path = gsub(".feather", ".fstc", filename),compress = 100)},
+ write_rds = {saveRDS(object = dt, file = gsub(".feather", ".rds", filename),compress = FALSE)},
+ write_rdsc = {saveRDS(object = dt, file = gsub(".feather", ".rdsc", filename),compress = TRUE)},
+ times = 3)
>
> print(w_bench)
Unit: milliseconds
expr min lq mean median uq max neval
write_f 77.57399 81.01968 84.72863 84.46536 88.30596 92.14655 3
write_dt 65.89461 69.54576 538.90557 73.19692 775.41105 1477.62517 3
write_fst 73.60318 75.90385 626.80981 78.20452 903.41312 1728.62172 3
write_fstc 202.33712 211.38273 220.21007 220.42834 229.14654 237.86473 3
write_rds 329.07046 3128.41469 4061.86755 5927.75891 5928.26610 5928.77328 3
write_rdsc 2436.99475 2443.04194 2447.12685 2449.08913 2452.19291 2455.29668 3
Hi @BenoitLondon, thanks for submitting your issue and your benchmarks! I would be very interested in the exact data set that you used for them.
Regarding the Python request: it is definitely the idea to make fst available for pandas data structures. However, additional features will first be developed on the R platform before porting to Python (such as multi-threaded compression, parallel sorting, row streaming, appending data to existing fst files, and the use of SSE2 instructions).
I will also start to refactor more of the core functionality of the fst package into an independent C++ module in coming versions, so that migrating to Python will take less effort.
In its current state, the fst package gets some of its speed gains (for compression) from direct bit-mapping of R's (sometimes peculiar) memory structures into a compressed format. These bit-mappers will have to be rewritten into a format suitable for Python, which will take some effort and time.
But your feature request is definitely on the list!
Hi, thanks for the quick answer! The advantage over feather is that files are compressed while staying in the same speed ballpark.
For these benchmarks I used a dummy data.table:
filename <- "../data/test.feather"
dt <- data.table(a = sample(letters, 1e6, replace = TRUE), b= round(100 * runif(1e6)), c = 1:(1e6), d = rnorm(1e6), e =1, f = c(NA, "adsasas"))
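For completeness, a minimal sketch of the setup these benchmarks assume (the package names are inferred from the function calls below, and the derived files are written once so the read benchmark has something to read):
library(data.table)      # data.table, fread, fwrite
library(feather)         # read_feather, write_feather
library(fst)             # read.fst, write.fst
library(microbenchmark)  # microbenchmark

# Write the baseline files once before benchmarking the reads
write_feather(x = dt, path = filename)
fwrite(dt, file = gsub(".feather", ".csv", filename))
write.fst(x = dt, path = gsub(".feather", ".fst", filename))
write.fst(x = dt, path = gsub(".feather", ".fstc", filename), compress = 100)
saveRDS(dt, file = gsub(".feather", ".rds", filename), compress = FALSE)
saveRDS(dt, file = gsub(".feather", ".rdsc", filename), compress = TRUE)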
I also tested with a real file I use, which is much bigger (574,607 obs. of 78 variables) but probably sparser, and got the results below, where (compressed) fst is the fastest to read and write.
The csv size here is 182 MB, the fst 276 MB, and the fstc 32.3 MB.
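As a quick sketch, these on-disk sizes can be checked with base R's file.size(), assuming the files produced by the write benchmark are still on disk:
# Sketch: sizes (in MB) of the csv, uncompressed fst and compressed fst files
sizes <- sapply(c(".csv", ".fst", ".fstc"),
                function(ext) file.size(gsub(".feather", ext, filename)))
round(sizes / 1e6, 1)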
> r_bench <- microbenchmark(
+ read_f = {dt1 <- read_feather(path = filename)},
+ read_dt = {dt1 <- fread(file = gsub(".feather", ".csv", filename), showProgress = FALSE)},
+ read_fst = {dt2 <- read.fst(path = gsub(".feather", ".fst", filename))},
+ read_fstc = {dt2 <- read.fst(path = gsub(".feather", ".fstc", filename))},
+ read_rds = {dt2 <- readRDS(file = gsub(".feather", ".rds", filename))},
+ read_rdsc = {dt2 <- readRDS(file = gsub(".feather", ".rdsc", filename))},
+ times = 3)
> print(r_bench, signif = 5)
Unit: milliseconds
expr min lq mean median uq max neval
read_f 778.32 788.25 793.8522 798.19 801.62 805.05 3
read_dt 2154.20 2248.70 2307.9869 2343.30 2384.90 2426.50 3
read_fst 424.97 583.92 645.9801 742.87 756.49 770.11 3
read_fstc 717.11 732.85 833.5493 748.59 891.77 1035.00 3
read_rds 1911.60 1920.70 1928.6099 1929.80 1937.10 1944.50 3
read_rdsc 3259.70 3322.20 3360.1304 3384.80 3410.40 3435.90 3
> w_bench <- microbenchmark(
+ write_f = {write_feather(x = dt, path = filename)},
+ write_dt = {fwrite(dt, file = gsub(".feather", ".csv", filename))},
+ write_fst = {write.fst(x = dt, path = gsub(".feather", ".fst", filename))},
+ write_fstc = {write.fst(x = dt, path = gsub(".feather", ".fstc", filename),compress = 100)},
+ write_rds = {saveRDS(object = dt, file = gsub(".feather", ".rds", filename),compress = FALSE)},
+ write_rdsc = {saveRDS(object = dt, file = gsub(".feather", ".rdsc", filename),compress = TRUE)},
+ times = 3)
>
> print(w_bench, signif = 3)
Unit: seconds
expr min lq mean median uq max neval
write_f 2.37 2.37 2.454158 2.38 2.50 2.62 3
write_dt 1.18 1.43 1.515637 1.68 1.68 1.69 3
write_fst 2.31 2.35 2.391432 2.38 2.43 2.48 3
write_fstc 1.32 1.34 1.353537 1.36 1.37 1.38 3
write_rds 3.84 4.99 5.594808 6.14 6.47 6.81 3
write_rdsc 8.42 8.44 8.466584 8.46 8.49 8.52 3
Hi @BenoitLondon, thanks a lot for that! It's very interesting to see the large variation in benchmark results between systems. I ran your benchmark and noticed that the results are completely dominated by the character column (a) in your data.table. These columns are by far the slowest to process, because in R each string in a character vector has its own memory address. Columns of the other basic types occupy a contiguous chunk of memory, which is much faster to access (from C++). Feather also has this problem with character columns. Notice, however, how things change when you turn your data.table into a data.frame (and for a data set 10 times as large):
dt <- data.frame(
  a = sample(letters, 1e7, replace = TRUE),
  b = round(100 * runif(1e7)),
  c = 1:1e7,
  d = rnorm(1e7),
  e = 1,
  f = c(NA, "adsasas"))
microbenchmark(
write_f = write_feather(dt, "0.feather"),
write_dt = fwrite(dt, "0.csv"),
write.fst = write.fst(dt, "0.fst"),
write.fst50 = write.fst(dt, "50.fst", compress = 50),
write.fst100 = write.fst(dt, "100.fst", compress = 100),
write_rds = saveRDS(dt, "0.rds", compress = FALSE),
write_rdsc = saveRDS(dt, "100.rds", compress = TRUE),
times = 1)
gives:
Unit: milliseconds
expr min lq mean median uq max neval
write_f 659.3132 659.3132 659.3132 659.3132 659.3132 659.3132 1
write_dt 894.2941 894.2941 894.2941 894.2941 894.2941 894.2941 1
write.fst 519.5184 519.5184 519.5184 519.5184 519.5184 519.5184 1
write.fst50 339.5828 339.5828 339.5828 339.5828 339.5828 339.5828 1
write.fst100 1792.4256 1792.4256 1792.4256 1792.4256 1792.4256 1792.4256 1
write_rds 848.7571 848.7571 848.7571 848.7571 848.7571 848.7571 1
write_rdsc 24482.1443 24482.1443 24482.1443 24482.1443 24482.1443 24482.1443 1
and
microbenchmark(
read_f = read_feather("0.feather"),
read_dt = fread("0.csv"),
read.fst = read.fst("0.fst"),
read.fst50 = read.fst("50.fst"),
read.fst100 = read.fst("100.fst"),
read_rds = readRDS("0.rds"),
read_rdsc = readRDS("100.rds"),
times = 1)
gives me:
Unit: milliseconds
expr min lq mean median uq max neval
read_f 353.0693 353.0693 353.0693 353.0693 353.0693 353.0693 1
read_dt 13366.1288 13366.1288 13366.1288 13366.1288 13366.1288 13366.1288 1
read.fst 238.9952 238.9952 238.9952 238.9952 238.9952 238.9952 1
read.fst50 434.6473 434.6473 434.6473 434.6473 434.6473 434.6473 1
read.fst100 722.9549 722.9549 722.9549 722.9549 722.9549 722.9549 1
read_rds 684.6390 684.6390 684.6390 684.6390 684.6390 684.6390 1
read_rdsc 1938.6857 1938.6857 1938.6857 1938.6857 1938.6857 1938.6857 1
The maximum read speed measured is 1e-9 * object.size(dt) / 0.239, which equals about 1.51 GB/s. And for writing, compression is so fast that the compress = 50 variant is actually faster than the uncompressed variant.
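Spelling that arithmetic out as a quick sanity check (a sketch; object.size(dt) comes to roughly 0.36 GB for this data.frame, which is what the 1.51 GB/s figure implies):
# Rough read throughput: object size in GB divided by the fastest read.fst timing above
as.numeric(object.size(dt)) * 1e-9 / 0.239  # ~1.5 GB/s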
The trick here is that with a data.frame, column a is now a factor instead of a character column.
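A minimal sketch of applying the same trick by hand while keeping a data.table (assuming the 1e6-row dt and filename defined earlier in this thread):
# Sketch: convert the character columns to factors before writing, so the writer
# no longer has to chase a separate memory address for every string
dt[, a := as.factor(a)]
dt[, f := as.factor(f)]
write.fst(dt, gsub(".feather", ".fst", filename), compress = 50)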
I will address the 'slow character column problem' in a future release by using a fast factorization method on character columns (using the boost library). When that is done, performance for character columns will be much closer to what is shown above!