
compression size on highly redundant data

Open arunsrinivasan opened this issue 8 years ago • 5 comments

Hi Mark, I was wondering if you have an explanation as to why compressing this particular data.frame with fst seems to end up with a larger file than R's native (rda) format. The entropy of each column is the minimum it could be... Do you think there's room for improvement in such cases?

require(fst) # CRAN  version
df <- data.frame(
        x=rep(1, 1e8),
        y=rep(2, 1e8),
        z=rep(3, 1e8)
      )

fst <- tempfile()
rda <- tempfile()

write.fst(df, fst, compress=100) # 2s
save(list="df", file=rda)        # 22s

file.info(fst)$size/1024 # 5102.4 KB
file.info(rda)$size/1024 # 3410.6 KB

Thank you.

arunsrinivasan avatar Oct 26 '17 14:10 arunsrinivasan

Hi @arunsrinivasan, nice to hear from you and interesting question!

Compression in fst is implemented using the LZ4 and ZSTD compressors. Those compressors have reasonably fast compression speeds and extreme decompression speeds (a feature shared among all LZ-type compressors). Their compression ratio is lower than that of the compressors R uses internally, however (gzip, bzip2 and xz). In general, slower algorithms achieve better compression ratios; it's always a trade-off between speed and ratio of course :-) (but LZ4 and ZSTD perform very well in that trade-off).

But you are right, you would expect a compressor to be able to do more with such low-entropy data! The problem is that doubles are very hard to compress effectively, as there is hardly any relation between the bytes of doubles that are close in value (due to the IEEE 754 format). For example, I found that applying bit-shifters before compression (to group corresponding bytes from a double) hardly does anything for compression (but takes CPU power).
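As a quick illustration of that (plain base R, nothing fst-specific), compare the raw IEEE 754 bytes of two doubles that are close in value with the bytes of the corresponding integers:

# two nearby doubles share almost none of their mantissa bytes,
# which gives an LZ-style compressor little to work with
writeBin(1.1, raw())
#> [1] 9a 99 99 99 99 99 f1 3f   (on a little-endian machine)
writeBin(1.2, raw())
#> [1] 33 33 33 33 33 33 f3 3f

# the same values as integers have a simple, repetitive byte layout
writeBin(1L, raw())
#> [1] 01 00 00 00
writeBin(2L, raw())
#> [1] 02 00 00 00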

With integers, the picture completely changes:

df <- data.frame(
  x = rep(1L, 1e8),
  y = rep(2L, 1e8),
  z = rep(3L, 1e8)
)

# use 4 threads
fst::fstsetthreads(4)
#> [1] 8

bench_save <- microbenchmark::microbenchmark(
  save(list = "df", file = "1.rda"),
  times = 1
)

bench_fst <- microbenchmark::microbenchmark(
  fst::write.fst(df, "1.fst", compress = 100),
  times = 1
)

# compression ratio save (in percentage)
100 * file.size("1.rda") / as.numeric(object.size(df))
#> [1] 0.09707268

# compression ratio fst (in percentage)
100 * file.size("1.fst") / as.numeric(object.size(df))
#> [1] 0.1953535

# compressed write speed save in GB/s
as.numeric(object.size(df)) / bench_save$time
#> [1] 0.09628579

# compressed write speed fst in GB/s
as.numeric(object.size(df)) / bench_fst$time
#> [1] 2.189635

So fst serializes your table at more than 2 GB/s while save measures around 100 MB/s, a factor of about 20 faster. The compression ratios are about 0.20 and 0.096 percent (factors of roughly 500 and 1000), much better than with doubles.

Perhaps there are better compressors out there for doubles; I guess your example shows that there is much room for improvement!

fst uses a sort of plug-in system for compressors, so a new compressor can be added quite easily. If we can talk to the compressor using the specific function format defined in fstcore, it can be used (and most compressors have a similar interface). Multi-threading is done in fst itself, so even single-threaded compressors can be used to compress your table with multiple threads (fst stores data in 16 kB blocks, which can be processed in parallel).
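As a rough sketch of that block idea (plain base R and gzip, not fst's actual code path), you can split a serialized vector into 16 kB blocks and compress each block independently; it is this independence that makes parallel processing and random access possible:

x <- serialize(rep(1L, 1e6), NULL)
block_size <- 16 * 1024

# indices of each 16 kB block
idx <- split(seq_along(x), ceiling(seq_along(x) / block_size))

# compress every block on its own (memCompress is base R's in-memory gzip)
compressed <- lapply(idx, function(i) memCompress(x[i], type = "gzip"))

# total compressed size relative to the uncompressed input
sum(lengths(compressed)) / length(x)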

For each type, a map is defined between the compression setting (0:100) and the compressors used; for integers, for example, that map is defined here. You can see that the map requires very little code and can be changed easily. Also, more than one compressor can be used for different ranges of compression settings, and multiple compressors can be mixed during writing (using the StreamCompositeCompressor class also used in the code example).

This system enables fine-tuning of compression settings per column type. So we can use a different compressor for integers and doubles if we want; each column type has its own optimal compression settings. Currently the settings are far from optimal; much more testing has to be done to determine the optimal settings for each column type.
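If you want to get a feel for those trade-offs yourself, the public compress_fst() interface (rather than the internal compressor map) lets you compare compressors and settings on the same data, for example:

library(fst)

x <- serialize(rep(1L, 1e7), NULL)

# same data, different compressor / setting combinations (sizes in bytes)
length(compress_fst(x, compressor = "LZ4",  compression = 0))
length(compress_fst(x, compressor = "LZ4",  compression = 100))
length(compress_fst(x, compressor = "ZSTD", compression = 0))
length(compress_fst(x, compressor = "ZSTD", compression = 100))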

Other compressors might be interesting to investigate in the future. For example, dictionary-based compressors are much better at compressing string columns, so they could be used there (ZSTD also has a dictionary mode, by the way).

A very long answer to a short question :-). Any ideas on future compression modes are very welcome (would a high-compression option be useful?)

greetings, Mark

MarcusKlik avatar Oct 26 '17 21:10 MarcusKlik

Mark, the explanation is quite clear and makes sense. It's also quite nice that new algorithms can simply be plugged in, and that different compressors can be used on different columns. I haven't done a literature survey of compression algorithms in a long time, but I will see if I can find something as well.

Given the speed of fst, higher compression could be very useful, as it might be okay to compromise on writes/reads (going from a 100th of a second to, say, a 10th of a second) if it could reduce file size further (in cases where one deals with a lot of files, for example).

I'll leave closing this issue to you. If you think it could have a better title or should be restructured, feel free to change it.

Thank you once again for the quick and clear reply.

arunsrinivasan avatar Oct 27 '17 09:10 arunsrinivasan

@MarcusKlik

Based on the experiment below, I guess that the size of the written file depends on more than just the compression method. Is it the 16 kB block size or something else? Would it make sense to expose this parameter in the write.fst command for tuning?

int <- rep(1L, 1e8)
int <- serialize(int, NULL)

object.size(compress_fst(int, compressor = "ZSTD", compression = 100, hash = TRUE))
#> 34912 bytes
object.size(compress_fst(int, compressor = "LZ4", compression = 100, hash = TRUE))
#> 1569760 bytes

num <- rep(1, 1e8)
num <- serialize(num, NULL)

object.size(compress_fst(num, compressor = "ZSTD", compression = 100, hash = TRUE))
#> 68872 bytes
object.size(compress_fst(num, compressor = "LZ4", compression = 100, hash = TRUE))
#> 3138616 bytes

robocop-bob avatar Nov 26 '18 15:11 robocop-bob

Hi @robbig2871, thanks for your question.

Are you asking about the differences in compression ratio between LZ4 and ZSTD, or between compress_fst() and write_fst()? As your example shows, the differences can be quite significant and depend a lot on the specific data used.

Also, the compression techniques used in write_fst() are different from those used in compress_fst(). For example, your vectors give different compression ratios when stored on disk in an fst file:

library(fst)

# int vec
int <- rep(1L, 1e8)
int_raw <- serialize(int, NULL)

# 400 MB
object.size(int_raw)
#> 400000072 bytes

# double vec
num <- rep(1, 1e8)
num_raw <- serialize(num, NULL)

# 800 MB
object.size(num_raw)
#> 800000072 bytes

tmp <- tempfile()
write_fst(data.frame(Int = int), tmp, compress = 100)

# size of int stored in fst file
file.size(tmp)
#> [1] 757199

# size of num stored in fst file
write_fst(data.frame(Num = num), tmp, compress = 100)
file.size(tmp)
#> [1] 1611689

As you can see, the on-disk compression ratio at the maximum compression setting is in between the ratios from your experiment (note that the fst file also contains metadata, which increases its size).

For integer and numerical values, write_fst() applies a bit-shifter before compression. In general, this helps the faster compression settings reach better compression ratios, but it also makes it harder to directly compare the compression ratios of compress_fst() and write_fst().
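To get a feel for what such a byte-grouping step does, here is a toy base-R sketch (not fst's internal bit-shifter): it regroups the serialized bytes of an integer vector so that all first bytes come first, then all second bytes, and so on, and compares the gzip sizes of the two layouts (the gain is data-dependent, as noted above):

# toy byte-grouping sketch, not fst's actual implementation
vals <- sample(1:1000, 1e5, replace = TRUE)
x <- writeBin(vals, raw())   # 4 bytes per integer, native endianness

# regroup: all byte-1 positions, then all byte-2 positions, etc.
idx <- as.vector(t(matrix(seq_along(x), nrow = 4)))
shuffled <- x[idx]

# compare gzip sizes of the interleaved and regrouped layouts
length(memCompress(x, type = "gzip"))
length(memCompress(shuffled, type = "gzip"))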

Also, the file format uses 16 kB blocks, as you mentioned, which will reduce the compression ratio a little (but enables solid random access).
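That random access is visible in the public API as well; for example, reusing the tmp file written above, read_fst() can retrieve an arbitrary row range without decompressing the rest of the file:

# read only rows 1000000 to 1000009 from the compressed file written above
read_fst(tmp, from = 1000000, to = 1000009)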

In contrast, compress_fst() divides the input vector into 48 blocks, with the block size chosen to fall between certain boundaries. So the blocks for in-memory compression can be much bigger (up to almost 2 GB per block), which certainly helps compression at the higher settings.

I hope this answers some of your questions!

MarcusKlik avatar Nov 26 '18 22:11 MarcusKlik

There may be considerable value in simply checking whether a column of doubles has only a few unique values before encoding it with the generic doubles compressor. I find that you can get substantial space savings by re-encoding the column as integers.

> set.seed(10); u = rnorm(500); x = sample(u, 10e6, rep = T)
> write.fst(as.data.frame(x), "/scratch/test-float.fst", compress = 100)
> write.fst(as.data.frame(match(x, unique(x))), "/scratch/test-int.fst", compress = 100)
> file.size("/scratch/test-float.fst") / (2^20)
[1] 30.84678
> file.size("/scratch/test-int.fst") / (2^20)
[1] 11.98134
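(For a lossless round trip you would of course also need to keep the unique values themselves; a rough sketch of that idea, with the extra file path chosen purely for illustration:)

> lev = unique(x)
> write.fst(as.data.frame(lev), "/scratch/test-levels.fst", compress = 100)
> codes = read.fst("/scratch/test-int.fst")[[1]]
> identical(read.fst("/scratch/test-levels.fst")[[1]][codes], x)
[1] TRUE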

Kodiologist avatar May 17 '21 13:05 Kodiologist