
Compressed array becomes bigger than original

Open koron opened this issue 12 years ago • 7 comments

There are some cases where the compressed array becomes bigger than the original. You can see a sample of this here: https://github.com/koron/JavaFastPFOR/compare/adhoc-tests Of course, I know this is very rare and not a target of this library, but it may cause a kind of inconvenience for users.

There are some workarounds:

  • Document it.
  • Add a utility method that calculates the preferred length for the output array.
  • Introduce an auto-extendable buffer, perhaps called IntArrayOutputStream or similar.

Which workaround should we choose?

koron avatar Sep 04 '13 19:09 koron

I forgot a 4th workaround: DO NOTHING :tongue:

koron avatar Sep 04 '13 19:09 koron

In case this is useful to you, I have now made you a collaborator on my version of the project. This means that you can work directly on the main code without going through a pull request. (I trust you.)

Let us review this problem.

First of all, you can use a fairly large output buffer... one that is larger than the input buffer. Say the input buffer contains N integers; you can just make sure that the output buffer contains at least 1.1 * N + 1024 integers, and you are good with high probability.
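The sizing rule of thumb above can be sketched as a one-line helper. The constants 1.1 and 1024 come straight from this comment; they are heuristics for the discussion, not guarantees published by the library.

```java
// Sketch: compute a conservative output-buffer size for n input integers,
// using the rule of thumb from this thread (1.1 * n + 1024 integers).
// The constants are heuristics, not guarantees from JavaFastPFOR.
public class BufferSizing {
    static int safeOutputLength(int n) {
        return (int) (1.1 * n) + 1024;
    }

    public static void main(String[] args) {
        System.out.println(safeOutputLength(1_000_000)); // 1101024
        System.out.println(safeOutputLength(0));         // 1024
    }
}
```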

When the output buffer is too small, an exception is thrown. One can catch this exception and realize that the output buffer was too small... one can then extend the buffer and try again. Yes, it is ugly... but in a given application, this can be made very unlikely.
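The catch-and-retry pattern could look roughly like the sketch below. `Compressor` here is a made-up stand-in for an IntegerCODEC-style call, not the actual JavaFastPFOR API; the point is only the grow-and-retry loop around the exception.

```java
import java.util.Arrays;

// Sketch of the catch-and-retry pattern described above. "Compressor" is a
// hypothetical stand-in for an IntegerCODEC-style call that throws when the
// output array is too small; it is NOT the real JavaFastPFOR interface.
public class RetryCompress {
    interface Compressor {
        // Returns the number of ints written to out; throws
        // ArrayIndexOutOfBoundsException when out is too small.
        int compress(int[] in, int[] out);
    }

    static int[] compressWithRetry(Compressor c, int[] in, int initialSize) {
        int size = initialSize;
        while (true) {
            int[] out = new int[size];
            try {
                int written = c.compress(in, out);
                return Arrays.copyOf(out, written); // trim to actual size
            } catch (ArrayIndexOutOfBoundsException e) {
                size *= 2; // output was too small: grow and try again
            }
        }
    }

    public static void main(String[] args) {
        // Toy "compressor" that just copies the input, to exercise the retry path.
        Compressor copy = (in, out) -> {
            for (int i = 0; i < in.length; i++) out[i] = in[i]; // may throw
            return in.length;
        };
        int[] result = compressWithRetry(copy, new int[100], 8); // starts too small
        System.out.println(result.length); // 100
    }
}
```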

How can we make this less ugly?

  1. I think keeping things as they are is a good option. That is, you can add a new approach, but I don't think that the current approach is broken. For many people, it could be the best approach.
  2. Yes, you could have a helper class that tells you how bad things could be for a given IntegerCODEC. This would require us to look at the IntegerCODEC and compute a bound on how badly the compression can go. This is not hard and could be very useful.
  3. A good approach from a software engineering perspective would be to write to an extensible buffer, but this is likely to generate quite a bit of overhead. It is also not the most flexible.
  4. I think that the very best approach, from a software engineering perspective, is to use IntInputStream and IntOutputStream. This could be done as part of a distinct package for people who want nice streams.
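The auto-extendable buffer from option 3 could be sketched like this, mirroring how java.io.ByteArrayOutputStream grows. The class and method names are invented for illustration; nothing like this exists in JavaFastPFOR.

```java
import java.util.Arrays;

// Minimal sketch of the "IntArrayOutputStream" idea from option 3: an
// auto-growing int buffer. Class and method names are hypothetical,
// not part of JavaFastPFOR.
public class IntArrayOutputStream {
    private int[] buf = new int[16];
    private int count = 0;

    public void write(int v) {
        if (count == buf.length) {
            buf = Arrays.copyOf(buf, buf.length * 2); // amortized O(1) growth
        }
        buf[count++] = v;
    }

    public int[] toIntArray() {
        return Arrays.copyOf(buf, count); // trimmed copy of what was written
    }

    public static void main(String[] args) {
        IntArrayOutputStream out = new IntArrayOutputStream();
        for (int i = 0; i < 1000; i++) out.write(i);
        int[] a = out.toIntArray();
        System.out.println(a.length + " " + a[999]); // 1000 999
    }
}
```

The overhead lemire mentions comes from the per-write bounds check and the occasional copy during growth, which the plain array interface avoids.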

To minimize overhead, you might want to have an API that allows you to read and write data in blocks. For example, you might want to have something like:

ReadableIntView readBlock(int number);

and

WritableIntView writeBlock(int number)

(This is just a vague idea.)

The goal would be that the overhead of using a stream interface would be tiny compared to the array-based interface.

Getting this right might require non-trivial engineering, but I think it is doable.
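Since the comment above calls this a vague idea, here is one equally tentative reading of it: a stream hands out fixed-size windows over its backing storage, so callers pay one method call per block rather than per integer. Every name here (ReadableIntView, readBlock, ArrayIntStream) is hypothetical.

```java
// Very rough sketch of the block-oriented view API floated above. All names
// are hypothetical; the point is that a stream can hand out fixed-size
// windows so that per-integer call overhead is avoided.
public class BlockViews {
    interface ReadableIntView {
        int get(int i);
        int length();
    }

    // A stream that serves blocks out of a backing array without copying.
    static class ArrayIntStream {
        private final int[] data;
        private int pos = 0;

        ArrayIntStream(int[] data) { this.data = data; }

        ReadableIntView readBlock(int number) {
            final int start = pos;
            final int len = Math.min(number, data.length - start);
            pos += len;
            return new ReadableIntView() {
                public int get(int i) { return data[start + i]; }
                public int length() { return len; }
            };
        }
    }

    public static void main(String[] args) {
        ArrayIntStream s = new ArrayIntStream(new int[]{1, 2, 3, 4, 5});
        ReadableIntView b = s.readBlock(3);
        System.out.println(b.length() + " " + b.get(2)); // 3 3
        System.out.println(s.readBlock(3).length());     // 2 (only 2 ints left)
    }
}
```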

lemire avatar Sep 04 '13 20:09 lemire

I'll make some trials for this, with these priorities:

  1. Don't break or delete anything.
  2. Keep memory and speed efficiency.
  3. Improve usability and maintainability.

koron avatar Sep 05 '13 11:09 koron

We are in agreement.

lemire avatar Sep 05 '13 14:09 lemire

I repeatedly get a compressed array larger than the original while compressing random integers. Tested with versions 0.1.5 and 0.1.6 and a number of codecs; for all of them the output size is identical. I suspect I'm missing something about how to use your library? Yes, I've checked: there is nothing that looks like padding, no leading or trailing zeros. Here is my code:

  val codecs = Array(
    new IntegratedComposition(new IntegratedBinaryPacking(),
      new IntegratedVariableByte()),
    new JustCopy(),
    new VariableByte(),
    new IntegratedVariableByte(),
    new Composition(new BinaryPacking(), new VariableByte()),
    new Composition(new NewPFD(), new VariableByte()),
    new Composition(new NewPFDS16(), new VariableByte()),
    new Composition(new OptPFDS9(), new VariableByte()),
    new Composition(new OptPFDS16(), new VariableByte()),
    new Composition(new FastPFOR128(), new VariableByte()),
    new Composition(new FastPFOR(), new VariableByte()),
    new Simple9(),
    new Simple16(),
    new Composition(new XorBinaryPacking(), new VariableByte()),
    new Composition(new DeltaZigzagBinaryPacking(),
      new DeltaZigzagVariableByte()))

  def test2() = {
    val N = 100
    val r = new Random(System.nanoTime())
    val keys = (1 to N).map(i => {
      r.nextInt(Int.MaxValue)})
      .distinct.toArray
    println(keys.toList)
    println(s"size before compression ${keys.size}")
    codecs.map(codec => {
      print(s"${codec.toString} , ")
      val iic = new IntegratedIntCompressor()
      var start = System.nanoTime()
      val compressed = iic.compress(keys) // compressed array
      val encodeTime = System.nanoTime() - start
      print(s" $encodeTime , ")
      print(s" ${compressed.size} , ")
      start = System.nanoTime()
      val recov = iic.uncompress(compressed)
      val decodeTime = System.nanoTime() - start
      println(s" $decodeTime , ")
      //println(compressed.toList)
    })
  }

kk00ss avatar Dec 09 '15 18:12 kk00ss

@kk00ss

Please consult our README on what the library does:

https://github.com/lemire/JavaFastPFOR#what-does-this-do

lemire avatar Dec 09 '15 18:12 lemire

Sorry, my bad.

kk00ss avatar Dec 09 '15 18:12 kk00ss