roaring icon indicating copy to clipboard operation
roaring copied to clipboard

Start validation

Open bearrito opened this issue 1 year ago • 3 comments

@lemire

  1. Starts work for https://github.com/RoaringBitmap/roaring/issues/400
  2. I read the original paper on Roaring as well as the one adding run containers, so I think I understand the very basics of the invariants to be checked. I'm sure there are other approaches.
  3. If the overall approach looks good, I'm planning on looking at more ways to corrupt the containers via code eg changing container sizes, breaking sort orders.
  4. I looked at the serialization methods, and have a rough understanding of how that works. Do you think it's worthwhile to try to corrupt a serialized byte stream and see if it's detected when de-serialized. I think that would be moderately difficult for me to do but might be a good test.

bearrito avatar May 01 '24 00:05 bearrito

It looks like the tests fail. Can you have a look?

This is quite good overall. See my minor comments.

Very useful contribution,

lemire avatar May 01 '24 01:05 lemire

I looked at the serialization methods, and have a rough understanding of how that works. Do you think it's worthwhile to try to corrupt a serialized byte stream and see if it's detected when de-serialized. I think that would be moderately difficult for me to do but might be a good test.

It could be interesting to try to deserialize garbage content and check that you do get errors.

lemire avatar May 01 '24 01:05 lemire

Could you help me understand this invariant

If a run container has more than 4096 distinct values, then it must have no more than 2047 runs, otherwise the number
of runs must be less than half the number of distinct values

What is distinct values here? Does this mean if the sum of the runlength for all intervals is greater than 4096 the len(iv) <= 2047?

bearrito avatar May 02 '24 15:05 bearrito

@lemire Anything else on this? I'm going on holiday and am closing out my open pr's

bearrito avatar May 23 '24 13:05 bearrito

@bearrito If you recommend merging this, I will!!!

lemire avatar May 23 '24 13:05 lemire

@lemire Can you comment on my question above on the run-length invariant. I think I can get one more set of tests of that if I understand better.

After that, I'll feel good merging.

bearrito avatar May 23 '24 22:05 bearrito

@bearrito

a container has between 1 and 2^16 distinct values (in the set 0,1...,2^16-1)
if number of distinct values in the container >= 2048 then
    check that the number of runs is no more than 2047
    (otherwise you could use a bitset container)
else
    check that the number of runs < (number of distinct values) / 2
    (otherwise you could use an array container)

lemire avatar May 24 '24 00:05 lemire

Counting the number of distinct values in a run container is basically just the sum of the lengths of the runs.... (remember that the runs are non-overlapping)

lemire avatar May 24 '24 00:05 lemire

Added in distinct value sum/run cardinality logic.

I don't think I can add anymore test pressure on this.

bearrito avatar May 25 '24 03:05 bearrito

Running tests. I expect to merge.

lemire avatar May 25 '24 04:05 lemire

@lemire will do.

bearrito avatar May 26 '24 23:05 bearrito

@lemire @Oppen

Added extra invariants.

Added support for reading from Frozen. Adding tests for Frozen.

bearrito avatar May 30 '24 00:05 bearrito

Running tests.

lemire avatar May 30 '24 15:05 lemire

Merging.

lemire avatar Jun 07 '24 14:06 lemire