Start validation
@lemire
- Starts work for https://github.com/RoaringBitmap/roaring/issues/400
- I read the original paper on Roaring as well as the one adding run containers, so I think I understand the very basics of the invariants to be checked. I'm sure there are other approaches.
- If the overall approach looks good, I'm planning on looking at more ways to corrupt the containers via code eg changing container sizes, breaking sort orders.
- I looked at the serialization methods, and have a rough understanding of how that works. Do you think it's worthwhile to try to corrupt a serialized byte stream and see if it's detected when de-serialized. I think that would be moderately difficult for me to do but might be a good test.
It looks like the tests fail. Can you have a look?
This is quite good overall. See my minor comments.
Very useful contribution,
I looked at the serialization methods, and have a rough understanding of how that works. Do you think it's worthwhile to try to corrupt a serialized byte stream and see if it's detected when de-serialized. I think that would be moderately difficult for me to do but might be a good test.
It could be interesting to try to deserialize garbage content and check that you do get errors.
Could you help me understand this invariant
If a run container has more than 4096 distinct values, then it must have no more than 2047 runs, otherwise the number
of runs must be less than half the number of distinct values
What is distinct values here? Does this mean if the sum of the runlength for all intervals is greater than 4096 the len(iv) <= 2047?
@lemire Anything else on this? I'm going on holiday and am closing out my open pr's
@bearrito If you recommend merging this, I will!!!
@lemire Can you comment on my question above on the run-length invariant. I think I can get one more set of tests of that if I understand better.
After that, I'll feel good merging.
@bearrito
a container has between 1 and 2^16 distinct values (in the set 0,1...,2^16-1)
if number of distinct values in the container >= 2048 then
check that the number of runs is no more than 2047
(otherwise you could use a bitset container)
else
check that the number of runs < (number of distinct values) / 2
(otherwise you could use an array container)
Counting the number of distinct values in a run container is basically just the sum of the lengths of the runs.... (remember that the runs are non-overlapping)
Added in distinct value sum/run cardinality logic.
I don't think I can add anymore test pressure on this.
Running tests. I expect to merge.
@lemire will do.
@lemire @Oppen
Added extra invariants.
Added support for reading from Frozen. Adding tests for Frozen.
Running tests.
Merging.