hi_sparse_bitset icon indicating copy to clipboard operation
hi_sparse_bitset copied to clipboard

PartialOrd/Ord implementation possibility

Open 3Hren opened this issue 1 year ago • 8 comments

Hi again!

There is the final step to take full advantage of your great library.

Sometimes it is necessary to deduplicate bitsets or to build a mapping between bitsets and some values. There are two standard approaches: using HashMap or BTreeMap.

The first one requires to calculate bitset's hash, which, in case of large bitsets, can be painful. However, for BTreeMap it seems that a multilevel approach to comparing two bitsets will result in a very fast comparison, also lazy bitsets seems to perfectly fit in this case. Since the PartialEq is already implemented it seems that it will no harm to also implement PartialOrd/Ord. Or is there something that prevents this from being implemented?

3Hren avatar Jun 24 '24 17:06 3Hren

If you want to "deduplicate bitsets", why not substract? S1 - S2 will give you S1 without S2 elements.

If you want to map indices to values, you need https://github.com/tower120/hi_sparse_array . It is somewhat like hi_sparse_bitset, but last level points to the actual data. It is still in research phase.

And what is the meaning of sets order? When S1>S2?

tower120 avatar Jun 24 '24 18:06 tower120

By deduplicating/mapping I meant that suppose we have the following mapping:

{
  0b10101 -> 0,
  0b10101 -> 1,
  0b10001 -> 2,
  ...
  0b10001 -> 100500,
}

I would like to transform it in a such way:

{
  0b10101 -> [0, 1],
  0b10001 -> [2, 100500],
  ...
}

Using BTreeMap this is trivial.

By order I mean that in fact each bitset can be represented as a very large unsigned integer. So why not allow to comparing them?

3Hren avatar Jun 24 '24 18:06 3Hren

Looks like you need hi_sparse_array's multi_merge. There is a version of it now in repository, but I rewriting it to new more space efficient data structure.

Ord is probably possible. But looks like what you want to do will be inefficient (since you'll need to process each bitset one-by-one). Again, looks like you need merge. And I don't clearly see how to implement Ord... Especially for non TRUSTED_HIERARCHY. It's not trivial algorithm (maybe I'm wrong).

tower120 avatar Jun 24 '24 19:06 tower120

Well, if you "just" need Ord, and don't care about speed - you probably can compare block_iter() between two sets. (Don't forget to take block starts into count as well).

tower120 avatar Jun 24 '24 19:06 tower120

Isn't it enough to just to compare levels one-by-one? First try level0 indices, if they are equal then level1, and finally, if again equal - block-by-block? I'm not familiar with TRUSTED_HIERARCHY internals, will take a look.

Well, if you "just" need Ord, and don't care about speed - you probably can compare block_iter() between two sets. (Don't forget to take block starts into count as well).

Of course I care, that's why I am here ;)

Anyway, thanks for the answer!

3Hren avatar Jun 24 '24 19:06 3Hren

For TRUSTED_HIERARCHY maybe that is enough (I can't see that far without practical implementation).

But if you work, for example, with lazy(!) result of intersection - raised bits in level bitmasks may lead to EMPTY data blocks. That case is called non-trusted-hierarchy. You can't trust hierarchy and just compare level bitmasks. From Eq implementation experience thou, there is a "special case" when you compare TRUSTED_HIERARCHY to !TRUSTED_HIERARCHY - then some optimization possible. Otherwise - (as far as I remember) it ends up with plain data block comparison one-by-one.

tower120 avatar Jun 24 '24 19:06 tower120

It is just with multi_merge/fold_merge you basically OR bitmasks of all maps/arrays at once, and then just apply resolve function to intersected data items in the very last level. At least on paper that should be super-linearly faster then what you trying to do.

tower120 avatar Jun 24 '24 19:06 tower120

Ah, I see it's more complicated than I thought, interesting nonetheless. Thanks for the clarification!

3Hren avatar Jun 24 '24 20:06 3Hren

Isn't it enough to just to compare levels one-by-one? First try level0 indices, if they are equal then level1, and finally, if again equal - block-by-block?

I looked at that suggestion closely again. It is not possible to implement lexicographical compare in that way even for TRUSTED_HIERARCHY (like BitSet container). Because each bit of bitblock at each level shows if there is NON EMPTY block down below the hierarchy for some particular index range. So even if root bitblocks are equal - doesn't mean that it's childs are equal. BUT! If they're NOT equal - it does means that their childs not equal, because there is at least one index range that is empty in one bitset, but not empty at another.

The correct way to lexicographically compare bitsets:

  1. In lock step compare bitblocks from both bitsets for equality (super fast operation), until the very first that is not equal.
  2. Starting from that bitblock proceed comparing by index iterator - diverge bit will be in this bitblock for one of the bitset, and in this or in the next one for another bitset. By saying "this bitblock" - I mean bitblock at "this" index.

This means that there is no best-case scenario where you can cmp two bitsets just by looking at their root blocks, or even somehow find tree subsection where it diverge FIRST TIME (this is crucial limitation for lexicographical comparison).

This Ord be done manually, if needed, with block_iter() - there is nothing lib can speed up in this operation.


hibitset, roaring does not implement Ord - since it can't accelerate it. I think hi_sparse_bitset should follow the same approach. At least for now.

tower120 avatar Jul 27 '25 22:07 tower120