intcomp icon indicating copy to clipboard operation
intcomp copied to clipboard

Is it possible to implement Search/Seek methods?

Open AskAlexSharov opened this issue 2 years ago • 4 comments

Is it theoretically possible? Example:

comp := intcomp.compress([10,20,30])

// get number by value
incomp.Search(comp, 10) // 1
incomp.Search(comp, 20) // 2
incomp.Search(comp, 30) // 3

// find value >= v
incomp.Seek(comp, 8) // 10
incomp.Seek(comp, 10) // 10
incomp.Search(comp, 17) // 20
incomp.Search(comp, 21) // 30
incomp.Search(comp, 31) // -1

AskAlexSharov avatar Mar 10 '23 05:03 AskAlexSharov

Perhaps, it would be helpful to have an iterating method (like (sync.Map).Range), so that reslicing, searching, seeking, etc... is possible in streaming way (without allocating a full decompressed list). Though, performance might be not great compared to specialized functions.

vearutop avatar Mar 10 '23 10:03 vearutop

What about implementing an iterator? I find it more practical to use when you need to iterate over multiple arrays at the same time.

Something like this:

it:= NewInt32Iterator(comp)
for v, valid := it.Next(); valid; v, valid = it.Next() {
   // do whatever with v...
}

A reverse iterator would also be useful, but not sure if it's possible to implement efficiently.

Arbitrary reslicing, is not feasible since encoding is performed by blocks of 128 or 256 values.

ronanh avatar Mar 10 '23 11:03 ronanh

Having way for stream-style iteration - useful feature (and for me also).

But I did mean another thing: to do Search/Seek without touching/decompress all previous values. Use case: imagine I do store very large compressed immutable list in MMAP file. As less places in compressed form I will touch - less disk reads (and RAM) will happen. So, in this case - iterator is the worst case (It will brude-force touch all compressed bytes and read them from disk).

In my use-case I have many (>> RAM) such lists (actually it's lists of offsets in some files). FYI: Now we using some made-by-us implementation of EliasFano encoding.

AskAlexSharov avatar Mar 11 '23 05:03 AskAlexSharov

Right now the compression scheme doesn't allow decoding at an arbitrary position in the compressed buffer. So it's not really possible to implement a binary search.

I am considering modifying the compression logic in order to force the creation of new blocks at fixed positions in the compressed buffer. This has a small cost as a few bytes may be wasted at the end of the blocks, but it will allow decompression at arbitrary position.

I have updated the README with next steps in this direction.

ronanh avatar Mar 15 '23 21:03 ronanh