pcompress
frequency based chunking
Hey, I was just reading this paper, http://www-users.cs.umn.edu/~lv/FBC.pdf, and looked for implementations but didn't find any. It proposes some pretty interesting space-saving techniques that go beyond, e.g., Rabin fingerprinting.
I was just curious whether you've seen that paper and considered that approach.
I have looked at this. However, the approach I used provides high accuracy (>95% of the efficiency of brute-force matching) and extreme space savings. It is possible to achieve petascale matching with a RAM-based high-level index.
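
For anyone following the thread, here is a minimal sketch of the baseline being discussed: content-defined chunking driven by a rolling hash, plus an in-memory chunk index. It is not pcompress's implementation and not the FBC algorithm from the paper. It uses a simple polynomial rolling hash as a stand-in for a true Rabin fingerprint, and the names `chunk_boundaries`, `dedup`, `restore` and all constants are hypothetical, chosen only for illustration.

```python
import hashlib

# All values below are illustrative assumptions, not pcompress settings.
WINDOW = 16            # bytes covered by the rolling window
MASK = (1 << 6) - 1    # cut when the low 6 hash bits are zero
BASE = 257             # polynomial base for the rolling hash
MOD = (1 << 61) - 1    # large prime modulus
MIN_CHUNK = 16         # never cut before the window is full
MAX_CHUNK = 1024       # force a cut if no boundary is found


def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the oldest byte in the window
    start = 0
    h = 0
    for i, b in enumerate(data):
        pos = i - start
        if pos >= WINDOW:
            # Slide the window: drop the byte that falls out of it.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        length = pos + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield start, i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield start, len(data)  # trailing partial chunk


def dedup(data: bytes, index: dict):
    """Store unseen chunks in `index`; return (recipe, bytes newly stored)."""
    recipe = []
    new_bytes = 0
    for s, e in chunk_boundaries(data):
        chunk = data[s:e]
        digest = hashlib.sha256(chunk).digest()
        if digest not in index:
            index[digest] = chunk
            new_bytes += len(chunk)
        recipe.append(digest)
    return recipe, new_bytes


def restore(recipe, index):
    """Rebuild the original bytes from a list of chunk digests."""
    return b"".join(index[d] for d in recipe)
```

Feeding the same data through `dedup` twice with a shared `index` stores nothing new on the second pass, and `restore(recipe, index)` returns the original bytes. A plain dict keyed by every chunk digest, as used here, grows with the number of unique chunks, which is why large-scale systems keep only a coarser, higher-level index in RAM.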