Advice for compression of a big graph
I am working with a large dataset of approximately 1 billion sets, where each set contains integers between 0 and one billion and typically has a few hundred elements (the sets represent a graph). I would like to compress them while preserving the approximate number of common items (or a metric that correlates with it). My only use case is to be able to quickly compare two sets given their ids. Ideally, I want to map each set to a short vector so that the dot product approximates the number of common elements, and then put those vectors into a numpy array or a faiss index. My main concerns are speed and RAM usage. Could I use your library for this task, and do you have any advice? Thank you for your time and help.
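For illustration, here is a minimal sketch of the kind of mapping described above, using datasketch's MinHash and one-hot bucketing of each hash value so that the dot product of two encodings counts matching permutations, which in expectation is roughly num_perm * Jaccard(A, B). The bucket count, permutation count, and example sets are placeholders, not a recommendation from the library.

```python
import numpy as np
from datasketch import MinHash

num_perm = 64    # MinHash permutations per set
n_buckets = 16   # buckets per permutation; vectors have num_perm * n_buckets dims

def minhash_vector(items, num_perm=num_perm, n_buckets=n_buckets):
    """Encode a set as a dense 0/1 vector whose dot product counts matching MinHash values."""
    m = MinHash(num_perm=num_perm)
    for x in items:
        m.update(str(x).encode("utf8"))
    vec = np.zeros(num_perm * n_buckets, dtype=np.float32)
    for i, h in enumerate(m.hashvalues):
        vec[i * n_buckets + int(h) % n_buckets] = 1.0
    return vec

a = minhash_vector(range(0, 500))
b = minhash_vector(range(250, 750))
# Jaccard here is 250/750 = 1/3, so the dot product is roughly 64 * 1/3 ≈ 21,
# plus a little noise from random bucket collisions.
print(a @ b)
```

Such vectors can be stored in a numpy array or indexed with faiss inner-product search; the dot product correlates with the intersection size when set sizes are comparable.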
Thanks for raising this issue. You can try our MinHashLSH index with the Redis storage backend for scale. With that many sets, depending on their sizes, you will probably still need a beefy machine with enough memory to hold the MinHashLSH index backed by Redis.
Say you have N sets: the memory usage of all MinHash sketches will be at minimum N * num_perm * 8 bytes, and the memory size of the index is approximately equal to the total size of all MinHash sketches. With N = 10^9 and num_perm = 64, that is 10^9 * 64 * 8 bytes = 512 GB of memory needed for Redis.
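A rough sketch of that setup with the datasketch API, assuming a Redis server on localhost:6379; the threshold, keys, and example sets are placeholders.

```python
from datasketch import MinHash, MinHashLSH

num_perm = 64  # sketches alone take ~N * num_perm * 8 bytes (512 GB for N = 10^9)

def to_minhash(items):
    m = MinHash(num_perm=num_perm)
    for x in items:
        m.update(str(x).encode("utf8"))
    return m

# MinHashLSH backed by Redis instead of in-process dictionaries.
lsh = MinHashLSH(
    threshold=0.5,
    num_perm=num_perm,
    storage_config={"type": "redis", "redis": {"host": "localhost", "port": 6379}},
)

lsh.insert("set_0", to_minhash(range(0, 500)))    # index one set under its id
print(lsh.query(to_minhash(range(250, 750))))     # ids of candidate similar sets
```

Note that the query returns candidate set ids above the Jaccard threshold; the MinHash objects themselves can estimate Jaccard similarity (m1.jaccard(m2)), which correlates with the number of common elements.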
Consider Roaring Bitmaps to represent each [compressed] set of integers.
https://github.com/RoaringBitmap
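A small example of that idea using pyroaring, one of the Python bindings for CRoaring; the sets here are illustrative.

```python
from pyroaring import BitMap

a = BitMap(range(0, 500))
b = BitMap(range(250, 750))

print(len(a & b))          # exact number of common elements (250), computed on the compressed sets
print(len(a.serialize()))  # serialized size in bytes, typically far smaller than a raw integer array
```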
Interesting. Have you tried using this type of integer compression for MinHash signatures?