Add C-minHash variant
C-minHash (or Circulant minHash) needs only two permutations versus K permutations, in practice. This minHash variant could save significant storage space for large-scale data.
source: https://arxiv.org/abs/2109.03337
This is interesting! From my quick read of the abstract it seems improvement is in reducing the number of permutation functions, but not really reducing the number of hash values stored? Am I missing something?
Given that the hash values can be reproduced deterministically, they don't ned to be stored. The hash values can be recomputed lazily - trading compute for space. On Sunday, March 19, 2023, 10:49:25 a.m. PDT, Eric Zhu @.***> wrote:
This is interesting! From my quick read of the abstract it seems improvement is in reducing the number of permutation functions, but not really reducing the number of hash values stored? Am I missing something?
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
- need to be stored
Apologies for the typo. On Sunday, March 19, 2023, 10:59:30 a.m. PDT, Kevin Jones @.***> wrote:
Given that the hash values can be reproduced deterministically, they don't ned to be stored. The hash values can be recomputed lazily - trading compute for space. On Sunday, March 19, 2023, 10:49:25 a.m. PDT, Eric Zhu @.***> wrote:
This is interesting! From my quick read of the abstract it seems improvement is in reducing the number of permutation functions, but not really reducing the number of hash values stored? Am I missing something?
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
Hmm. I am not sure I am getting this correctly. Reading Algorithm 3 C-MinHash-(σ, π), it seems to me that the hash values are still needed from the first permutation. So let's say we have a set { a, b, c, d, e } and after hashing and first permutation function, it becomes { 12, 32, 43, 231, 4 }. After you applying k (e.g., k = 3) circular minhash it becomes {32, 34, 43} -- I just made these numbers up. So if my understanding is correct, then you either store the minimum hash values in the end (same as the classic minhash) or you store the original hash values, which is what we want to avoid in the first place.
On Sun, Mar 19, 2023 at 11:10 AM icsa @.***> wrote:
- need to be stored
Apologies for the typo. On Sunday, March 19, 2023, 10:59:30 a.m. PDT, Kevin Jones @.***> wrote:
Given that the hash values can be reproduced deterministically, they don't ned to be stored. The hash values can be recomputed lazily - trading compute for space. On Sunday, March 19, 2023, 10:49:25 a.m. PDT, Eric Zhu @.***> wrote:
This is interesting! From my quick read of the abstract it seems improvement is in reducing the number of permutation functions, but not really reducing the number of hash values stored? Am I missing something?
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
— Reply to this email directly, view it on GitHub https://github.com/ekzhu/datasketch/issues/203#issuecomment-1475350518, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACOGLWSK3TTCH2RXFR2QPLW45DZHANCNFSM6AAAAAAWAHDJKA . You are receiving this because you commented.Message ID: @.***>
The "Circulant" part means that you can use a random permutation then rotate the permutation by k (of K) items to create "new" permutations.
One permutation gets used K times under rotation instead of K separate random permutations. The rotated permutations can be computed "on the fly" and discarded. They do not need to be stored. On Sunday, March 19, 2023, 09:54:02 p.m. PDT, Eric Zhu @.***> wrote:
Hmm. I am not sure I am getting this correctly. Reading Algorithm 3 C-MinHash-(σ, π), it seems to me that the hash values are still needed from the first permutation. So let's say we have a set { a, b, c, d, e } and after hashing and first permutation function, it becomes { 12, 32, 43, 231, 4 }. After you applying k (e.g., k = 3) circular minhash it becomes {32, 34, 43} -- I just made these numbers up. So if my understanding is correct, then you either store the minimum hash values in the end (same as the classic minhash) or you store the original hash values, which is what we want to avoid in the first place.
On Sun, Mar 19, 2023 at 11:10 AM icsa @.***> wrote:
- need to be stored
Apologies for the typo. On Sunday, March 19, 2023, 10:59:30 a.m. PDT, Kevin Jones @.***> wrote:
Given that the hash values can be reproduced deterministically, they don't ned to be stored. The hash values can be recomputed lazily - trading compute for space. On Sunday, March 19, 2023, 10:49:25 a.m. PDT, Eric Zhu @.***> wrote:
This is interesting! From my quick read of the abstract it seems improvement is in reducing the number of permutation functions, but not really reducing the number of hash values stored? Am I missing something?
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
— Reply to this email directly, view it on GitHub https://github.com/ekzhu/datasketch/issues/203#issuecomment-1475350518, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACOGLWSK3TTCH2RXFR2QPLW45DZHANCNFSM6AAAAAAWAHDJKA . You are receiving this because you commented.Message ID: @.***>
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
But how do you compute the minimum value for each permutation function without having access to the input hash values?
On Sun, Mar 19, 2023 at 10:04 PM icsa @.***> wrote:
The "Circulant" part means that you can use a random permutation then rotate the permutation by k (of K) items to create "new" permutations.
One permutation gets used K times under rotation instead of K separate random permutations. The rotated permutations can be computed "on the fly" and discarded. They do not need to be stored. On Sunday, March 19, 2023, 09:54:02 p.m. PDT, Eric Zhu @.***> wrote:
Hmm. I am not sure I am getting this correctly. Reading Algorithm 3 C-MinHash-(σ, π), it seems to me that the hash values are still needed from the first permutation. So let's say we have a set { a, b, c, d, e } and after hashing and first permutation function, it becomes { 12, 32, 43, 231, 4 }. After you applying k (e.g., k = 3) circular minhash it becomes {32, 34, 43} -- I just made these numbers up. So if my understanding is correct, then you either store the minimum hash values in the end (same as the classic minhash) or you store the original hash values, which is what we want to avoid in the first place.
On Sun, Mar 19, 2023 at 11:10 AM icsa @.***> wrote:
- need to be stored
Apologies for the typo. On Sunday, March 19, 2023, 10:59:30 a.m. PDT, Kevin Jones @.***> wrote:
Given that the hash values can be reproduced deterministically, they don't ned to be stored. The hash values can be recomputed lazily - trading compute for space. On Sunday, March 19, 2023, 10:49:25 a.m. PDT, Eric Zhu @.***> wrote:
This is interesting! From my quick read of the abstract it seems improvement is in reducing the number of permutation functions, but not really reducing the number of hash values stored? Am I missing something?
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
— Reply to this email directly, view it on GitHub <https://github.com/ekzhu/datasketch/issues/203#issuecomment-1475350518 , or unsubscribe < https://github.com/notifications/unsubscribe-auth/AACOGLWSK3TTCH2RXFR2QPLW45DZHANCNFSM6AAAAAAWAHDJKA
. You are receiving this because you commented.Message ID: @.***>
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
— Reply to this email directly, view it on GitHub https://github.com/ekzhu/datasketch/issues/203#issuecomment-1475628132, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACOGLTNUDXGUNJ5WJVCRRLW47QO7ANCNFSM6AAAAAAWAHDJKA . You are receiving this because you commented.Message ID: @.***>
I mean "having access to the values after 1st permutation"
On Sun, Mar 19, 2023 at 10:09 PM Eric Zhu @.***> wrote:
But how do you compute the minimum value for each permutation function without having access to the input hash values?
On Sun, Mar 19, 2023 at 10:04 PM icsa @.***> wrote:
The "Circulant" part means that you can use a random permutation then rotate the permutation by k (of K) items to create "new" permutations.
One permutation gets used K times under rotation instead of K separate random permutations. The rotated permutations can be computed "on the fly" and discarded. They do not need to be stored. On Sunday, March 19, 2023, 09:54:02 p.m. PDT, Eric Zhu @.***> wrote:
Hmm. I am not sure I am getting this correctly. Reading Algorithm 3 C-MinHash-(σ, π), it seems to me that the hash values are still needed from the first permutation. So let's say we have a set { a, b, c, d, e } and after hashing and first permutation function, it becomes { 12, 32, 43, 231, 4 }. After you applying k (e.g., k = 3) circular minhash it becomes {32, 34, 43} -- I just made these numbers up. So if my understanding is correct, then you either store the minimum hash values in the end (same as the classic minhash) or you store the original hash values, which is what we want to avoid in the first place.
On Sun, Mar 19, 2023 at 11:10 AM icsa @.***> wrote:
- need to be stored
Apologies for the typo. On Sunday, March 19, 2023, 10:59:30 a.m. PDT, Kevin Jones @.***> wrote:
Given that the hash values can be reproduced deterministically, they don't ned to be stored. The hash values can be recomputed lazily - trading compute for space. On Sunday, March 19, 2023, 10:49:25 a.m. PDT, Eric Zhu @.***> wrote:
This is interesting! From my quick read of the abstract it seems improvement is in reducing the number of permutation functions, but not really reducing the number of hash values stored? Am I missing something?
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
— Reply to this email directly, view it on GitHub <https://github.com/ekzhu/datasketch/issues/203#issuecomment-1475350518 , or unsubscribe < https://github.com/notifications/unsubscribe-auth/AACOGLWSK3TTCH2RXFR2QPLW45DZHANCNFSM6AAAAAAWAHDJKA
. You are receiving this because you commented.Message ID: @.***>
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
— Reply to this email directly, view it on GitHub https://github.com/ekzhu/datasketch/issues/203#issuecomment-1475628132, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACOGLTNUDXGUNJ5WJVCRRLW47QO7ANCNFSM6AAAAAAWAHDJKA . You are receiving this because you commented.Message ID: @.***>
But how do you compute the minimum value for each permutation function without having access to the input hash values?
Lazily, perhaps using a Python generator.The hash values would be computed dynamically from the dynamically generated permutation.
From the paper:
Algorithm 2 C-MinHash-(0, π) Input: Binary data vector v ∈ {0, 1}D , Permutation vector π: [D] → [D]Output: Hash values h1(v), ..., hK (v)
For k = 1 to K
Shift π circulantly rightwards by k units: πk = π→k hk(v) ← mini:vi̸=0 π→k(i) // hash(k) <- apply the "virtual" permutation to the data and find the first bit set
End For
On Sunday, March 19, 2023, 10:10:50 p.m. PDT, Eric Zhu ***@***.***> wrote:
I mean "having access to the values after 1st permutation"
On Sun, Mar 19, 2023 at 10:09 PM Eric Zhu @.***> wrote:
But how do you compute the minimum value for each permutation function without having access to the input hash values?
On Sun, Mar 19, 2023 at 10:04 PM icsa @.***> wrote:
The "Circulant" part means that you can use a random permutation then rotate the permutation by k (of K) items to create "new" permutations.
One permutation gets used K times under rotation instead of K separate random permutations. The rotated permutations can be computed "on the fly" and discarded. They do not need to be stored. On Sunday, March 19, 2023, 09:54:02 p.m. PDT, Eric Zhu @.***> wrote:
Hmm. I am not sure I am getting this correctly. Reading Algorithm 3 C-MinHash-(σ, π), it seems to me that the hash values are still needed from the first permutation. So let's say we have a set { a, b, c, d, e } and after hashing and first permutation function, it becomes { 12, 32, 43, 231, 4 }. After you applying k (e.g., k = 3) circular minhash it becomes {32, 34, 43} -- I just made these numbers up. So if my understanding is correct, then you either store the minimum hash values in the end (same as the classic minhash) or you store the original hash values, which is what we want to avoid in the first place.
On Sun, Mar 19, 2023 at 11:10 AM icsa @.***> wrote:
- need to be stored
Apologies for the typo. On Sunday, March 19, 2023, 10:59:30 a.m. PDT, Kevin Jones @.***> wrote:
Given that the hash values can be reproduced deterministically, they don't ned to be stored. The hash values can be recomputed lazily - trading compute for space. On Sunday, March 19, 2023, 10:49:25 a.m. PDT, Eric Zhu @.***> wrote:
This is interesting! From my quick read of the abstract it seems improvement is in reducing the number of permutation functions, but not really reducing the number of hash values stored? Am I missing something?
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
— Reply to this email directly, view it on GitHub <https://github.com/ekzhu/datasketch/issues/203#issuecomment-1475350518 , or unsubscribe < https://github.com/notifications/unsubscribe-auth/AACOGLWSK3TTCH2RXFR2QPLW45DZHANCNFSM6AAAAAAWAHDJKA
. You are receiving this because you commented.Message ID: @.***>
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
— Reply to this email directly, view it on GitHub https://github.com/ekzhu/datasketch/issues/203#issuecomment-1475628132, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACOGLTNUDXGUNJ5WJVCRRLW47QO7ANCNFSM6AAAAAAWAHDJKA . You are receiving this because you commented.Message ID: @.***>
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
Sorry I am still not getting this. Can you show me a Python example of lazy computation?
On Sun, Mar 19, 2023 at 10:23 PM icsa @.***> wrote:
But how do you compute the minimum value for each permutation function without having access to the input hash values?
Lazily, perhaps using a Python generator.The hash values would be computed dynamically from the dynamically generated permutation.
From the paper:
Algorithm 2 C-MinHash-(0, π) Input: Binary data vector v ∈ {0, 1}D , Permutation vector π: [D] → [D]Output: Hash values h1(v), ..., hK (v)
For k = 1 to K
Shift π circulantly rightwards by k units: πk = π→k hk(v) ← mini:vi̸=0 π→k(i) // hash(k) <- apply the "virtual" permutation to the data and find the first bit set
End For
On Sunday, March 19, 2023, 10:10:50 p.m. PDT, Eric Zhu @.***> wrote:
I mean "having access to the values after 1st permutation"
On Sun, Mar 19, 2023 at 10:09 PM Eric Zhu @.***> wrote:
But how do you compute the minimum value for each permutation function without having access to the input hash values?
On Sun, Mar 19, 2023 at 10:04 PM icsa @.***> wrote:
The "Circulant" part means that you can use a random permutation then rotate the permutation by k (of K) items to create "new" permutations.
One permutation gets used K times under rotation instead of K separate random permutations. The rotated permutations can be computed "on the fly" and discarded. They do not need to be stored. On Sunday, March 19, 2023, 09:54:02 p.m. PDT, Eric Zhu @.***> wrote:
Hmm. I am not sure I am getting this correctly. Reading Algorithm 3 C-MinHash-(σ, π), it seems to me that the hash values are still needed from the first permutation. So let's say we have a set { a, b, c, d, e } and after hashing and first permutation function, it becomes { 12, 32, 43, 231, 4 }. After you applying k (e.g., k = 3) circular minhash it becomes {32, 34, 43} -- I just made these numbers up. So if my understanding is correct, then you either store the minimum hash values in the end (same as the classic minhash) or you store the original hash values, which is what we want to avoid in the first place.
On Sun, Mar 19, 2023 at 11:10 AM icsa @.***> wrote:
- need to be stored
Apologies for the typo. On Sunday, March 19, 2023, 10:59:30 a.m. PDT, Kevin Jones @.***> wrote:
Given that the hash values can be reproduced deterministically, they don't ned to be stored. The hash values can be recomputed lazily - trading compute for space. On Sunday, March 19, 2023, 10:49:25 a.m. PDT, Eric Zhu @.***> wrote:
This is interesting! From my quick read of the abstract it seems improvement is in reducing the number of permutation functions, but not really reducing the number of hash values stored? Am I missing something?
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
— Reply to this email directly, view it on GitHub < https://github.com/ekzhu/datasketch/issues/203#issuecomment-1475350518 , or unsubscribe <
https://github.com/notifications/unsubscribe-auth/AACOGLWSK3TTCH2RXFR2QPLW45DZHANCNFSM6AAAAAAWAHDJKA
. You are receiving this because you commented.Message ID: @.***>
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
— Reply to this email directly, view it on GitHub <https://github.com/ekzhu/datasketch/issues/203#issuecomment-1475628132 , or unsubscribe < https://github.com/notifications/unsubscribe-auth/AACOGLTNUDXGUNJ5WJVCRRLW47QO7ANCNFSM6AAAAAAWAHDJKA
. You are receiving this because you commented.Message ID: @.***>
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
— Reply to this email directly, view it on GitHub https://github.com/ekzhu/datasketch/issues/203#issuecomment-1475643069, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACOGLRKJTPLE5MCPPBQMPTW47STTANCNFSM6AAAAAAWAHDJKA . You are receiving this because you commented.Message ID: @.***>
Let me work on a python implementation of the C-minHash algorithm to demonstrate the concept in context. I'll get you the outline of the code in the next few days. I'm quite busy, atm, using your library to great effect :) In the mean time, here is a example of lazy computation in python:
def fibonacci(): a, b = 0, 1 while True: yield a a, b = b, a + bThe fibonacci values are computed on demand, one at a time, as needed. Each call to fibonacci() produces one subsequent value. source: Lazy evaluation in Python
| | | | Lazy evaluation in Python
Jakub Konka
Jakub Konka's weblog about programming, game theory, maths, and more. |
|
|
On Sunday, March 19, 2023, 11:00:25 p.m. PDT, Eric Zhu ***@***.***> wrote:
Sorry I am still not getting this. Can you show me a Python example of lazy computation?
On Sun, Mar 19, 2023 at 10:23 PM icsa @.***> wrote:
But how do you compute the minimum value for each permutation function without having access to the input hash values?
Lazily, perhaps using a Python generator.The hash values would be computed dynamically from the dynamically generated permutation.
From the paper:
Algorithm 2 C-MinHash-(0, π) Input: Binary data vector v ∈ {0, 1}D , Permutation vector π: [D] → [D]Output: Hash values h1(v), ..., hK (v)
For k = 1 to K
Shift π circulantly rightwards by k units: πk = π→k hk(v) ← mini:vi̸=0 π→k(i) // hash(k) <- apply the "virtual" permutation to the data and find the first bit set
End For
On Sunday, March 19, 2023, 10:10:50 p.m. PDT, Eric Zhu @.***> wrote:
I mean "having access to the values after 1st permutation"
On Sun, Mar 19, 2023 at 10:09 PM Eric Zhu @.***> wrote:
But how do you compute the minimum value for each permutation function without having access to the input hash values?
On Sun, Mar 19, 2023 at 10:04 PM icsa @.***> wrote:
The "Circulant" part means that you can use a random permutation then rotate the permutation by k (of K) items to create "new" permutations.
One permutation gets used K times under rotation instead of K separate random permutations. The rotated permutations can be computed "on the fly" and discarded. They do not need to be stored. On Sunday, March 19, 2023, 09:54:02 p.m. PDT, Eric Zhu @.***> wrote:
Hmm. I am not sure I am getting this correctly. Reading Algorithm 3 C-MinHash-(σ, π), it seems to me that the hash values are still needed from the first permutation. So let's say we have a set { a, b, c, d, e } and after hashing and first permutation function, it becomes { 12, 32, 43, 231, 4 }. After you applying k (e.g., k = 3) circular minhash it becomes {32, 34, 43} -- I just made these numbers up. So if my understanding is correct, then you either store the minimum hash values in the end (same as the classic minhash) or you store the original hash values, which is what we want to avoid in the first place.
On Sun, Mar 19, 2023 at 11:10 AM icsa @.***> wrote:
- need to be stored
Apologies for the typo. On Sunday, March 19, 2023, 10:59:30 a.m. PDT, Kevin Jones @.***> wrote:
Given that the hash values can be reproduced deterministically, they don't ned to be stored. The hash values can be recomputed lazily - trading compute for space. On Sunday, March 19, 2023, 10:49:25 a.m. PDT, Eric Zhu @.***> wrote:
This is interesting! From my quick read of the abstract it seems improvement is in reducing the number of permutation functions, but not really reducing the number of hash values stored? Am I missing something?
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
— Reply to this email directly, view it on GitHub < https://github.com/ekzhu/datasketch/issues/203#issuecomment-1475350518 , or unsubscribe <
https://github.com/notifications/unsubscribe-auth/AACOGLWSK3TTCH2RXFR2QPLW45DZHANCNFSM6AAAAAAWAHDJKA
. You are receiving this because you commented.Message ID: @.***>
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
— Reply to this email directly, view it on GitHub <https://github.com/ekzhu/datasketch/issues/203#issuecomment-1475628132 , or unsubscribe < https://github.com/notifications/unsubscribe-auth/AACOGLTNUDXGUNJ5WJVCRRLW47QO7ANCNFSM6AAAAAAWAHDJKA
. You are receiving this because you commented.Message ID: @.***>
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
— Reply to this email directly, view it on GitHub https://github.com/ekzhu/datasketch/issues/203#issuecomment-1475643069, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACOGLRKJTPLE5MCPPBQMPTW47STTANCNFSM6AAAAAAWAHDJKA . You are receiving this because you commented.Message ID: @.***>
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>
Looking forward to see what you build!