
PTRHash fails to build on 10^12 keys

Open vigna opened this issue 1 year ago • 19 comments

This is with the parameters in many_keys.rs, just changing the number of keys:

[...]
part    507 bucket  0.01%  chain:   8388608
Too many displacements. Aborting!
Possible causes:
- Too many elements in part.
- Not enough empty slots => lower alpha.
- Not enough buckets     => increase c.
- Not enough entropy     => fix algorithm.

part   1350 bucket  0.03%  chain:   8388608
Too many displacements. Aborting!
Possible causes:
- Too many elements in part.
- Not enough empty slots => lower alpha.
- Not enough buckets     => increase c.
- Not enough entropy     => fix algorithm.

thread 'main' panicked at /home/vigna/git/PTRHash/src/lib.rs:369:13:
Failed to find a global seed after 3 tries for 262144 keys.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I can send the whole log if that's useful.

vigna avatar Nov 18 '23 07:11 vigna

BTW, somewhat unrelated, but this comment

    // Since running all queries is super slow, we only check a subset of them.
    // Although this doesn't completely check that there are no duplicate
    // mappings, by the birthday paradox we can be quite sure there are none
    // since we check way more than sqrt(n) of them.

has no mathematical basis. It would be correct if the output of the function were random, but it is not. If there are exactly two keys that collide, the probability of catching them is the probability that both of those keys end up in the subset you test, and that has nothing to do with the birthday paradox.
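Concretely (a quick calculation): if the test samples m of the n keys uniformly at random and exactly one pair collides, the pair is caught only when both of its keys land in the sample, i.e. with probability

    \Pr[\text{pair caught}] = \frac{\binom{n-2}{m-2}}{\binom{n}{m}} = \frac{m(m-1)}{n(n-1)} \approx \left(\frac{m}{n}\right)^2,

which for m ≈ √n is about 1/n, nowhere near the certainty the birthday argument would suggest.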

vigna avatar Nov 18 '23 07:11 vigna

Yeah so as the error says, the algorithm only works well if there are sufficiently many buckets and slots. You probably want to run with c=11 to have more buckets and significantly increase the probability that displacing finds a solution.

I still have to work out the math at some point. Currently the number of buckets scales as c * n / lg(n) (as in PTHash), but in practice it seems that the default c=9 fails for larger n, so one has to compensate for the /lg(n) manually by increasing c.
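As a rough sanity check of that scaling (a back-of-the-envelope on my side, taking lg as log2 of the global n):

    // Back-of-the-envelope for the c * n / lg(n) bucket scaling (lg taken as log2 of the global n).
    fn main() {
        let n: f64 = 1e12;
        for c in [9.0, 11.0, 12.0] {
            let buckets = c * n / n.log2();
            // c = 9:  ~2.26e11 buckets, ~4.43 keys/bucket
            // c = 11: ~2.76e11 buckets, ~3.62 keys/bucket
            println!("c = {c}: ~{buckets:.2e} buckets, ~{:.2} keys/bucket", n / buckets);
        }
    }

So at n = 10^12 the default c=9 packs roughly 4.4 keys into each bucket, and going to c=11 only brings that down to about 3.6.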

Ah yeah, I did realize at some point that the comment doesn't really make sense. Will fix it at some point.

It would be cool to be able to efficiently check all 10^12 keys using multiple threads. Maybe each thread could store a buffer of 10^6 to 10^9 keys, and then flush/write that once full. But still, there is so much writing going on that it's bound to be slow, I suspect.
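A minimal sketch of that idea, assuming rayon and an index closure standing in for the PTRHash query (the function name, signature, and chunk size here are mine, not the crate's API):

    use rayon::prelude::*;
    use std::sync::atomic::{AtomicU64, Ordering};

    /// Mark every queried slot in a shared atomic bitset; a bit that is already
    /// set means two keys mapped to the same slot, i.e. the mapping is not injective.
    fn check_injective<F>(keys: &[u64], n_slots: usize, index: F) -> bool
    where
        F: Fn(&u64) -> usize + Sync,
    {
        let bits: Vec<AtomicU64> = (0..(n_slots + 63) / 64).map(|_| AtomicU64::new(0)).collect();
        keys.par_chunks(1 << 20) // each rayon task streams through a ~10^6-key chunk
            .all(|chunk| {
                chunk.iter().all(|key| {
                    let slot = index(key);
                    let bit = 1u64 << (slot % 64);
                    // fetch_or returns the previous word; the bit already being set means a duplicate.
                    bits[slot / 64].fetch_or(bit, Ordering::Relaxed) & bit == 0
                })
            })
    }

The bitset itself needs n_slots/8 bytes, about 128 GB at 10^12 slots, so at that scale the check would probably have to be sharded as well.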

RagnarGrootKoerkamp avatar Nov 18 '23 09:11 RagnarGrootKoerkamp

That was 10, I'll try 11.


vigna avatar Nov 18 '23 09:11 vigna

Did this end up working?

RagnarGrootKoerkamp avatar Nov 21 '23 10:11 RagnarGrootKoerkamp

I tried with c=11 and a single shard but I got

        keys: 1000000000000
      shards:          1
       parts:    3892549
   slots/prt:     262144
   slots tot: 1020408365056
 buckets/prt:      72285
 buckets tot: 281372904465
 keys/bucket:          3.55
memory allocation of 2147483648 bytes failed

Now trying with more shards.

vigna avatar Nov 21 '23 10:11 vigna

It will store n/s hashes per shard. Do you have 10^12 / 1 * 16 bytes = 16 TB of RAM? :sweat_smile:

Probably you want ~1000 shards to get ~16GB memory per shard.
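For reference, a tiny helper making that sizing rule explicit (my own arithmetic, using the 16 bytes/hash figure above):

    /// Number of shards needed so that the per-shard hash buffer
    /// (16 bytes per key) fits within `mem_budget_bytes`.
    fn shards_for(n_keys: u64, mem_budget_bytes: u64) -> u64 {
        (n_keys * 16 + mem_budget_bytes - 1) / mem_budget_bytes
    }
    // shards_for(1_000_000_000_000, 16_000_000_000) == 1000: ~1000 shards at ~16 GB each.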

RagnarGrootKoerkamp avatar Nov 21 '23 11:11 RagnarGrootKoerkamp

With c=11 the first round failed. I'll make it do another couple of rounds. I'm using 1000 shards.

vigna avatar Nov 22 '23 13:11 vigna

Then c=12 maybe?

RagnarGrootKoerkamp avatar Nov 22 '23 13:11 RagnarGrootKoerkamp

Or maybe try alpha=0.97 or a bit lower. It's hard to know exactly where the bottleneck is.

RagnarGrootKoerkamp avatar Nov 22 '23 18:11 RagnarGrootKoerkamp

It didn't stop, but it is still at 800/1000 chunks after three days. Hardly usable in practice. If it finishes I'll try with a smaller alpha.

vigna avatar Nov 25 '23 07:11 vigna

Hmm, that's weird. Did you enable sharding to disk? Without it, it will hash all keys once for each shard, which is probably quite a bit slower.

RagnarGrootKoerkamp avatar Nov 26 '23 06:11 RagnarGrootKoerkamp

Oh no. Well, in my experience, it is better if the default options offer reasonable behavior. People won't always come to you for explanations; they'll just move elsewhere.

vigna avatar Nov 26 '23 08:11 vigna

On a better note: it completed! But it didn't print any stats about space, unless "keys/bucket: 3.26" counts. That's where serialization is really useful: you can measure the file to double-check.

vigna avatar Nov 26 '23 08:11 vigna

Will it obey TMPDIR? I can't find where it's creating the shards...

vigna avatar Nov 26 '23 08:11 vigna

Yeah well... People using this for up to 10^9 elements probably don't want sharding to disk and this program writing to disk unexpectedly. Maybe I'll make a separate function build_on_disk for visibility and clarity. The TMPDIR env var should work, yes.

It should normally print a bits/element stat at the end, but at some point I broke that and it was printing 0 instead. I think that's fixed now.

RagnarGrootKoerkamp avatar Nov 26 '23 08:11 RagnarGrootKoerkamp

Well, you might just log what you're doing, since you're already logging anyway. Something like "100 shards, keys will be re-read 100 times (switch to offline sharding for single-pass)". At least people will know what's happening and they can take action.

I agree on the "unexpectedly on disk" viewpoint; VFunc does the same. But in the VFunc case, if things do not go well you'll get an out-of-memory error, and people will look for offline options because that's what happens in 90% of these structures. The problem with PTRHash is that you do not get out-of-memory, you get slow. People do not take this behavior as a suggestion to change options, since it somehow works.

I modified the example to serialize (it'll be fun). I'll keep you posted. Probably with memory-mapping I can even try to validate the structure.

vigna avatar Nov 26 '23 08:11 vigna

Yeah well... People using this for up to 10^9 elements probably don't want sharding to disk and this program writing to disk unexpectedly. Maybe I'll make a separate function build_on_disk for visibility and clarity. The TMPDIR env var should work, yes.

It does!

TMP PATH: "/mnt/extra/analysis/vigna/.tmpnLSHXt"

vigna avatar Nov 26 '23 08:11 vigna

Well, you might just log what you're doing, since you're already logging anyway. Something like "100 shards, keys will be re-read 100 times (switch to offline sharding for single-pass)". At least people will know what's happening and they can take action.

Good idea. I will add some logs.

I agree on the "unexpectedly on disk" viewpoint; VFunc does the same. But in the VFunc case, if things do not go well you'll get an out-of-memory error, and people will look for offline options because that's what happens in 90% of these structures. The problem with PTRHash is that you do not get out-of-memory, you get slow. People do not take this behavior as a suggestion to change options, since it somehow works.

Hmm very interesting point. So it sounds like going out-of-memory is a preferred default behaviour over both sharding with rereading and sharding to disk. I'll probably just make 3 distinct entry points for these options.
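A hypothetical shape for that (the names here are mine, not the crate's actual API), which would also cover the logging suggestion above:

    /// Hypothetical sketch, not the crate's API: make the construction strategy
    /// an explicit choice and log its cost up front.
    enum BuildMode {
        /// Keep all hashes in memory; fails fast with an allocation error when n is too large.
        InMemory,
        /// Hash the full key set once per shard; no disk use, but `shards` passes over the keys.
        RehashPerShard { shards: usize },
        /// Write shard files to $TMPDIR so the keys are hashed only once.
        OnDisk { shards: usize },
    }

    fn log_plan(mode: &BuildMode, n_keys: u64) {
        match mode {
            BuildMode::InMemory => eprintln!(
                "1 shard: allocating ~{} GB of hashes up front",
                16 * n_keys / 1_000_000_000
            ),
            BuildMode::RehashPerShard { shards } => eprintln!(
                "{shards} shards: keys will be re-read {shards} times (use on-disk sharding for a single pass)"
            ),
            BuildMode::OnDisk { shards } => eprintln!(
                "{shards} shards: writing shard files to $TMPDIR, keys hashed once"
            ),
        }
    }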

RagnarGrootKoerkamp avatar Nov 29 '23 15:11 RagnarGrootKoerkamp

Hmm very interesting point. So it sounds like going out-of-memory is a preferred default behaviour over both sharding with rereading and sharding to disk. I'll probably just make 3 distinct entry points for these options.

"Is" is a big word :). That's my personal take on user psychology. Catastrophic disasters lead to search for solutions. Bad performance leads to think the that's the performance of the software. Sometimes the user will think "maybe I can fiddle", but I think less often than with a catastrophic result.

vigna avatar Nov 29 '23 18:11 vigna