PTRHash fails to build on 10^12 keys
This is with the parameters in many_keys.rs, just changing the number of keys:
[...]
part 507 bucket 0.01% chain: 8388608
Too many displacements. Aborting!
Possible causes:
- Too many elements in part.
- Not enough empty slots => lower alpha.
- Not enough buckets => increase c.
- Not enough entropy => fix algorithm.
part 1350 bucket 0.03% chain: 8388608
Too many displacements. Aborting!
Possible causes:
- Too many elements in part.
- Not enough empty slots => lower alpha.
- Not enough buckets => increase c.
- Not enough entropy => fix algorithm.
thread 'main' panicked at /home/vigna/git/PTRHash/src/lib.rs:369:13:
Failed to find a global seed after 3 tries for 262144 keys.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
I can send the whole log if that's useful.
BTW, somewhat unrelated, but this comment
// Since running all queries is super slow, we only check a subset of them.
// Although this doesn't completely check that there are no duplicate
// mappings, by the birthday paradox we can be quite sure there are none
// since we check way more than sqrt(n) of them.
has no mathematical basis. It would be correct if the output of the function were random, but it is not. If there are exactly two keys colliding, the probability of hitting them is the probability of having exactly those two keys in the subset you test, and this has nothing to do with the birthday paradox.
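To put a number on that (my own back-of-the-envelope, not from the thread): if exactly two of the n keys map to the same slot and the test draws a uniformly random subset of m keys, the collision is detected only if both offending keys are in the subset, which happens with probability

m/n * (m-1)/(n-1) ≈ (m/n)^2.

For n = 10^12 and m = 10^7 (already far more than sqrt(n) = 10^6) that is about 10^-10; the birthday bound only applies when the checked values are independent and uniform, not to a fixed function with one fixed colliding pair.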
Yeah so as the error says, the algorithm only works well if there are sufficiently many buckets and slots. You probably want to run with c=11 to have more buckets and significantly increase the probability that displacing finds a solution.
I still have to work out the math at some point. Currently the number of buckets scales as c * n / lg(n) (as does PTHash), but in practice the default c=9 seems to fail for larger n, so one has to compensate manually for the /lg(n) by increasing c.
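To illustrate the effect (a rough sketch of mine, using only the c * n / lg(n) scaling above; PTRHash's real bucket sizes are deliberately skewed, so this is just the average load):

```rust
/// Average bucket load if the number of buckets scales as c * n / lg(n):
/// load ≈ lg(n) / c, so it grows with n at fixed c.
fn expected_keys_per_bucket(n: f64, c: f64) -> f64 {
    n.log2() / c
}

fn main() {
    for &(n, c) in &[(1e9, 9.0), (1e12, 9.0), (1e12, 11.0), (1e12, 12.0)] {
        println!("n = {n:e}, c = {c}: ~{:.2} keys/bucket", expected_keys_per_bucket(n, c));
    }
}
```

With c=9 the average grows from about 3.3 keys/bucket at 10^9 keys to about 4.4 at 10^12; c=11 at 10^12 gives about 3.6 (close to the keys/bucket: 3.55 in the log below), and c=12 roughly restores the load that c=9 gives at 10^9.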
Ah yeah, I did realize at some point that the comment doesn't really make sense. Will fix it at some point.
It would be cool to be able to efficiently check all 10^12 keys using multiple threads. Maybe each thread could store a buffer of 10^6 to 10^9 keys, and then flush/write it once full. But still, there is so much writing going on that I suspect it's bound to be slow.
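A minimal sketch of that idea (mine, not code from the repo; `index_of` stands in for the PTRHash query, and the keys are just 0..n for illustration): each thread buffers about a million computed indices and flushes them into a shared atomic bitmap, where an already-set bit means two keys were mapped to the same slot.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Check that `index_of` maps the keys 0..n to n distinct slots (it is assumed
/// to return values < n), using `num_threads` threads that each buffer indices
/// locally and flush them into a shared atomic bitmap once the buffer is full.
fn check_no_duplicates(n: usize, num_threads: usize, index_of: impl Fn(u64) -> usize + Sync) -> bool {
    let bitmap: Vec<AtomicU64> = (0..(n + 63) / 64).map(|_| AtomicU64::new(0)).collect();
    let ok = AtomicBool::new(true);
    const BUF: usize = 1 << 20; // ~10^6 buffered indices per thread

    std::thread::scope(|s| {
        for t in 0..num_threads {
            let (bitmap, ok, index_of) = (&bitmap, &ok, &index_of);
            s.spawn(move || {
                let mut buf = Vec::with_capacity(BUF);
                // Thread t handles keys t, t + T, t + 2T, ...
                for key in (t as u64..n as u64).step_by(num_threads) {
                    buf.push(index_of(key));
                    if buf.len() == BUF {
                        flush(&mut buf, bitmap, ok);
                    }
                }
                flush(&mut buf, bitmap, ok);
            });
        }
    });
    ok.load(Ordering::Relaxed)
}

/// Set one bit per index; a bit that was already set means a duplicate mapping.
fn flush(buf: &mut Vec<usize>, bitmap: &[AtomicU64], ok: &AtomicBool) {
    for &i in buf.iter() {
        let mask = 1u64 << (i % 64);
        if (bitmap[i / 64].fetch_or(mask, Ordering::Relaxed) & mask) != 0 {
            ok.store(false, Ordering::Relaxed);
        }
    }
    buf.clear();
}
```

The bitmap itself is n bits, about 125 GB for 10^12 keys, so the writes the comment above worries about don't disappear; the per-thread buffer just batches them.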
That was with c=10, I'll try 11.
Did this end up working?
I tried with c=11 and a single shard but I got
keys: 1000000000000
shards: 1
parts: 3892549
slots/prt: 262144
slots tot: 1020408365056
buckets/prt: 72285
buckets tot: 281372904465
keys/bucket: 3.55
memory allocation of 2147483648 bytes failed
Now trying with more shards.
It will store n/s hashes per shard. Do you have 10^12 / 1 * 16 bytes = 16 TB of RAM? :sweat_smile:
Probably you want ~1000 shards to get ~16GB memory per shard.
With c=11 the first round failed. I'll make it do another couple of rounds. I'm using 1000 shards.
Then c=12 maybe?
Or maybe try alpha=0.97 or a bit lower. It's hard to know exactly where the bottleneck is.
It didn't stop, but it is still at 800/1000 chunks after three days. Hardly usable in practice. If it finishes I'll try with a smaller alpha.
Hmm. That's weird. Did you enable sharding to disk? Without it, it will hash all keys once for each shard, which is probably quite a bit slower.
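For concreteness, a sketch of the two strategies (mine, not the repo's code; `hash` and `build_shard` are placeholders): without sharding to disk, only one shard's hashes are held in memory at a time but the whole key stream is re-hashed once per shard, i.e. roughly 1000 passes for 1000 shards; offline sharding hashes every key once and spills it to its shard.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Placeholder hash; PTRHash uses its own hashers.
fn hash(key: u64) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

// The high bits of the hash pick the shard.
fn shard_of(h: u64, shards: u64) -> u64 {
    ((h as u128 * shards as u128) >> 64) as u64
}

/// Without sharding to disk: only one shard's hashes are in memory at a time,
/// but the key stream is re-read and re-hashed once per shard.
fn build_by_rereading(keys: &[u64], shards: u64, mut build_shard: impl FnMut(u64, &[u64])) {
    for s in 0..shards {
        let shard: Vec<u64> = keys
            .iter()
            .map(|&k| hash(k))
            .filter(|&h| shard_of(h, shards) == s)
            .collect();
        build_shard(s, shard.as_slice());
    }
}

/// Sharding to disk (in-memory buckets here to keep the sketch short):
/// every key is hashed exactly once and appended to its shard's bucket.
fn build_by_single_pass(keys: &[u64], shards: u64, mut build_shard: impl FnMut(u64, &[u64])) {
    let mut buckets = vec![Vec::new(); shards as usize];
    for &k in keys {
        let h = hash(k);
        buckets[shard_of(h, shards) as usize].push(h);
    }
    for (s, shard) in buckets.iter().enumerate() {
        build_shard(s as u64, shard.as_slice());
    }
}

fn main() {
    let keys: Vec<u64> = (0..1_000_000).collect();
    let mut sizes = vec![0usize; 4];
    build_by_single_pass(&keys, 4, |s, shard| sizes[s as usize] = shard.len());
    build_by_rereading(&keys, 4, |s, shard| assert_eq!(shard.len(), sizes[s as usize]));
    println!("shard sizes: {sizes:?}");
}
```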
Oh no. Well, in my experience, it is better if the default options offer reasonable behavior. People won't always come to you for explanations; they'll just move elsewhere.
On a better note: it completed! But it didn't print any stats about space, unless "keys/bucket: 3.26" is it. That's where serialization is really useful: you can measure the file to double-check.
Will it obey TMPDIR? I can't find where it's creating the shards...
Yeah well... People using this for up to 10^9 elements probably don't want sharding to disk and this program writing to disk unexpectedly. Maybe I'll make a separate function build_on_disk for visibility and clarity. The TMPDIR env var should work, yes.
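For what it's worth, a quick way to check where temp files should land (assuming the shards go through the standard temp dir; on Unix, std::env::temp_dir() honours $TMPDIR):

```rust
fn main() {
    // Run as e.g. `TMPDIR=/mnt/extra/tmp ./this_binary`; files created via the
    // standard temp-dir machinery end up under the printed directory.
    println!("temp dir: {}", std::env::temp_dir().display());
}
```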
It should normally print a bits/element stat at the end, but at some point I broke that and it was printing 0 instead. I think that's fixed now.
Well, you might just log what you're doing, since you're already logging anyway. Something like "100 shards, keys will be re-read 100 times (switch to offline sharding for single-pass)". At least people will know what's happening and they can take action.
I agree on the "unexpectedly on disk" viewpoint; VFunc does the same. But in the VFunc case, if things do not go well you get an out-of-memory error, and people will look for offline options because that's what happens in 90% of these structures. The problem with PTRHash is that you do not get out of memory: you get slow. People do not take this behavior as a suggestion to change options, since it somehow works.
I modified the example to serialize (it'll be fun). I'll keep you posted. Probably with memory-mapping I can even try to validate the structure.
> The TMPDIR env var should work, yes.
It does!
TMP PATH: "/mnt/extra/analysis/vigna/.tmpnLSHXt"
> Well, you might just log what you're doing, since you're already logging anyway.
Good idea. I will add some logs.
> The problem with PTRHash is that you do not get out of memory: you get slow.
Hmm, very interesting point. So it sounds like going out-of-memory is a preferred default behaviour over both sharding with rereading and sharding to disk. I'll probably just make 3 distinct entry points for these options.
"Is" is a big word :). That's my personal take on user psychology. Catastrophic disasters lead to search for solutions. Bad performance leads to think the that's the performance of the software. Sometimes the user will think "maybe I can fiddle", but I think less often than with a catastrophic result.