high load factor
bytell_hash_set does not rehash at the expected load factor. My flat emhash can set the load factor to 1.0. I have updated dynamic_rehash_effect.cc.
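Note that getus() is not shown in the snippets below; a minimal stand-in (an assumption on my part, based on the results being divided by 1,000,000 to print seconds, so it is taken to return microseconds) could be:

#include <chrono>
#include <cstdint>

// Hypothetical microsecond timer used by the benchmarks below.
static int64_t getus()
{
    return std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::steady_clock::now().time_since_epoch()).count();
}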
#include <cstdint>
#include <iostream>
#include <random>
#include "hash_set4.hpp"       // emhash9::HashSet (assumed to be on the include path)
#include "bytell_hash_map.hpp" // ska::bytell_hash_set (assumed header name from skarupke/flat_hash_map)

static void high_load()
{
    size_t maxSize = 1U << 28;
    size_t numReps = 100;
    std::random_device rd;
    auto dis = std::uniform_int_distribution<uint32_t>{0, (1U << 31) - 1};
    for (size_t rep = 0; rep < numReps; ++rep) {
        {
            auto rng = std::mt19937(rep);
            //https://github.com/ktprime/emhash/blob/master/hash_set4.hpp
            emhash9::HashSet<uint32_t> set;
            set.max_load_factor(0.999);
            auto t1 = getus();
            while (set.size() < maxSize) {
                auto key = dis(rng);
                size_t prevCap = set.bucket_count();
                set.insert(key);
                if (set.bucket_count() > prevCap) {
                    // a rehash happened: report the old capacity, the size just
                    // before the triggering insert, and the resulting load factor
                    size_t prevSize = set.size() - 1;
                    auto lf = static_cast<double>(prevSize) / prevCap;
                    std::cout << prevCap << " " << prevSize << " " << lf << "\n";
                }
            }
            std::cout << "emhash loop " << rep << " time use " << (getus() - t1) / 1000000.000 << " sec\n";
        }
        {
            auto rng = std::mt19937(rep);
            ska::bytell_hash_set<uint32_t> set;
            set.max_load_factor(0.999);
            auto t1 = getus();
            while (set.size() < maxSize) {
                auto key = dis(rng);
                size_t prevCap = set.bucket_count();
                set.insert(key);
                if (set.bucket_count() > prevCap) {
                    // note: unlike the emhash branch, this uses the size after the insert
                    size_t prevSize = set.size();
                    auto lf = static_cast<double>(prevSize) / prevCap;
                    std::cout << prevCap << " " << prevSize << " " << lf << "\n";
                }
            }
            std::cout << "skaset loop " << rep << " time use " << (getus() - t1) / 1000000.000 << "sec\n\n";
        }
    }
}
// The result (each line: bucket count before the rehash, set size, load factor):
8 8 1
16 16 1
32 32 1
64 64 1
128 128 1
256 256 1
512 512 1
1024 1023 0.999023
2048 2046 0.999023
4096 4092 0.999023
8192 8184 0.999023
16384 16368 0.999023
32768 32736 0.999023
65536 65471 0.999008
131072 130941 0.999001
262144 261882 0.999001
524288 523764 0.999001
1048576 1047528 0.999001
2097152 2095055 0.999
4194304 4190110 0.999
8388608 8380220 0.999
16777216 16760439 0.999
33554432 33520878 0.999
67108864 67041756 0.999
134217728 134083510 0.999
268435456 268167021 0.999
emhash loop 0 time use 59.2812 sec

0 1 inf
16 16 1
32 32 1
64 64 1
128 128 1
256 255 0.996094
512 509 0.994141
1024 1014 0.990234
2048 1973 0.963379
4096 3929 0.959229
8192 7745 0.945435
16384 15540 0.948486
32768 30946 0.944397
65536 61592 0.939819
131072 121771 0.929039
262144 244278 0.931847
524288 485419 0.925863
1048576 963137 0.918519
2097152 1916894 0.914046
4194304 3790756 0.903787
8388608 7541070 0.898966
16777216 14901313 0.888187
33554432 29881340 0.890533
67108864 57400366 0.855332
134217728 115062312 0.857281
268435456 230845682 0.859967
skaset loop 0 time use 49.4685sec
Sorry, I am not sure what you are asking for in the issue? Is there something you want to change or some other behavior you would prefer we use?
Sorry, there is no actual issue. I just want to point out that bytell_hash_set's high load factor (0.999) is not maintained accurately during rehash in my test.
I believe @ktprime is saying that a proper implementation of an unordered set can rehash on insert only if load_factor() would otherwise become greater than max_load_factor(). In other words, the following test MUST pass for any Set that models std::unordered_set:
template <class Set>
void Test(Set& set, const typename Set::value_type& value) {
    size_t bucket_count = set.bucket_count();
    set.insert(value);
    if (bucket_count == set.bucket_count()) return;
    double would_be_load_factor = 1. * set.size() / bucket_count;
    assert(would_be_load_factor > set.max_load_factor());
}
(Let's ignore potential issues due to floating-point precision.)
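For concreteness, here is a small driver (my addition, assuming the Test template above is in scope) that exercises the check against std::unordered_set. With the usual standard-library implementations, which only grow the table when forced to, the assertion is expected to hold, although as discussed below the standard itself does not strictly promise it:

#include <cassert>
#include <random>
#include <unordered_set>

int main() {
    std::unordered_set<int> set;
    std::mt19937 rng(0);
    // Every insert that grows the table is checked against max_load_factor().
    for (int i = 0; i < 1000000; ++i)
        Test(set, static_cast<int>(rng()));
}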
I don't think this behavior is required by the standard. Here's what it says about max_load_factor():
Returns: A positive number that the container attempts to keep the load factor less than or equal to. The container automatically increases the number of buckets as necessary to keep the load factor below this number.
If my reading is correct, this sentence is saying that load_factor() <= max_load_factor() MUST be true. It does not, however, prohibit rehashing when it's not strictly required. This can be a useful property though: it would give users more control over load factor and would allow one to tell in advance whether insert() will rehash or not. Implementations of classes that model std::unordered_set are free to document whether they provide the extra guarantee. In the absence of explicit documentation on this front we should assume only what the standard says.
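As a sketch of why that extra guarantee is useful (my own illustration, assuming a set that rehashes only when the load factor would otherwise exceed max_load_factor()), the caller can tell before an insert whether the table can grow:

#include <unordered_set>

// Only meaningful for implementations that rehash solely when required;
// the standard alone does not promise this behavior.
template <class Set>
bool MayRehashOnNextInsert(const Set& set) {
    // A new element would push size() to size() + 1; if that still fits under
    // max_load_factor() with the current bucket count, a "rehash only when
    // necessary" implementation will not grow the table.
    return static_cast<double>(set.size() + 1) >
           set.max_load_factor() * static_cast<double>(set.bucket_count());
}

Without the guarantee, this check is only a heuristic, since the implementation is free to grow the table earlier.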
Most flat hash maps can't be set to a high load factor because of performance issues, but my emhash7 can set the load factor to 0.999. The following benchmark shows that emhash still has amazing performance even when inserting and erasing at a high load factor.
#include "hash_table7.hpp"
static void TestHighLoadFactor()
{
std::random_device rd;
const auto rand_key = rd();
std::mt19937_64 srngi(rand_key), srnge(rand_key);
const auto max_lf = 0.9990f; //<= 0.9999f
const auto vsize = 1u << (20 + rand_key % 6);//must be power of 2
emhash7::HashMap<int64_t, int> myhash(vsize, max_lf);
assert(myhash.bucket_count() == vsize); //no rehash
auto nowus = getus();
for (size_t i = 0; i < size_t(vsize * max_lf); i++)
myhash.emplace(srngi(), i);
assert(myhash.bucket_count() == vsize); //no rehash
//while (myhash.load_factor() < max_lf - 1e-3) myhash.emplace(srngi(), 0);
const auto insert_time = getus() - nowus; nowus = getus();
//erase & insert at a fixed load factor
for (size_t i = 0; i < vsize; i++) {
myhash.erase(srnge()); //erase a exist old key
myhash[srngi()] = 1;
}
const auto erase_time = getus() - nowus;
printf("vsize = %d, load factor = %.5f, insert/erase = %ld/%ld ms\n",
vsize, myhash.load_factor(), insert_time / 1000, erase_time / 1000);
assert(myhash.bucket_count() == vsize); //no rehash
assert(myhash.load_factor() >= max_lf - 0.001);
}
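A minimal driver (assumed, not part of the original snippet) to repeat the benchmark; each call picks its own vsize between 2^20 and 2^25 from the random seed:

int main()
{
    // A few runs exercise several table sizes at the same 0.999 load factor.
    for (int run = 0; run < 4; ++run)
        TestHighLoadFactor();
    return 0;
}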
vsize = 1048576, load factor = 0.9990, insert/erase = 25/76 ms
vsize = 2097152, load factor = 0.9990, insert/erase = 52/222 ms
vsize = 4194304, load factor = 0.9990, insert/erase = 117/450 ms
vsize = 8388608, load factor = 0.9990, insert/erase = 251/1009 ms