
`ht_get_hash` not working correctly

ElginCahangirov opened this issue 3 years ago · 6 comments

ht_get_hash keeps returning the same slot for successive attempts, sometimes up to very high attempt counts. Example:

(gdb) p ht_get_hash("mybrujmaonutk", 211, 0)
$32 = 38
(gdb) p ht_get_hash("mybrujmaonutk", 211, 1)
$33 = 38
(gdb) p ht_get_hash("mybrujmaonutk", 211, 2)
$34 = 38
// and so on up to ...
(gdb) p ht_get_hash("mybrujmaonutk", 211, 10177647)
$35 = 38
(gdb) p ht_get_hash("mybrujmaonutk", 211, 10177648)
$36 = -13

My definitions:

static int ht_hash(const char* s, const int a, const int m) {
    long hash = 0;
    const int len_s = strlen(s);
    for (int i = 0; i < len_s; i++) {
        hash += (long)pow(a, len_s - (i+1)) * s[i];
        hash = hash % m;
    }
    return (int)hash;
}

static int ht_get_hash(
    const char* s, const int num_buckets, const int attempt
) {
    const int hash_a = ht_hash(s, HT_PRIME_1, num_buckets);
    const int hash_b = ht_hash(s, HT_PRIME_2, num_buckets);

    return (hash_a + (attempt * (hash_b + 1))) % num_buckets;
}

ElginCahangirov avatar Dec 03 '22 11:12 ElginCahangirov

What have you set HT_PRIME_1 and HT_PRIME_2 to? I am facing the same issue

nouritsu avatar Apr 19 '23 12:04 nouritsu

I have found a fix for this issue. Make sure the HT_PRIME values are defined as follows (with no trailing semicolons, otherwise the macros expand into the ht_hash calls and break compilation):

#define HT_PRIME_1 151
#define HT_PRIME_2 163

With these values, ht_get_hash no longer returns negative values for me and finds a free slot in far fewer attempts.

nouritsu avatar Apr 20 '23 04:04 nouritsu

@nouritsu my HT_PRIME_1 and HT_PRIME_2 were set to 129 and 131. After changing them to 151 and 163, the issue still persists. See my driver code below to reproduce:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include "hash_table.h"

char* rand_word(int min_len, int max_len) {
    int i, length;
    char c;

    // seed random from nanoseconds so back-to-back calls get different seeds
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    srand((unsigned int)ts.tv_nsec);

    length = rand() % (max_len - min_len + 1) + min_len;
    char* word = calloc((size_t)(length+1), sizeof(char));

    for (i = 0; i < length; i++) {
        c = 'a' + rand() % 26;
        word[i] = c;
    }
    word[i] = 0;

    return word;
}

int main() {
    ht_hash_table* ht = ht_new();

    for (int i = 0; i < 5000; i++) {
        char *k = rand_word(2, 30);
        char *v = rand_word(10, 100);
        ht_insert(ht, k, v);
    }

    ht_del_hash_table(ht);

    return 0;
}

ElginCahangirov avatar Apr 21 '24 09:04 ElginCahangirov

Do you ever get out-of-bounds writes?

I see you get negative indices🤔

winterrdog avatar Apr 27 '24 06:04 winterrdog

@winterrdog yes, the program ends with "Segmentation fault" in some cases (not always though). Apparently the negative index sometimes lands in memory the process can still read (so it doesn't crash) and sometimes doesn't. I have no idea how this happens.

ElginCahangirov avatar Apr 30 '24 14:04 ElginCahangirov

> ht_get_hash returns the same slot subsequently up to very high attempts in some cases. […]

The reason is an edge case that was not handled for hash_b: when hash_b + 1 is a multiple of num_buckets (i.e. hash_b == num_buckets - 1), the attempt * (hash_b + 1) term vanishes modulo num_buckets, so every attempt probes the same slot. The negative indices are a separate consequence: once attempt * (hash_b + 1) exceeds INT_MAX, the signed multiplication overflows and the result can go negative, which should never happen at all. So I dealt with it in my code.

You could deal with it in two ways. Either always check that the calculated index is within the bounds of the table's items array before using it to access an element, something like this:

/* returns 1 if index can safely be used to access table->items */
static int is_valid_index(ht_hash_table* table, int index)
{
    return index >= 0 && index < table->size;
}

or

do a single check like the one I did in the ht_get_hash function in my code repo.
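For reference, here is a sketch of what such a guarded version can look like. This is not the repo's exact code: the prime values are the tutorial's, and ht_hash is rewritten with Horner's rule because the original `(long)pow(a, len_s - (i+1))` cast overflows `long` for keys longer than a handful of characters (pow(151, 29) is around 10^63), which can make ht_hash itself return negative values regardless of the probe logic. Since the driver above generates keys up to 30 characters, that is likely why changing the primes alone did not fix the crash.

```c
#include <string.h>

#define HT_PRIME_1 151
#define HT_PRIME_2 163

/* Polynomial string hash, reduced mod m at every step (Horner's rule)
 * so no intermediate value can overflow. */
static int ht_hash(const char* s, const int a, const int m) {
    long hash = 0;
    for (; *s != '\0'; s++) {
        hash = (hash * a + *s) % m;
    }
    return (int)hash;
}

static int ht_get_hash(const char* s, const int num_buckets, const int attempt) {
    const int hash_a = ht_hash(s, HT_PRIME_1, num_buckets);
    int hash_b = ht_hash(s, HT_PRIME_2, num_buckets);

    /* Edge case: if hash_b + 1 == num_buckets, the attempt term is a
     * multiple of num_buckets and every probe hits the same slot. */
    if (hash_b == num_buckets - 1) {
        hash_b = 0; /* degrade to a step of 1, i.e. linear probing */
    }

    /* Do the probe arithmetic in long so attempt * (hash_b + 1) cannot
     * overflow int for large attempt counts. */
    return (int)((hash_a + (long)attempt * (hash_b + 1)) % num_buckets);
}
```

With this version the index always falls in [0, num_buckets), so an is_valid_index check becomes a safety net rather than a requirement.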

winterrdog avatar Apr 30 '24 16:04 winterrdog