klib icon indicating copy to clipboard operation
klib copied to clipboard

Array keys

Open lukego opened this issue 9 years ago • 10 comments

Klib looks really interesting! I would love to use it in our Snabb Switch project. Snabb Switch is a simple and fast IP-networking toolkit based on LuaJIT.

Questions!

  • Is there a LuaJIT binding anywhere for khash and/or ktree?
  • Could these data structures support fixed-size array keys (with embedded zeros)?

I expect that we would want to have hundreds of instances of tables/trees, each with up to around a million items, and key types like int64, char[16], char[40] containing information that we have extracted from network packets (e.g. IPv6 addresses).

I have previously investigated Judy Arrays in 1250 LOC but I did not manage to match that to our needs.

Help to get started would be much appreciated!

lukego avatar Mar 10 '15 10:03 lukego

+1

I also was thinking about some sort of bindings for khash to LuaJIT, so that single hash table could be used from both C and Lua code.

2015-03-10 13:08 GMT+03:00 Luke Gorrie [email protected]:

Klib looks really interesting! I would love to use it in our Snabb Switch https://github.com/SnabbCo/snabbswitch/ project. Snabb Switch is a simple and fast IP-networking toolkit based on LuaJIT.

Questions!

  • Is there a LuaJIT binding anywhere for khash and/or ktree?
  • Could these data structures support fixed-size array keys (with embedded zeros)?

I expect that we would want to have hundreds of instances of tables/trees, each with up to around a million items, and key types like int64, char[16], char[40] containing information that we have extracted from network packets (e.g. IPv6 addresses).

I have previously investigated Judy Arrays in 1250 LOC https://code.google.com/p/judyarray/ but I did not manage to match that to our needs.

Help to get started would be much appreciated!

— Reply to this email directly or view it on GitHub https://github.com/attractivechaos/klib/issues/44.

Regards, Konstantin

Lupus avatar Mar 10 '15 11:03 Lupus

I have no idea about LuaJIT bindings, but you can use a fixed size array as a key with khash if you wrap it in a struct.

This example should work:

// create hash map that maps 16 byte key to char*
typedef struct { char d[16]; } MyKey;
static inline int my_key_hash(MyKey k) { return k.d[0]*3+k.d[1]*7; /* some hash function */ }
static inline int my_keys_eq(MyKey k1, MyKey k2) { return memcmp(k1.d, k2.d, sizeof(k1.d)); }
KHASH_INIT(MyHash, MyKey, char*, 1, my_key_hash, my_keys_eq)

Example Usage:

int ret, is_missing;
khiter_t k;
khash_t(MyHash) *h = kh_init(MyHash);

// some values
MyKey key1 = {.d = {0}}; // key of all zeros
char value1[] = "hello";

// Add
k = kh_put(MyHash, h, key1, &ret);
kh_value(h, k) = value1;

// Find
k = kh_get(MyHash, h, key1);
is_missing = (k == kh_end(h));
kh_del(MyHash, h, k);

kh_destroy(MyHash, h);

noporpoise avatar Mar 10 '15 11:03 noporpoise

Thanks for the feedback! This should get me started :-)

lukego avatar Mar 11 '15 08:03 lukego

@noporpoise This example does not work the way I expect. The code compiles and runs cleanly, but is_missing is true. It looks like the lookup fails. Is there a bug?

lukego avatar Mar 12 '15 15:03 lukego

Ah yes, my bad! Sorry. Can you spot the error:

static inline int my_keys_eq(MyKey k1, MyKey k2) { return memcmp(k1.d, k2.d, sizeof(k1.d)); }

should be:

static inline int my_keys_eq(MyKey k1, MyKey k2) { return !memcmp(k1.d, k2.d, sizeof(k1.d)); }

The equals functions should return non-zero if they match, zero otherwise. Mine was doing the opposite. Fixed by adding a !.

noporpoise avatar Mar 12 '15 17:03 noporpoise

Thanks!

I have also been tweaking around and I have been infected by C macro fever. (I thought I had an immunity but it seems not.)

How about this for generic code to define a bunch of kinds of hashtables, each with a constant-size array key?

static inline array_hash(const char *a, int n) {
  khint_t h = 0;
  while (n--) { h = (h << 5) - h + (khint_t)*a++; }
  return h;
}

#define ARRAY_TABLE_INIT(name, size, khval_t)                        \
  static inline hash_##name##_at(const char *a)  { array_hash(a, size); } \
  static inline equal_##name##_at(const char *a, const char *b) { return memcmp(a, b, 16) == 0; } \
  KHASH_INIT(name, const char *, khval_t, 1, hash_##name##_at, equal_##name##_at)

ARRAY_TABLE_INIT(array16, 16, char*)
ARRAY_TABLE_INIT(array32, 32, char*)
ARRAY_TABLE_INIT(array40, 40, char*)

lukego avatar Mar 12 '15 17:03 lukego

I am such a noob :-). I see that I am mixing up pointers and values here. Thanks again for the getting started tips. Looks like I should read the khash code properly.

lukego avatar Mar 13 '15 05:03 lukego

Round 3!

I have made a first pass through reading khash. I am impressed! I have made a new attempt at making a library front-end that will be easy to access from LuaJIT. (The actual LuaJIT binding is still to be done.) I would be glad to have some review if anybody is inspired to do so :-) I may well still fundamentally misunderstand khash.

// hashtable.c -- Hashtable based on khash
//
// khash is a small, simple, fast hashtable library for C.
// This is a library to make khash useful via the LuaJIT FFI.
// 
// khash makes extensive use of the C preprocessor. The macro
// KHASH_INIT expands into code for a unique hashtable with its own
// type of key, type of value (or none), hash function, and equality
// predicate.
//
// Notable aspects of this design:
//
// 1. Hashtable memory layout is specialized for the type. If you ask
//    for 6-byte keys or 1-byte values then this will translate into
//    an efficiently packed representation in memory.
//
// 2. Hashtable lookups expand into code specialized for the key
//    type. GCC has the opportunity to make optimizations like
//    constant-size loop unrolling when hashing and comparing.
//
// 3. Lookups and insertions expand into code with the appropriate C
//    type declarations for keys and values. You don't need to cast
//    back and forth between the real types and an opaque type like
//    void*.
//
// And here is what that means when we use this from LuaJIT:
//
// 1. Hashtable memory layout being memory-efficient is good.
//
// 2. Lookups being aggressively optimized by GCC is good too. (We are
//    calling the C lookup functions via FFI.)
//
// 3. Specialized C type declarations on lookup/insert is a
//    nuisance. On the Lua level we care about the exact types of keys
//    and values but on the C level we only care about their sizes. So
//    in our C interface we only talk about the size of the keys and
//    values, and we translate this into opaque types for khash.
//
// khash: http://attractivechaos.github.io/klib/

#include <assert.h>
#include <stdio.h>
#include "khash.h"

// Define a type for a fixed-size key.
#define ARRAY_KEY_TYPE(name, size)                                      \
  typedef struct { char k[size]; }__attribute__((packed)) name##_key_t;

// Define a type for a fixed-size value.
#define ARRAY_VALUE_TYPE(name, size)                                    \
  typedef struct { char v[size]; }__attribute__((packed)) name##_val_t;

// Define a hash function for a fixed-size key.
#define ARRAY_HASH_FUNC(name, size)                                     \
  static inline int hash_##name(name##_key_t a) {                       \
    khint_t h = 0;                                                      \
    int i;                                                              \
    for (i = 0; i < size; i++) {                                        \
      h = (h << 5) - h + (khint_t)a.k[i];                               \
    }                                                                   \
    return h;                                                           \
  }

// Define an equality test function for a fixed-size key.
#define ARRAY_EQUAL_FUNC(name, size) \
  static inline int equal_##name(name##_key_t a, name##_key_t b) {      \
    return memcmp(a.k, b.k, size) == 0;                                 \
  }

#define ARRAY_TABLE_INIT(name, keysize, valuesize)                      \
  ARRAY_KEY_TYPE(name, keysize)                                         \
  ARRAY_VALUE_TYPE(name, valuesize)                                     \
  ARRAY_HASH_FUNC(name, keysize)                                        \
  ARRAY_EQUAL_FUNC(name, keysize)                                       \
  KHASH_INIT(name, name##_key_t, name##_val_t, valuesize>0, hash_##name, equal_##name)

// Define a custom hashtable:
//   named char16
//   16-byte keys
//   pointer-size values
ARRAY_TABLE_INIT(char16, 16, sizeof(char*))

int main() {
  int ret, is_missing;
  khiter_t k;
  khash_t(char16) *h = kh_init(char16);

  // some values
  char16_key_t key1 = {.k={0}}; // key of all zeros
  char value1[] = "hello";
  char16_key_t key2 = {.k={1}}; // key of all ones
  char value2[] = "world";

  // Add
  k = kh_put(char16, h, key1, &ret);
  memcpy(&kh_value(h, k), value1, sizeof(value1));
  k = kh_put(char16, h, key2, &ret);
  memcpy(&kh_value(h, k), value2, sizeof(value2));

  // Find
  k = kh_get(char16, h, key1);
  printf("%s", (char*)&kh_value(h, k));
  is_missing = (k == kh_end(h));
  assert(!is_missing);
  k = kh_get(char16, h, key2);
  printf(" %s\n", (char*)&kh_value(h, k));
  is_missing = (k == kh_end(h));
  assert(!is_missing);
  kh_del(char16, h, k);

  kh_destroy(char16, h);
  return 0;
}

lukego avatar Mar 13 '15 07:03 lukego

I added lookup and insert operations as plain functions that should be callable from LuaJIT:

// Define a lookup function for LuaJIT to call.                                                                                   
// The lookup function returns null or a pointer to the value.                                                                    
#define ARRAY_LOOKUP_FUNC(name, size)                                   \
    void *lookup_##name(khash_t(name) *h, name##_key_t key) {           \
      khiter_t v = kh_get(name, h, key);                                  \
      return v == kh_end(h) ? NULL : &kh_value(h, v);                     \
  }

// Define an insert function for LuaJIT to call.                                                                                  
// The insert function returns a pointer to the value for initialization.                                                         
#define ARRAY_INSERT_FUNC(name, size) \
  void *insert_##name(khash_t(name) *h, name##_key_t key) {             \
    int ret;                                                            \
    return &kh_value(h, kh_put(name, h, key, &ret));                    \
  }

The usage is like this:

  // insert values
  memcpy(insert_char16(h, key1), value1, sizeof(value1));
  memcpy(insert_char16(h, key2), value2, sizeof(value1));

  // lookup values
  printf("%s", (char*)lookup_char16(h, key1));
  printf("%s", (char*)lookup_char16(h, key2));

Couple of reflections about using non-tiny keys (e.g. 40-byte structs):

  1. The lookup and insert operations are passing keys by value. This seems like gratuitous extra work compared with passing pointers. However, this is the interface of khash_get().
  2. Storing the keys inline in the keys structure may make resizing the table more expensive (?).

However, it is not clear that either issue represents a practical problem and so I choose not to worry about them for the moment.

lukego avatar Mar 13 '15 09:03 lukego

Guys. This is great. If you get the luajit FFI bindings. I stumbled across it by accident must have missed it on the luajit distribution list. I think it merits a mention on the FFI page for the luajit. I know my problems with luajit normally revolve around the memory limitations so the more pre-built easy ways there are to bypass the problem the easier my life becomes..

joeatbayes avatar May 07 '15 03:05 joeatbayes