Cello icon indicating copy to clipboard operation
Cello copied to clipboard

Bug in Table (hash table) implementation (and potential GC bug)

Open srdjanstipic opened this issue 1 year ago • 2 comments

I was playing with libCello (git commit: dfcd86c9c2d32e09d86effd6cfd7c9b97a0291a0) and I found a strange behaviour in Table implementation. When the table resizes, some values are modified/lost (check the lines with // BUG).

Also, it looks like GC is taking forever (O(n^2) or more) as the PROBLEM SIZE increases.

The code to reproduce the issue is here:

#include <assert.h>
#include "Cello.h"

// count word frequency
int main(int argc, char **argv) {
  (void)argc, (void)argv;
  // wget https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
  FILE *fp =  fopen("./t8.shakespeare.txt", "r");
  char *line = NULL, *sep = " \f\n\r\t\v";
  var d = new (Table, String, Int), the = $S("the");
  int i = 0, the_count = 0;
  for (size_t len = 0; getline(&line, &len, fp) != EOF;) {
    for (char *tmp = line, *word; (word = strsep(&tmp, sep));) {
      if (i % 10000 == 0) {
        printf("%d\n", i);
      }
      if (i++ > 300000) { // ******* PROBLEM SIZE *********
        goto end;
      }
      the_count += strcmp(*(char**)the, word) == 0;
      if (!mem(d, $S(word))) {    // create if missing
        set(d, $S(word), $I(0));
      }
      *(long *)get(d, $S(word)) += 1; // OK, time 0.127s
      // set(d, $S(word), $I(c_int(get(d, $S(word))) + 1)); // BUG, time 0.261s
      // set(d, $S(word), new(Int, $I(c_int(get(d, $S(word))) + 1))); // BUG, SLOW, time 32.854s
    }
  }
end:
  if (line) {
    free(line);
  }
  if (fp) {
    fclose(fp);
  }
  println("%$ %$\n", $I(the_count), get(d, the));
  assert(the_count == c_int(get(d, the))); // FAIL for BUGs
  return 0;
}

srdjanstipic avatar Oct 02 '24 12:10 srdjanstipic

@orangeduck There appears to be a bug in Table.c because for this simplified test case I get unexpected results:

#include "Cello.h"
int main(int argc, char *argv[]) {

  var d = new (Table, String, Int);
  set(d, $S("a"), $I(1));
  set(d, $S("b"), $I(1));
  set(d, $S("c"), $I(1));
  set(d, $S("d"), $I(1));
  set(d, $S("a"), $I(2));
  set(d, $S("b"), $I(2));
  set(d, $S("c"), $I(2));
  set(d, $S("d"), $I(2));

  foreach (w in d) {
    println("%$ %$", w, get(d, w));
  }
  return 0;
}

output shows duplicated keys "b" and "d":

"b" 1
"d" 1
"c" 2
"d" 2
"b" 2
"a" 2

however, changing to use assign for the existing keys that were added in first pass works:

#include "Cello.h"
int main(int argc, char *argv[]) {

  var d = new (Table, String, Int);
  set(d, $S("a"), $I(1));
  set(d, $S("b"), $I(1));
  set(d, $S("c"), $I(1));
  set(d, $S("d"), $I(1));
  assign(get(d, $S("a")), $I(2));
  assign(get(d, $S("b")), $I(2));
  assign(get(d, $S("c")), $I(2));
  assign(get(d, $S("d")), $I(2));

  foreach (w in d) {
    println("%$ %$", w, get(d, w));
  }
  return 0;
}

output is correct:

"b" 2
"c" 2
"a" 2
"d" 2

This change seems to work but is probably covering up a deeper issue.

 static void Table_Set(var self, var key, var val) {
+  if (mem(self, key)) {
+    assign(get(self, key), val);
+    return;
+  }
   Table_Set_Move(self, key, val, false);
   Table_Resize_More(self);
 }

awmorgan avatar Dec 07 '24 15:12 awmorgan

Yeah looks like there is a bug somewhere...

I wrote this code almost 10 years ago so unfortunately I have no real memory of how it works or what might be wrong. All I really remember is that it is using robin hood hashing, which is also used by Dictionary and the Garbage Collector.

Given it seems this bug is specific to Table (and does not appear in Dictionary or the GC as far as I know) it is probably somehow related to the fact that Table is meant to be like an owning container - so it might not be moving the contents properly during "rehashing" or "set".

orangeduck avatar Dec 07 '24 21:12 orangeduck