
Symbol table

Open jube opened this issue 4 years ago • 13 comments

There was a TODO in wrenSymbolTableFind: "TODO: O(n). Do something better.". Here is something better, in O(log n), thanks to a binary search.

A constraint was to keep the current StringBuffer, as symbols are identified by an index into this buffer. So I added a second buffer in which the strings are indirectly ordered. During wrenSymbolTableFind, this second table is used for a binary search, provided there are enough elements for it to outperform a linear search (currently 15, which gives good performance, see below).
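The idea can be sketched roughly as follows. This is a simplified illustration with invented names and a flattened layout, not the actual structures from the patch: `entries` stands in for the existing StringBuffer (insertion order preserved, so indices stay stable), and `sorted` is the second buffer of indices arranged so the strings they point to are in lexicographic order.

```c
#include <string.h>

#define LINEAR_SEARCH_MAX 15  /* below this, a linear scan wins */

typedef struct {
  const char** entries;  /* symbol strings, in insertion order */
  int* sorted;           /* indices into entries, ordered by string */
  int count;
} SymbolTable;

static int symbolTableFind(const SymbolTable* table, const char* name)
{
  if (table->count < LINEAR_SEARCH_MAX)
  {
    /* Small table: plain linear scan over the original buffer. */
    for (int i = 0; i < table->count; i++)
    {
      if (strcmp(table->entries[i], name) == 0) return i;
    }
    return -1;
  }

  /* Large table: binary search over the indirectly ordered buffer. */
  int lo = 0;
  int hi = table->count - 1;
  while (lo <= hi)
  {
    int mid = lo + (hi - lo) / 2;
    int cmp = strcmp(table->entries[table->sorted[mid]], name);
    if (cmp == 0) return table->sorted[mid];  /* the original index */
    if (cmp < 0) lo = mid + 1; else hi = mid - 1;
  }
  return -1;
}
```

The key point is the indirection: the search walks `sorted`, but the value returned is the stable index into the original buffer, so existing callers are unaffected.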

Some figures:

api_call - wren                .......... 0.05s 0.0022 103.28% relative to baseline
api_foreign_method - wren      .......... 0.25s 0.0034 101.57% relative to baseline
binary_trees - wren            .......... 0.18s 0.0137  97.74% relative to baseline
binary_trees_gc - wren         .......... 0.83s 0.0105 111.33% relative to baseline
delta_blue - wren              .......... 0.11s 0.0085 102.33% relative to baseline
fib - wren                     .......... 0.18s 0.0050 101.50% relative to baseline
fibers - wren                  .......... 0.03s 0.0025 101.82% relative to baseline
for - wren                     .......... 0.06s 0.0010 101.75% relative to baseline
method_call - wren             .......... 0.10s 0.0007 100.28% relative to baseline
map_numeric - wren             .......... 1.03s 0.0116 100.73% relative to baseline
map_string - wren              .......... 0.12s 0.0021 108.19% relative to baseline
string_equals - wren           .......... 0.15s 0.0237 100.77% relative to baseline

This new function performs well on binary_trees_gc because System.gc is one of the last methods defined in wren_core.c, so the symbol gc is near the end of the module's symbol table (175 out of 177 symbols). The linear search has to go through nearly all the symbols, whereas the binary search finds it in roughly 7 or 8 comparisons (⌈log2 177⌉ = 8).

jube avatar May 21 '21 13:05 jube

because System.gc is one of the last methods defined in wren_core.c, so the symbol gc is at the end of the symbol table of the module (175 out of 177 symbols).

If for some reason this patch isn't viable, perhaps there is merit in re-ordering some of these symbols based on frequency of usage? At a minimum, it seems moving System.gc to the top of the table might have positive implications for memory-heavy workloads...


Yet: is this symbol table only consulted once, to find symbols the very first time (after which their index is used), or is it constantly consulted at runtime? I thought I read we purposely don't do the latter because it would be quite slow? I would assume that the first time the compiler sees a symbol it's converted into a direct reference once, and done.
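The "resolve once" pattern this question describes looks roughly like the sketch below. All names here are invented for illustration and are not Wren's actual API: the compiler interns a symbol once (adding it if missing) and bakes the resulting integer into the bytecode; runtime dispatch then uses only that integer, never the string.

```c
#include <string.h>

#define MAX_SYMBOLS 256

typedef struct {
  const char* names[MAX_SYMBOLS];
  int count;
} Symbols;

/* Compile time: called once per identifier the compiler encounters.
 * Returns an existing index, or appends the name and returns a new one. */
static int symbolsEnsure(Symbols* s, const char* name)
{
  for (int i = 0; i < s->count; i++)
  {
    if (strcmp(s->names[i], name) == 0) return i;  /* already interned */
  }
  s->names[s->count] = name;
  return s->count++;  /* new symbol gets the next index */
}

/* Runtime: dispatch by index only; no string comparison happens here. */
static const char* methodForSymbol(const char** methods, int symbol)
{
  return methods[symbol];
}
```

Under this pattern, a faster find only speeds up the interning step at compile time, which is exactly why the benchmark discussion below turns to compilation-focused tests.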

joshgoebel avatar May 21 '21 14:05 joshgoebel

Another strategy would be to reuse ObjMap, making a bidirectional hash map between String and index.
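The string-to-index direction of that alternative could look like this minimal open-addressing table. This is not Wren's ObjMap; names and sizes are invented for illustration. The strings themselves stay in the existing buffer (only pointers are stored here), which is what keeps the extra memory cost to one pointer and one int per slot.

```c
#include <stdint.h>
#include <string.h>

#define MAP_CAPACITY 512  /* power of two, larger than the symbol count */

typedef struct {
  const char* keys[MAP_CAPACITY];  /* NULL means an empty slot */
  int values[MAP_CAPACITY];
} SymbolMap;

/* FNV-1a: a simple, well-known string hash. */
static uint32_t fnv1a(const char* s)
{
  uint32_t hash = 2166136261u;
  for (; *s != '\0'; s++)
  {
    hash ^= (uint8_t)*s;
    hash *= 16777619u;
  }
  return hash;
}

/* Assumes the key is not already present and the table never fills up. */
static void mapInsert(SymbolMap* map, const char* key, int value)
{
  uint32_t i = fnv1a(key) & (MAP_CAPACITY - 1);
  while (map->keys[i] != NULL) i = (i + 1) & (MAP_CAPACITY - 1);
  map->keys[i] = key;
  map->values[i] = value;
}

static int mapFind(const SymbolMap* map, const char* key)
{
  uint32_t i = fnv1a(key) & (MAP_CAPACITY - 1);
  while (map->keys[i] != NULL)
  {
    if (strcmp(map->keys[i], key) == 0) return map->values[i];
    i = (i + 1) & (MAP_CAPACITY - 1);
  }
  return -1;  /* not found */
}
```

The index-to-string direction is already covered by the existing buffer, so only this one extra table is needed for O(1) average-case lookup.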

@joshgoebel I agree that we probably need a way to gather some data about frequency of usage. I can think of a few solutions to the ordering problem:

  • Create a private top-level class (not ideal because of its spread)
  • Inject symbols after the creation of the SymbolTable, in a dedicated area of the C code (not ideal because it needs C maintenance)
  • Inject symbols after the creation of the SymbolTable, from a config parameter.

mhermier avatar May 21 '21 14:05 mhermier

@joshgoebel System.gc is hardly ever called directly so this one is not really a problem.

I don't think reordering the table would be a good idea, because it highly depends on the context (e.g. a math library will use comparisons and operations often, whereas a JSON library will never use them), and there is a single table in the VM for all the method names (not only core but every module that could be loaded). The solution is to improve the implementation of the symbol table, which is what I did.

jube avatar May 21 '21 16:05 jube

@jube For comparison's sake, I'll push an alternative pull request based on Map so we can compare numbers.

mhermier avatar May 21 '21 16:05 mhermier

see also https://github.com/wren-lang/wren/pull/782

ruby0x1 avatar May 21 '21 17:05 ruby0x1

@ruby0x1 seems you have better memory than me ^^

mhermier avatar May 21 '21 17:05 mhermier

Yesterday, I re-extracted my patch set for this, and got the following results:

api_call - wren                .......... 0.10s 0.0023  98.77% relative to baseline
api_foreign_method - wren      .......... 0.43s 0.0047  94.44% relative to baseline
binary_trees - wren            .......... 0.30s 0.0038  98.86% relative to baseline
binary_trees_gc - wren         .......... 1.22s 0.0093 100.84% relative to baseline
delta_blue - wren              .......... 0.25s 0.0059 106.55% relative to baseline
fib - wren                     .......... 0.38s 0.0041 104.35% relative to baseline
fibers - wren                  .......... 0.05s 0.0009  99.41% relative to baseline
for - wren                     .......... 0.16s 0.0011  99.23% relative to baseline
method_call - wren             .......... 0.20s 0.0028 104.76% relative to baseline
map_numeric - wren             .......... 1.27s 0.0085 108.42% relative to baseline
map_string - wren              .......... 0.14s 0.0017 102.51% relative to baseline
string_equals - wren           .......... 0.27s 0.0009  98.45% relative to baseline

It took me a while to digest the numbers from both implementations. I think the numbers show marginal gains due to caching effects, and don't really expose the benefits of the change. Thinking about it, it is obvious that the current benchmarks target runtime execution, while this change targets compilation time, so we need dedicated benchmarks. Rereading #782, this was what we concluded. At the time I came up with this:

import "meta" for Meta

var str = ""
for (i in 0..640) {
    for (j in 0..99) {
        str = str + "var val_%(i * 100 + j) = %(i * 100 + j)\n"
    }
}
System.gc()

var start = System.clock

Meta.eval(str)

System.print("true")
System.print("elapsed: %(System.clock - start)")

It is a variation on the original poster's method. It sets up a giant string of unique variable names and benchmarks only the eval part, to isolate the effect on the compilation of names. If we agree this is a good benchmark, I'll make a pull request of it so we can compare implementations properly.

As a side note, I would indirectly benefit from this in the mirror API when looking up methods by name. So while I agree that compiler speed-up benefits could be considered marginal compared to runtime, there is at least a potential future runtime speedup implied by this change.

mhermier avatar May 22 '21 07:05 mhermier

@joshgoebel I don't know if you are the correct person to ask, but maybe you can answer. I was thinking about the test-suite tool that you showed me the other day, and wondered:

  • Does it have a dry run mode? (loading all the tests but not running any)
  • What is the largest test suite that you have on hand?

I would like to see how it affects compile time for a realistically large project, if it's measurable enough... That would give better numbers to help confirm the relevance of my test.

mhermier avatar May 22 '21 10:05 mhermier

Does it have a dry run mode? (loading all the tests but not running any)

Sure, just call new but not test... I'll hopefully be pushing a lot of newer code to it later today; the version on GitHub is quite old. You won't get the reporter output then, but you could add that?

https://github.com/joshgoebel/wren-testie

You just run the individual test script, there is no "test runner" script...

What is the largest test suite that you have on hand?

Not sure I have any large projects... The Wren Exercism test suite is the largest... It takes about 3 seconds to run the suite of about ~30 different exercises.

joshgoebel avatar May 22 '21 12:05 joshgoebel

Hmmm, I wonder if that is enough. Could you give some numbers, in terms of a wc of what is expected to be loaded? Anyway, the dry run speed should be interesting: last time I ran the test, it went from 9.88s to 0.05s according to the other PR, so there should already be a visible effect. I'll try to bring the test back in my branch to see if the numbers are still relevant. In any case, the current PR already shows improvements on real code.

mhermier avatar May 22 '21 13:05 mhermier

Seems that the change is still quite relevant, with a nice 14017% improvement on the test XD:

api_call - wren                .......... 0.10s 0.0011  99.98% relative to baseline
api_foreign_method - wren      .......... 0.42s 0.0156  96.02% relative to baseline
binary_trees - wren            .......... 0.31s 0.0043  96.27% relative to baseline
binary_trees_gc - wren         .......... 1.23s 0.0041  99.78% relative to baseline
compiler_many_var - wren       .......... 0.08s 0.0022 14017.57% relative to baseline
delta_blue - wren              .......... 0.26s 0.0047 100.97% relative to baseline
fib - wren                     .......... 0.37s 0.0042 105.08% relative to baseline
fibers - wren                  .......... 0.05s 0.0008 100.94% relative to baseline
for - wren                     .......... 0.16s 0.0009  99.27% relative to baseline
method_call - wren             .......... 0.20s 0.0055 104.31% relative to baseline
map_numeric - wren             .......... 1.27s 0.0044  99.90% relative to baseline
map_string - wren              .......... 0.13s 0.0016 100.88% relative to baseline
string_equals - wren           .......... 0.26s 0.0012 100.59% relative to baseline

mhermier avatar May 22 '21 16:05 mhermier

What's the impact on memory usage? Compile time is not very important (and the mirror API is rarely used, and often its results are cached, too), but the symbol table is used all across the compiler, and I'm afraid adding a second buffer will affect memory usage badly.

ChayimFriedman2 avatar May 22 '21 20:05 ChayimFriedman2

My solution adds an ObjMap sized to hold the String-to-symbol-index mappings, where each String is shared with the regular array, so it is relatively cheap. Compile time can be important for one-shot tools, and for recovery time, which is why I'm looking for data from realistically sized projects.

mhermier avatar May 22 '21 20:05 mhermier