Benchmarks: improve gc_lua.lua
We're benchmarking the GC overhead here, but having `{val=0}` rather than `{}` triggers the hashing of `"val"` and doubles the work of the GC, which must scan each `ixx.val`.
On my machine, this halves the Lua time (from 21s to 11s), and gets LuaJIT down from 1.63s to ~1s.
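For reference, the change amounts to something like this inside the benchmark's allocation loop (a sketch; the actual variable names and iteration counts in gc_lua.lua may differ):

```lua
for i = 1, 1000000 do
  -- before: local x = { val = 0 }  -- hashes "val" and gives the GC
  --                                -- a table entry to scan
  local x = {}  -- still a heap allocation the GC must track,
                -- but with nothing inside to mark
end
```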
The same logic probably applies to `gc_javascript.js` as well.
As a side note, Lua and LuaJIT use the same GC but the results differ by a large amount, which means that you're still not measuring what you think you're measuring. The LuaJIT compiler may be smart enough to skip the allocations.
Please don't merge this at once, there may be more tweaks coming.
Hi Pygy,
Thanks for helping with the benchmarking.
The other language files also have a single `int` field in the object created in them for the same reason - we actually want the GC to scan all of the heap allocations. So I'm not sure removing this from Lua or JS is fair in this case.
That said - you are correct that this benchmark is not very effectively testing the GC. For Python or Ruby it seems most of the overhead comes from the actual allocation and creation of the objects rather than the GC.
So if you have an idea for a better design for the GC test, I'd still be very interested to hear it.
Thanks,
- Dan
Tables in Lua are always heap allocated, but having a field in the object means that the GC must look inside them too, only to realize that they contain number values it can ignore. It is as if you were creating a Cello object that holds a `var` that ends up pointing to a non-managed object, even though it could point to a managed one, thereby doubling the work of the mark phase.
What's more, you're using the hash part of the table to do so. As an optimization, Lua tables use an array to store the values of contiguous integer keys starting at 1. The other keys, numeric or not, are put in a hash table, which is slower to access.
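A quick illustration of the split (the exact array/hash boundary is an implementation detail of the interpreter, so treat this as typical behavior rather than a guarantee):

```lua
local t = {}
t[1] = 10     -- contiguous integer keys from 1: stored in the array part
t[2] = 20     -- still the array part
t.val = 30    -- string key: "val" is hashed, entry goes in the hash part
t[100] = 40   -- non-contiguous integer key: also the hash part
```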
Okay, perhaps it is better to change the other languages to require some level of indirection, rather than allowing Lua to avoid allocating the integer data. I.e. instead of allocating boxed integers we can allocate some structs/objects with a few string or integer fields (or something like that; see the sketch below). But I also thought that Lua integers weren't even boxed, so I would expect them to live inside the hash table data structure, in which case this wouldn't require an extra level of indirection.
The most important thing is that the semantics of the program remain the same between languages, and in this case the semantics are to allocate an object with some integer data inside, so we really can't avoid doing that for Lua. The fact that languages like Lua, Python, Ruby, or JS might have to use an extra level of indirection for this requirement really has nothing to do with my test - it is a limitation/problem of their own designs/implementations.
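In Lua, that proposal might look something like the sketch below (the record shape and field names are hypothetical, not taken from the actual benchmark):

```lua
-- Each call allocates a table the GC must scan, holding a mix of
-- string and integer fields rather than a lone boxed int.
local function make_item(i)
  return { id = i, name = "item", weight = 0 }
end
```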
> The most important thing is that the semantics of the program remain the same between languages
The intent of the benchmark, as stated here, is "to measure a language's Garbage Collection performance".
Yet, for interpreted languages, you're mostly measuring the VM overhead, and in two cases (Lua and JS, since for Python and Ruby you use empty objects) you burden them with additional hashing, allocation and GC/mark work. Lua numbers are tagged values, but the GC must iterate over the keys and values of every table to determine whether each one references a collectible object (in this case, it marks the key, a string, and ignores the number).
From a memory standpoint, an empty Lua table is already heavier than a heap-allocated int. All of these fields must also be initialized, which means that, on top of the struct itself, the interpreter must allocate and initialize two arrays (one for the hash part, one for the array part of the table).
In stock Lua, the closest thing to a boxed, heap-allocated `int` would be a light userdata (a boxed pointer), created either with `newproxy` (in Lua 5.1) or with a small C extension (for Lua 5.2+).
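A minimal Lua 5.1 sketch (on 5.2+ `newproxy` is gone, so the C-extension route mentioned above would be needed instead):

```lua
-- newproxy() returns a fresh zero-size userdata: a heap-allocated,
-- GC-managed value that can stand in for a boxed int in the benchmark.
local boxes = {}
for i = 1, 1000 do
  boxes[i] = newproxy()
end
```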
LuaJIT can allocate boxed ints thanks to its FFI, using `local Int = ffi.typeof"struct{ int i; }"`. If you use a bare `int` instead of a `struct`, the compiler optimizes away the allocation, even though the LuaJIT interpreter puts it on the heap.
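Put together, following the snippet above (LuaJIT only):

```lua
local ffi = require("ffi")
local Int = ffi.typeof("struct{ int i; }")

local objs = {}
for i = 1, 1000 do
  objs[i] = Int(i)  -- boxed cdata on the heap, managed by the GC;
                    -- a bare ffi int here could be sunk by the compiler
end
```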
Note that you can turn off the GC in Lua, but doing so increases the run time with the stock Lua interpreter (maybe it's paging out? or more cache misses?), unless I reduce the main loop count to 1000. Even in that case, the difference is minimal.
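For reference, stopping and restarting the collector is just (standard Lua API):

```lua
collectgarbage("stop")     -- disable automatic collection
-- ... run the allocation loop ...
collectgarbage("restart")  -- re-enable it afterwards
```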
With LuaJIT and `struct`s, there's basically no difference with or without GC... perhaps 20 or 30 ms, but it may be noise.
Still, at allocation time, the new values must be inserted in the GC linked list whether or not the GC is running, and thus stopping the GC won't tell the whole story.
Lastly, I've adapted the benchmark script to run on my Mac (no GNU flags for `time` and `echo`, no Java out of the box, `nodejs` is `node`), but I get error messages like this one, even though the Cello test suite passes:
```
* Cello:
!!
!! Uncaught ValueError
!!
!! Pointer '0x7fff55de9898' passed to 'type_of' has bad magic number, perhaps it wasn't allocated by Cello.
!!
```
The pointers are all of this form: `0x7fffxxxxx8x8`.
Any idea?
Are you running `make bench` to run the benchmarks? The Cello binary needs to be compiled without debug information, using the `-DCELLO_NDEBUG` flag.