Bytecode VM: Add sections
This probably requires changing the signatures of the instructions' serialize and deserialize methods, so please don't add any new bytecode instructions until this is merged.
puts 'hello world!'
puts 'hello world!'
puts 'hello world!'
After implementing the deduplication for strings, I tried this simple and repetitive program. Without sections, the generated bytecode looks like this:
00000000: 4e61 7458 0000 4849 0c68 656c 6c6f 2077 NatX..HI.hello w
00000010: 6f72 6c64 2101 3a01 4f04 7075 7473 0137 orld!.:.O.puts.7
00000020: 4849 0c68 656c 6c6f 2077 6f72 6c64 2101 HI.hello world!.
00000030: 3a01 4f04 7075 7473 0137 4849 0c68 656c :.O.puts.7HI.hel
00000040: 6c6f 2077 6f72 6c64 2101 3a01 4f04 7075 lo world!.:.O.pu
00000050: 7473 01 ts.
With sections and deduplication of strings, it looks like this:
00000000: 4e61 7458 0000 0202 0000 000c 0100 0000 NatX............
00000010: 1d00 0000 0d0c 6865 6c6c 6f20 776f 726c ......hello worl
00000020: 6421 0000 0029 4849 0001 3a01 4f04 7075 d!...)HI..:.O.pu
00000030: 7473 0137 4849 0001 3a01 4f04 7075 7473 ts.7HI..:.O.puts
00000040: 0137 4849 0001 3a01 4f04 7075 7473 01 .7HI..:.O.puts.
So even with all the overhead of sections, the resulting bytecode is smaller: 79 bytes instead of 83.
A minor caveat here is that nobody writes code like that, but it still feels like progress :laughing:
UPDATE: After deduplication of the method calls:
00000000: 4e61 7458 0000 0202 0000 000c 0100 0000 NatX............
00000010: 2200 0000 120c 6865 6c6c 6f20 776f 726c ".....hello worl
00000020: 6421 0470 7574 7300 0000 1d48 4900 013a d!.puts....HI..:
00000030: 014f 0d01 3748 4900 013a 014f 0d01 3748 .O..7HI..:.O..7H
00000040: 4900 013a 014f 0d01 I..:.O..
That should do the trick. The `rodata` section now contains strings and everything symbol-like (such as names of methods and variables). All the deduplication logic is shared and keyed on the string representation, so code like `puts :puts; puts "puts"` has only one `puts` entry in the `rodata` section, even though we use it for three different purposes.
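A minimal sketch of deduplication keyed on the string representation. The class name `RodataBuilder`, its methods, and the one-byte length prefix are assumptions made for illustration; they are not taken from the Natalie sources.

```ruby
# Hypothetical sketch: names and layout are illustrative, not Natalie's actual code.
class RodataBuilder
  attr_reader :bytes

  def initialize
    @bytes = String.new(encoding: Encoding::BINARY)
    @offsets = {} # string representation => offset into @bytes
  end

  # Returns the offset of the entry, appending it only the first time it is seen.
  # Strings, symbols and method names share one table because the key is `to_s`.
  def add(value)
    key = value.to_s
    raise ArgumentError, 'entry too long for a one-byte length prefix' if key.bytesize > 255

    @offsets[key] ||= begin
      offset = @bytes.bytesize
      @bytes << [key.bytesize].pack('C') << key.b # assumed layout: length byte, then data
      offset
    end
  end
end

rodata = RodataBuilder.new
rodata.add('hello world!') # first entry, offset 0
rodata.add(:puts)          # new entry, stored after the 13 bytes of the first one
rodata.add('puts')         # same string representation, so the same offset is reused
```

Under these assumptions `puts` lands at offset 13 and the section is 18 bytes long, which appears consistent with the `0d` operand and the `12` size field in the dump above.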
Data retrieval has a cache for all symbols: the second time we fetch a symbol, we return it from the cache instead of reading it and converting it to a symbol again.
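The cache can be sketched roughly as follows. `RodataReader`, `get_string`, `get_symbol` and the `conversions` counter are hypothetical names, and the one-byte length prefix is an assumption; the real reader may be organised differently.

```ruby
# Hypothetical sketch of a symbol cache on the reading side.
class RodataReader
  attr_reader :conversions

  def initialize(bytes)
    @bytes = bytes
    @symbols = {}    # rodata offset => Symbol
    @conversions = 0 # counts real string-to-symbol conversions (for illustration only)
  end

  # Reads a length-prefixed string starting at the given offset.
  def get_string(offset)
    length = @bytes.getbyte(offset)
    @bytes.byteslice(offset + 1, length).force_encoding(Encoding::UTF_8)
  end

  # Converts to a symbol only on the first request for an offset.
  def get_symbol(offset)
    @symbols[offset] ||= begin
      @conversions += 1
      get_string(offset).to_sym
    end
  end
end

bytes = [12].pack('C') + 'hello world!' + [4].pack('C') + 'puts'
reader = RodataReader.new(bytes)
reader.get_symbol(13) # reads and converts
reader.get_symbol(13) # served from the cache
```

Keying the cache on the offset rather than on the string avoids even the read of the length prefix and data on a repeated lookup.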
Currently we need the `rodata` section to appear before the `code` section, even though the sections header implies that they can be swapped. This has to do with how the instructions are read from the bytecode. I did some testing with `bin/natalie --compile-bytecode /dev/stdout $file | bin/natalie --bytecode /dev/stdin`, where you don't have random access on the input stream. This is only an issue if you try to hex-edit your own bytecode, so I'm going to ignore it for now.
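To illustrate why a forward-only stream fixes the order, here is a small sketch. It assumes, as the dumps above suggest, that each section starts with a four-byte big-endian size; the helper `read_section` and the two-byte toy instruction are hypothetical and only stand in for the real reader.

```ruby
require 'stringio'

# Hypothetical helper: reads one size-prefixed section using sequential reads only.
def read_section(io)
  size = io.read(4).unpack1('N') # four-byte big-endian size (assumed layout)
  io.read(size)
end

# Stand-in for a pipe such as /dev/stdin: we never call seek or rewind on it.
rodata_bytes = [12].pack('C') + 'hello world!'
code_bytes = [0x49, 0x00].pack('C*') # toy instruction: opcode byte, then a rodata offset
stream = StringIO.new(
  [rodata_bytes.bytesize].pack('N') + rodata_bytes +
  [code_bytes.bytesize].pack('N') + code_bytes
)

rodata = read_section(stream) # has to be consumed first
code = read_section(stream)   # its operands can now be resolved against rodata

offset = code.getbyte(1)
length = rodata.getbyte(offset)
puts rodata.byteslice(offset + 1, length) # prints "hello world!"
```

If the `code` section came first, the reader would have to either buffer the undecoded instructions or jump ahead to `rodata`, and jumping is not possible on a pipe.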
Sweet!