Bytecode VM: Add sections

Open herwinw opened this issue 11 months ago • 1 comments

This will probably require rewriting the prototypes of the instructions' serialize and deserialize methods, so please don't add any new bytecode instructions until this is merged.

herwinw • Mar 30 '24 16:03

```ruby
puts 'hello world!'
puts 'hello world!'
puts 'hello world!'
```

After implementing the deduplication for strings, I tried this simple and repetitive program. Without sections, the generated bytecode looks like this:

```
00000000: 4e61 7458 0000 4849 0c68 656c 6c6f 2077  NatX..HI.hello w
00000010: 6f72 6c64 2101 3a01 4f04 7075 7473 0137  orld!.:.O.puts.7
00000020: 4849 0c68 656c 6c6f 2077 6f72 6c64 2101  HI.hello world!.
00000030: 3a01 4f04 7075 7473 0137 4849 0c68 656c  :.O.puts.7HI.hel
00000040: 6c6f 2077 6f72 6c64 2101 3a01 4f04 7075  lo world!.:.O.pu
00000050: 7473 01                                  ts.
```

With sections and deduplication of strings, it looks like this:

```
00000000: 4e61 7458 0000 0202 0000 000c 0100 0000  NatX............
00000010: 1d00 0000 0d0c 6865 6c6c 6f20 776f 726c  ......hello worl
00000020: 6421 0000 0029 4849 0001 3a01 4f04 7075  d!...)HI..:.O.pu
00000030: 7473 0137 4849 0001 3a01 4f04 7075 7473  ts.7HI..:.O.puts
00000040: 0137 4849 0001 3a01 4f04 7075 7473 01    .7HI..:.O.puts.
```

So even with all the overhead of sections, the resulting bytecode is smaller: 79 bytes instead of 83.

A minor caveat here is that nobody writes code like that, but it still feels like progress :laughing:

UPDATE: After deduplication of the method calls:

```
00000000: 4e61 7458 0000 0202 0000 000c 0100 0000  NatX............
00000010: 2200 0000 120c 6865 6c6c 6f20 776f 726c  ".....hello worl
00000020: 6421 0470 7574 7300 0000 1d48 4900 013a  d!.puts....HI..:
00000030: 014f 0d01 3748 4900 013a 014f 0d01 3748  .O..7HI..:.O..7H
00000040: 4900 013a 014f 0d01                      I..:.O..
```

herwinw • Mar 31 '24 11:03

That should do the trick. The rodata section now contains strings and everything symbol-like (such as the names of methods and variables). All the deduplication logic is shared and keyed on the string representation, so code like `puts :puts; puts "puts"` stores only one `puts` in the rodata section, even though we use it for three different purposes. Data retrieval has a cache for all symbols: the second time we fetch a symbol, we simply return it from the cache instead of reading it and converting it to a symbol again.
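As a rough illustration of the deduplication and the symbol cache described above, here is a minimal Ruby sketch. The class and method names (`RodataBuilder`, `RodataReader`, `add`, `symbol_at`) are hypothetical and not taken from Natalie's source, and the one-byte length prefix is an assumption based only on what the hexdumps above appear to show.

```ruby
# Hypothetical sketch: names are illustrative, not Natalie's actual API.
class RodataBuilder
  def initialize
    @data = String.new(encoding: Encoding::BINARY)
    @offsets = {} # string contents => byte offset inside @data
  end

  # Returns the byte offset of `value`, appending it only on first use.
  # Strings and symbols share one table because both are keyed on their
  # string representation.
  def add(value)
    key = value.to_s.b
    @offsets[key] ||= begin
      # Assumption: entries are stored as a 1-byte length plus the bytes.
      raise ArgumentError, 'entry too long for a 1-byte length' if key.bytesize > 255
      offset = @data.bytesize
      @data << [key.bytesize].pack('C') << key
      offset
    end
  end

  def bytes
    @data.dup
  end
end

class RodataReader
  def initialize(bytes)
    @bytes = bytes
    @symbols = {} # byte offset => already converted Symbol
  end

  def string_at(offset)
    length = @bytes.getbyte(offset)
    @bytes.byteslice(offset + 1, length)
  end

  # The second lookup of the same offset is served from the cache.
  def symbol_at(offset)
    @symbols[offset] ||= string_at(offset).to_sym
  end
end
```

In this sketch, adding `'hello world!'` and then `'puts'` yields offsets 0 and 13 (0x0d), which looks consistent with the `4f 0d` bytes in the last dump, although that reading of the dump is itself an assumption.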

Currently we need the rodata section to appear before the code section, even though the sections header implies that they can be swapped. This has to do with how the instructions are read from the bytecode. I did some testing with `bin/natalie --compile-bytecode /dev/stdout $file | bin/natalie --bytecode /dev/stdin`, where you don't have random access on the input stream. The ordering only matters if you try to hex-edit your own bytecode, so I'm going to ignore this issue for now.
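To show why the section order matters on a stream without random access, here is a small Ruby sketch of a loader that only reads forward. The container format is deliberately simplified and hypothetical (a 1-byte type, a 4-byte big-endian size, then the payload), and the "code" is just a list of 1-byte rodata offsets. None of this is Natalie's real layout or loader.

```ruby
require 'stringio'

# Hypothetical toy container, NOT Natalie's real layout.
RODATA = 2
CODE = 1

def pack_section(type, payload)
  [type, payload.bytesize].pack('CN') + payload.b
end

# Reads sections front to back using only IO#read, as a pipe would require.
# Rodata references are resolved while the code section is being decoded,
# so the rodata payload must already have been read at that point.
def load_strings(io)
  rodata = nil
  result = []
  until io.eof?
    type, size = io.read(5).unpack('CN')
    payload = io.read(size)
    case type
    when RODATA
      rodata = payload
    when CODE
      raise 'code section seen before rodata' if rodata.nil?
      payload.each_byte do |offset|
        length = rodata.getbyte(offset)
        result << rodata.byteslice(offset + 1, length)
      end
    end
  end
  result
end
```

With rodata first, the loader resolves every reference in a single pass. With the sections swapped, it would have to either seek ahead or defer decoding, and neither is possible in this simple forward-only design.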

herwinw • Mar 31 '24 17:03

Sweet!

seven1m • Mar 31 '24 22:03