bert
Make the BERT protocol encoding aware
This PR makes the BERT protocol aware of Ruby encodings. I have updated the Encoder and the Decoder as follows:
Decoder
The decoder now looks for a magic header and changes parsing styles based on the header it finds. If the header is MAGIC, then it knows that the encoder didn't add any encoding information and uses the old decoder. If the header is VERSION2, then it knows that the encoder added string encoding information and it adds the encoding information to the string.
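The dispatch on the header can be sketched like this. The constant values and the `decode_v1`/`decode_v2` helpers are illustrative stand-ins, not the gem's actual internals (only `MAGIC = 131` is the standard Erlang term-format version byte):

```ruby
MAGIC    = 131 # standard Erlang external term format version byte
VERSION2 = 132 # assumed value for the encoding-aware header

def decode_v1(bytes)  # placeholder for the legacy parser
  :no_encoding_info
end

def decode_v2(bytes)  # placeholder for the encoding-aware parser
  :with_encoding_info
end

def decode(bytes)
  case bytes.getbyte(0)
  when MAGIC    then decode_v1(bytes) # old wire format, bare strings
  when VERSION2 then decode_v2(bytes) # strings carry encoding tags
  else raise "unknown BERT header: #{bytes.getbyte(0)}"
  end
end
```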
Encoder
The encoder is split into version 1 and version 2 encoding. It defaults to version 1; version 2 is enabled with a feature flag. I did this so that we can deploy the updated bert gem to both the client and the server, then flip the feature flag on for the client, then flip it on for the server. That way we can upgrade without downtime.
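A sketch of how such a flag-gated split might look. The module and accessor names mirror the IRB session below; the method bodies are placeholders, not the gem's real encoder:

```ruby
module Encode
  VERSIONS = [:v1, :v2].freeze

  class << self
    attr_reader :version

    def version=(v)
      raise ArgumentError, "unknown version #{v.inspect}" unless VERSIONS.include?(v)
      @version = v
    end
  end

  @version = :v1 # default to the old format so old peers keep working
end

def encode(term)
  # Placeholder bodies; the real versions emit BERT byte streams.
  Encode.version == :v2 ? [:v2, term] : [:v1, term]
end
```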
Here is an IRB session to demonstrate the encoding support:
[aaron@TC bert (encoding)]$ irb -I lib:ext
irb(main):001:0> require 'bert'
=> true
irb(main):002:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語")).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):003:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語"))
=> "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"
irb(main):004:0> BERT::Encode.version = :v2
=> :v2
irb(main):005:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語")).encoding
=> #<Encoding:UTF-8>
irb(main):006:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語"))
=> "日本語"
irb(main):007:0>
/cc @brianmario @tmm1 @charliesome @piki
This is a good direction. Any measurements for how much of a performance hit we take with any sort of representative workload? This change essentially doubles the network, RAM, and CPU cost of short strings. How much difference does it make in a real RPC scenario?
If performance is at all an issue, we could indicate common encodings (I expect to see a lot of UTF-8) with something cheaper than a second String.
> If performance is at all an issue, we could indicate common encodings (I expect to see a lot of UTF-8) with something cheaper than a second String.
In mochilo we used gperf to create a perfect hash lookup table mapping Ruby encoding names to integer values. I know the actual encoding id from Ruby itself isn't deterministic (it depends on load order or something), but maybe we could do something similar to mochilo once, and as Ruby adds new encodings we can add them to the end of the list as new ids?
That way we could use a single byte to represent the encoding and map it to/from the encoding name on the server and client at [de]serialization time? That is, if performance is a big issue here...
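A sketch of that idea, assuming a hand-maintained, append-only table that both peers ship (the ids below are made up, not mochilo's actual table):

```ruby
# Hypothetical append-only table; client and server must agree on it.
ENCODING_IDS = {
  "UTF-8"      => 0,
  "US-ASCII"   => 1,
  "ASCII-8BIT" => 2,
  "Shift_JIS"  => 3,
  # new Ruby encodings get appended with the next free id
}.freeze
ID_TO_NAME = ENCODING_IDS.invert.freeze

def encoding_id(str)
  ENCODING_IDS.fetch(str.encoding.name) # a single byte on the wire
end

def encoding_for(id)
  Encoding.find(ID_TO_NAME.fetch(id)) # reverse lookup at decode time
end
```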
It looks like finding an encoding by name is just a hash lookup (given the encoding has been loaded once already), so I don't think we need to worry about the speed of finding the encoding object.
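For instance, `Encoding.find` hands back the interned Encoding object, so tagging a decoded string is just that lookup plus a `force_encoding` (which retags the bytes without copying):

```ruby
enc = Encoding.find("UTF-8")
enc.equal?(Encoding::UTF_8)  # => true: the same interned object every time

bytes = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E".b
str = bytes.force_encoding(enc) # no copy; just retags the string
str == "日本語"                  # => true
```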
I agree that representing the encoding as an integer would reduce the payload size, but I don't know if it matters. Using MIME would require even more network chatter than this. If we could guarantee that all strings in one payload were of the same type, then we could really reduce data. But since strings in Ruby are allowed to have different encodings, I don't know if that's something we can / want to enforce.
It's the number and collective size of allocations that concerns me. Grabbing a list of all the refs in a repository, or all the refs on a particular branch, could be hundreds of thousands of small strings. This change just doubles the serialization, deserialization, and allocation costs. An int would definitely be cheaper.
@piki do you know if payloads usually have homogeneous encodings? If strings are usually UTF-8, maybe we could only add extra encoding information when they're not, so we'd only incur overhead for "special" strings.
I would expect 90% of the cases to be UTF-8 strings, since that's the encoding we use (or assume) when we extract Git data via rugged, or raw binary like the output of spawning a child process.
Having a way to specify both of these via a single int should help a lot with the smaller strings (and would still allow us to specify a different encoding if it's neither of these).
Ok, I added two new types to the protocol: one for UTF-8 strings and one for "other" encoded strings. Strings tagged as ASCII-8BIT use the BIN type, so there's no change in protocol overhead for them. The only strings with additional overhead now are those tagged with some encoding other than UTF-8, US-ASCII, or ASCII-8BIT.
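The tag choice can be sketched like this (the symbolic tags stand in for the real wire-format type bytes, which aren't shown in this thread):

```ruby
def v2_tag(str)
  case str.encoding
  when Encoding::UTF_8, Encoding::US_ASCII
    :unicode_string                   # fixed type, no name on the wire
  when Encoding::ASCII_8BIT
    :binary                           # same cost as the old BIN type
  else
    [:enc_string, str.encoding.name]  # only these pay for the name
  end
end
```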
Here's a demo of UTF8 strings:
[aaron@TC bert (encoding)]$ irb -I lib
irb(main):001:0> require 'bert'
=> true
irb(main):002:0> BERT::Encoder.encode("日本語").bytesize
=> 15
irb(main):003:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語")).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):004:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語"))
=> "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"
irb(main):005:0> BERT::Encode.version = :v2
=> :v2
irb(main):006:0> BERT::Encoder.encode("日本語").bytesize
=> 15
irb(main):007:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語")).encoding
=> #<Encoding:UTF-8>
irb(main):008:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語"))
=> "日本語"
irb(main):009:0>
Binary strings:
[aaron@TC bert (encoding)]$ irb -I lib -rbert
irb(main):001:0> BERT::Encoder.encode("日本語".b).bytesize
=> 15
irb(main):002:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語".b)).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):003:0> BERT::Encode.version = :v2
=> :v2
irb(main):004:0> BERT::Encoder.encode("日本語".b).bytesize
=> 15
irb(main):005:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語".b)).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):006:0>
US-ASCII:
[aaron@TC bert (encoding)]$ irb -I lib -rbert
irb(main):001:0> BERT::Encoder.encode("teelo".encode('US-ASCII')).bytesize
=> 11
irb(main):002:0> BERT::Decoder.decode(BERT::Encoder.encode("teelo".encode('US-ASCII'))).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):003:0> BERT::Encode.version = :v2
=> :v2
irb(main):004:0> BERT::Encoder.encode("teelo".encode('US-ASCII')).bytesize
=> 11
irb(main):005:0> BERT::Decoder.decode(BERT::Encoder.encode("teelo".encode('US-ASCII'))).encoding
=> #<Encoding:UTF-8>
irb(main):006:0>
And finally Shift_JIS:
[aaron@TC bert (encoding)]$ irb -I lib -rbert
irb(main):001:0> BERT::Encoder.encode("日本語".encode('Shift_JIS')).bytesize
=> 12
irb(main):002:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語".encode('Shift_JIS'))).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):003:0> BERT::Encode.version = :v2
=> :v2
irb(main):004:0> BERT::Encoder.encode("日本語".encode('Shift_JIS')).bytesize
=> 26
irb(main):005:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語".encode('Shift_JIS'))).encoding
=> #<Encoding:Shift_JIS>
irb(main):006:0>
Note that US-ASCII gets lifted to UTF-8, but I think that doesn't matter. The Shift_JIS string is a bit longer when encoded because it embeds the encoding name.
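The size difference in the Shift_JIS transcript checks out: the string itself is six bytes in Shift_JIS, and the encoding name accounts for most of the growth (the remainder is presumably per-field framing):

```ruby
sjis = "日本語".encode("Shift_JIS")
sjis.bytesize        # => 6 (two bytes per character in Shift_JIS)
"Shift_JIS".bytesize # => 9 bytes of encoding name on the wire
# 26 - 12 = 14 extra bytes: the 9-byte name plus framing overhead
```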
Nice work!
> Note that US-ASCII gets lifted to UTF-8, but I think that doesn't matter
:+1: to that
I like it. :+1:
Looks great ✨