Fix UTF-8 character corruption at 8KB buffer boundaries in socket communication
When formatting Ruby code containing multibyte UTF-8 characters (emojis, Japanese characters, etc.), the plugin corrupts these characters if they happen to fall exactly at the 8192-byte (8KB) boundary in the data stream between the Node.js plugin and Ruby server.
This issue likely originated from commit bd96faf (July 8, 2023) when the socket reading logic was changed to fix JSON parsing for large data. The change may have inadvertently introduced a UTF-8 boundary issue where multibyte characters could be split across chunk boundaries.
## Reproduction

```ruby
# Create a file with exactly 8188 ASCII characters followed by a multibyte character
puts "#{'a' * 8188}😀"
```
The emoji gets corrupted because it starts at byte 8189 and is split across the 8KB boundary.
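The corruption can be reproduced outside the plugin. A minimal sketch (the split offset and variable names are illustrative): when each chunk is decoded independently, as the old `data.toString("utf-8")` handler did, the trailing bytes of a character that straddles a chunk boundary become U+FFFD replacement characters.

```javascript
// 😀 is 4 bytes in UTF-8: F0 9F 98 80
const payload = Buffer.from("a".repeat(8188) + "😀");

// Simulate the socket delivering the bytes in two chunks that split the emoji.
const chunkA = payload.subarray(0, 8190); // ends mid-character
const chunkB = payload.subarray(8190);

// Decoding each chunk on its own (the buggy behavior) mangles the character.
const naive = chunkA.toString("utf-8") + chunkB.toString("utf-8");
console.log(naive.endsWith("😀"));     // false
console.log(naive.includes("\uFFFD")); // true: replacement characters

// Decoding only the reassembled bytes is safe.
const safe = Buffer.concat([chunkA, chunkB]).toString("utf-8");
console.log(safe.endsWith("😀"));      // true
```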
## Solution
Implemented a length-prefixed protocol for socket communication:
- Client sends a 4-byte length header before the JSON content
- Server reads the exact number of bytes specified in the header
- This ensures complete UTF-8 strings are decoded regardless of chunking
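The framing described above can be sketched as follows. This is an illustration of the protocol, not the plugin's actual code: `frame`/`unframe` are hypothetical names, and the 4-byte header is assumed to be an unsigned big-endian byte count.

```javascript
// Client side: prefix the JSON body with its byte length (not character count).
function frame(message) {
  const body = Buffer.from(JSON.stringify(message), "utf-8");
  const header = Buffer.alloc(4);
  header.writeUInt32BE(body.length, 0);
  return Buffer.concat([header, body]);
}

// Server side: decode only once the full body has arrived, so a multibyte
// character can never be split by chunking. Returns null while incomplete.
function unframe(buffer) {
  if (buffer.length < 4) return null;              // header not yet complete
  const expected = buffer.readUInt32BE(0);
  if (buffer.length < 4 + expected) return null;   // body not yet complete
  return JSON.parse(buffer.subarray(4, 4 + expected).toString("utf-8"));
}
```

The receiver accumulates raw bytes and calls `unframe` after each chunk; decoding is deferred until the byte count in the header is satisfied, which is what makes the result independent of where the 8KB chunk boundaries fall.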
## Testing
Added comprehensive test coverage:
- Test case that reproduces the exact 8KB boundary issue
- Verified with various multibyte characters (emojis, Japanese characters)
- Ensures the fix works across different buffer sizes
## Impact
- Fixes data corruption for users working with non-ASCII content
- No breaking changes to the API
- Minimal performance impact (4-byte overhead per request)
I ran into this issue around the same time and had been working on a fix locally. This change looks less invasive than mine. What do you think?
```diff
diff --git a/src/plugin.js b/src/plugin.js
index 71d2030..276551c 100644
--- a/src/plugin.js
+++ b/src/plugin.js
@@ -157,6 +157,7 @@ async function parse(parser, source, opts) {
   return new Promise((resolve, reject) => {
     const socket = new net.Socket();
+    socket.setEncoding('utf-8');

     let chunks = "";

     socket.on("error", (error) => {
@@ -164,7 +165,7 @@ async function parse(parser, source, opts) {
     });

     socket.on("data", (data) => {
-      chunks += data.toString("utf-8");
+      chunks += data;
     });

     socket.on("end", () => {
```
@yaa I think that solution makes sense. According to https://nodejs.org/api/stream.html#readablesetencodingencoding, `setEncoding()` makes the stream handle multibyte characters properly instead of emitting raw `Buffer` chunks, so it should fix the issue.