
Fix UTF-8 character corruption at 8KB buffer boundaries in socket communication

Open y0n0zawa opened this issue 8 months ago • 2 comments

When formatting Ruby code containing multibyte UTF-8 characters (emojis, Japanese characters, etc.), the plugin corrupts these characters if they happen to fall exactly at the 8192-byte (8KB) boundary in the data stream between the Node.js plugin and Ruby server.

This issue likely originated from commit bd96faf (July 8, 2023) when the socket reading logic was changed to fix JSON parsing for large data. The change may have inadvertently introduced a UTF-8 boundary issue where multibyte characters could be split across chunk boundaries.

Reproduction

# Create a file with exactly 8188 ASCII characters followed by a multibyte character
puts "#{'a' * 8188}😀"

The emoji gets corrupted because it starts at byte 8189 and is split across the 8KB boundary.
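The failure mode can be demonstrated in isolation, without the socket: Node's `Buffer#toString("utf-8")` replaces any incomplete byte sequence with U+FFFD, so decoding each chunk independently (as the `data` handler does) destroys a character whose bytes straddle two chunks. A minimal sketch:

```javascript
// "😀" is four bytes in UTF-8 (f0 9f 98 80).
const bytes = Buffer.from("😀", "utf-8");

// Simulate a chunk boundary falling inside the character:
const chunk1 = bytes.subarray(0, 2); // f0 9f — an incomplete sequence
const chunk2 = bytes.subarray(2);    // 98 80 — stray continuation bytes

// Decoding each chunk on its own, as `data.toString("utf-8")` does per
// "data" event, yields U+FFFD replacement characters, not the emoji:
const naive = chunk1.toString("utf-8") + chunk2.toString("utf-8");
console.log(naive === "😀"); // false
```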

Solution

Implemented a length-prefixed protocol for socket communication:

  1. Client sends a 4-byte length header before the JSON content
  2. Server reads the exact number of bytes specified in the header
  3. This ensures complete UTF-8 strings are decoded regardless of chunking
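A minimal sketch of that framing, assuming a big-endian 4-byte header (the issue doesn't state the byte order actually used, and `frame`/`unframe` are illustrative names, not functions from the PR):

```javascript
// Frame a JSON payload with a 4-byte big-endian length header.
// Note the header carries the BYTE length, not the character count.
function frame(json) {
  const body = Buffer.from(json, "utf-8");
  const header = Buffer.alloc(4);
  header.writeUInt32BE(body.length, 0);
  return Buffer.concat([header, body]);
}

// The reader accumulates raw bytes and decodes exactly once, after the
// full `length` bytes have arrived, so a multibyte character can never
// be decoded across a chunk boundary.
function unframe(buffer) {
  const length = buffer.readUInt32BE(0);
  return buffer.subarray(4, 4 + length).toString("utf-8");
}

console.log(unframe(frame('{"code":"😀"}'))); // round-trips intact
```

The key point is that decoding is deferred until the byte count promised by the header is satisfied; how the stream happened to be chunked in transit becomes irrelevant.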

Testing

Added comprehensive test coverage:

  • Test case that reproduces the exact 8KB boundary issue
  • Verified with various multibyte characters (emojis, Japanese characters)
  • Ensures the fix works across different buffer sizes
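A regression test along these lines can be sketched as follows (the names are illustrative, not taken from the actual test suite): place a 4-byte character so that an 8192-byte read splits it, then check that accumulating raw `Buffer`s before a single decode preserves the payload while per-chunk decoding does not.

```javascript
const payload = "a".repeat(8190) + "😀";  // 8194 bytes in UTF-8
const bytes = Buffer.from(payload, "utf-8");

const chunk1 = bytes.subarray(0, 8192);   // 8KB read splits the emoji
const chunk2 = bytes.subarray(8192);      // remaining 2 bytes

// Per-chunk decoding (the buggy path) corrupts the character:
const perChunk = chunk1.toString("utf-8") + chunk2.toString("utf-8");

// Concatenating raw bytes first (what a length-prefixed read enables):
const buffered = Buffer.concat([chunk1, chunk2]).toString("utf-8");

console.log(perChunk === payload); // false
console.log(buffered === payload); // true
```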

Impact

  • Fixes data corruption for users working with non-ASCII content
  • No breaking changes to the API
  • Minimal performance impact (4-byte overhead per request)

y0n0zawa avatar Jul 01 '25 03:07 y0n0zawa

I also encountered this issue around the same time and had been working on a fix locally. The fix below seems to have a smaller footprint than the length-prefixed approach. What do you think?

diff --git a/src/plugin.js b/src/plugin.js
index 71d2030..276551c 100644
--- a/src/plugin.js
+++ b/src/plugin.js
@@ -157,6 +157,7 @@ async function parse(parser, source, opts) {

   return new Promise((resolve, reject) => {
     const socket = new net.Socket();
+    socket.setEncoding('utf-8');
     let chunks = "";

     socket.on("error", (error) => {
@@ -164,7 +165,7 @@ async function parse(parser, source, opts) {
     });

     socket.on("data", (data) => {
-      chunks += data.toString("utf-8");
+      chunks += data;
     });

     socket.on("end", () => {

yaa avatar Jul 20 '25 05:07 yaa

@yaa I think that solution makes sense. According to https://nodejs.org/api/stream.html#readablesetencodingencoding, it should fix the issue.

kddnewton avatar Jul 30 '25 15:07 kddnewton