Fix UTF-8 character corruption at 8KB buffer boundaries in socket communication
When formatting Ruby code containing multibyte UTF-8 characters (emojis, Japanese characters, etc.), the plugin corrupts these characters if they happen to fall exactly at the 8192-byte (8KB) boundary in the data stream between the Node.js plugin and Ruby server.
This issue likely originated from commit bd96faf (July 8, 2023) when the socket reading logic was changed to fix JSON parsing for large data. The change may have inadvertently introduced a UTF-8 boundary issue where multibyte characters could be split across chunk boundaries.
## Reproduction

```ruby
# Create a file with exactly 8188 ASCII characters followed by a multibyte character
puts "#{'a' * 8188}😀"
```
The emoji gets corrupted because it starts at byte 8189 and is split across the 8KB boundary.
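The corruption can be reproduced outside the plugin. A minimal sketch (the split offset and variable names are illustrative): when each chunk is decoded independently, as the old `data.toString("utf-8")` handler did, the trailing bytes of a character that straddles a chunk boundary become U+FFFD replacement characters.

```javascript
// 😀 is 4 bytes in UTF-8: F0 9F 98 80
const payload = Buffer.from("a".repeat(8188) + "😀");

// Simulate the socket delivering the bytes in two chunks that split the emoji.
const chunkA = payload.subarray(0, 8190); // ends mid-character
const chunkB = payload.subarray(8190);

// Decoding each chunk on its own (the buggy behavior) mangles the character.
const naive = chunkA.toString("utf-8") + chunkB.toString("utf-8");
console.log(naive.endsWith("😀"));     // false
console.log(naive.includes("\uFFFD")); // true: replacement characters

// Decoding only the reassembled bytes is safe.
const safe = Buffer.concat([chunkA, chunkB]).toString("utf-8");
console.log(safe.endsWith("😀"));      // true
```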
## Solution
Implemented a length-prefixed protocol for socket communication:
- Client sends a 4-byte length header before the JSON content
- Server reads the exact number of bytes specified in the header
- This ensures complete UTF-8 strings are decoded regardless of chunking
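The framing described above can be sketched as follows. This is an illustration of the protocol, not the plugin's actual code: `frame`/`unframe` are hypothetical names, and the 4-byte header is assumed to be an unsigned big-endian byte count.

```javascript
// Client side: prefix the JSON body with its byte length (not character count).
function frame(message) {
  const body = Buffer.from(JSON.stringify(message), "utf-8");
  const header = Buffer.alloc(4);
  header.writeUInt32BE(body.length, 0);
  return Buffer.concat([header, body]);
}

// Server side: decode only once the full body has arrived, so a multibyte
// character can never be split by chunking. Returns null while incomplete.
function unframe(buffer) {
  if (buffer.length < 4) return null;              // header not yet complete
  const expected = buffer.readUInt32BE(0);
  if (buffer.length < 4 + expected) return null;   // body not yet complete
  return JSON.parse(buffer.subarray(4, 4 + expected).toString("utf-8"));
}
```

The receiver accumulates raw bytes and calls `unframe` after each chunk; decoding is deferred until the byte count in the header is satisfied, which is what makes the result independent of where the 8KB chunk boundaries fall.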
## Testing
Added comprehensive test coverage:
- Test case that reproduces the exact 8KB boundary issue
- Verified with various multibyte characters (emojis, Japanese characters)
- Ensures the fix works across different buffer sizes
## Impact
- Fixes data corruption for users working with non-ASCII content
- No breaking changes to the API
- Minimal performance impact (4-byte overhead per request)
I ran into this issue around the same time and had been working on a fix locally. This change looks less invasive than mine. What do you think?
```diff
diff --git a/src/plugin.js b/src/plugin.js
index 71d2030..276551c 100644
--- a/src/plugin.js
+++ b/src/plugin.js
@@ -157,6 +157,7 @@ async function parse(parser, source, opts) {
   return new Promise((resolve, reject) => {
     const socket = new net.Socket();
+    socket.setEncoding('utf-8');

     let chunks = "";

     socket.on("error", (error) => {
@@ -164,7 +165,7 @@ async function parse(parser, source, opts) {
     });

     socket.on("data", (data) => {
-      chunks += data.toString("utf-8");
+      chunks += data;
     });

     socket.on("end", () => {
```
@yaa I think that solution makes sense. According to https://nodejs.org/api/stream.html#readablesetencodingencoding, `setEncoding()` makes the stream handle multibyte characters properly instead of emitting raw `Buffer` chunks, so it should fix the issue.