node icon indicating copy to clipboard operation
node copied to clipboard

Proposal: Add a method to check if a string is a OneByteString

Open theweipeng opened this issue 1 year ago • 7 comments

What is the problem this feature will solve?

In the Buffer module, we have a series of methods for handling string writing and reading, such as latin1WriteStatic and utf8WriteStatic. However, in some situations, we don't know the actual encoding of the string without checking every character. Checking the encoding will introduce overhead, especially when the string is large since SIMD is not accessible on the JavaScript side. In some string processing programs like the serialize framework (https://fury.apache.org/), high performance in string processing is highly beneficial for such programs.

What is the feature you are proposing to solve the problem?

Add isOneByteString function on the javascript side.

function isOneByteString(str) {
    if (typeof str  !== "string") {
        return null;
    }
    return getIsOneByte(str);
}

Add the getIsOneByte function on the C++ side. There is GetIsOneByteSlow for the slow mode, which is used when the place where it is being used cannot be compiled by TurboFan. And there is GetIsOneByteFast for the fast mode. This function is only applicable when the input string is a FastOneByteString, and in such a case, it will return true directly.

void GetIsOneByteSlow(
    const v8::FunctionCallbackInfo<v8::Value>& info) {
  DCHECK(ValidateCallbackInfo(info));
  if (info.Length() != 1 || !info[0]->IsString()) {
    info.GetIsolate()->ThrowError(
        "isOneByteString() requires a single string argument.");
    return;
  }
  bool is_one_byte = Utils::OpenDirectHandle(*info[0].As<v8::String>())
                         ->IsOneByteRepresentation();
  info.GetReturnValue().Set(is_one_byte);
}

bool GetIsOneByteFast(v8::Local<v8::Value> receiver,
                  const v8::FastOneByteString &source) {
  return true;
}

What alternatives have you considered?

No response

theweipeng avatar Nov 30 '24 12:11 theweipeng

Go for it! This seems to have a good shape w/ the shared snippets

juanarbol avatar Dec 01 '24 07:12 juanarbol

I feel that it might need some clarifications about what this util is targeting

  1. If this is for checking "whether the string will hit an internal fast path", then GetIsOneByteSlow() should just return false. Being a one byte string doesn't mean it will actually ensure fast call invocation, as there are other factors to consider (e.g. concatenated strings are still not optimized IIRC). And this should be named something like isFastOneByteString() instead, though that might be disclosing too many implementation details and lead to micro-optimizations that assumes calling into fast calls twice is always faster than calling into normal binding once, which may or may not be true, and doing this extra check can regress the performance in the case where the the string is not fast one byte.
  2. If this is for checking "whether the string only contains Latin-1 characters", then what you should call in the slow path is not IsOneByteRepresentation() but ContainsOnlyOneByte() because IsOneByteRepresentation() can return false when the string contains only Latin-1 characters but for some reason is stored as two-byte.

joyeecheung avatar Dec 03 '24 16:12 joyeecheung

I feel that it might need some clarifications about what this util is targeting

  1. If this is for checking "whether the string will hit an internal fast path", then GetIsOneByteSlow() should just return false. Being a one byte string doesn't mean it will actually ensure fast call invocation, as there are other factors to consider (e.g. concatenated strings are still not optimized IIRC). And this should be named something like isFastOneByteString() instead, though that might be disclosing too many implementation details and lead to micro-optimizations that assumes calling into fast calls twice is always faster than calling into normal binding once, which may or may not be true, and doing this extra check can regress the performance in the case where the the string is not fast one byte.
  2. If this is for checking "whether the string only contains Latin-1 characters", then what you should call in the slow path is not IsOneByteRepresentation() but ContainsOnlyOneByte() because IsOneByteRepresentation() can return false when the string contains only Latin-1 characters but for some reason is stored as two-byte.

Thank you for reminding me. Please correct me if there is any problem with my expression.

My intention is to eliminate the overhead of "encoding checks and byteLength compute" before latin1Write or ucs2Write. Thus, this proposal aims to check "whether the string is store by oneByte or twoByte".

My code will be as follows:

const Flags = {
    latin1,
    utf16
}
function serializeString(buffer, str) {
    let cursor = 0;
    const isOneByte = util.isOneByteString(str);
    if (isOneByte) {
           buffer.writeUint8(Flags.latin1, cursor++);  // write the flags
           buffer.writeUInt32LE(str.length, cursor); // write byteLength
           buffer.write(str, cursor + 4, "latin1");   // memcpy
           cursor += str.length + 4;
    } else {
           buffer.writeUint8(Flags.utf16, cursor++);  // write the flags
           buffer.writeUInt32LE(str.length * 2, cursor); // write byteLength
           buffer.write(str, cursor + 4, "utf16le");   //  memcpy
           cursor += str.length * 2  + 4;
    }
}

So, "whether the string contains only Latin-1 characters" is not crucial in this situation. If it is a oneByte string, I will write it to the buffer as latin1, and if it is a twoByte string, I will write it as utf16.

theweipeng avatar Dec 04 '24 02:12 theweipeng

I see, in that case it should probably be placed as something like v8.isOneByteRepresentation(), as this is very V8-specific, although note that this can start to backfire if V8 changes its string representation schemes (e.g. if one day it decides to just store UTF8, or at least support UTF8 storage). I feel that what you are trying to do, a method that does the serialization + returning the encoding used would be a better encapsulation?

joyeecheung avatar Dec 04 '24 04:12 joyeecheung

I see, in that case it should probably be placed as something like v8.isOneByteRepresentation(), as this is very V8-specific, although note that this can start to backfire if V8 changes its string representation schemes (e.g. if one day it decides to just store UTF8, or at least support UTF8 storage). I feel that what you are trying to do, a method that does the serialization + returning the encoding used would be a better encapsulation?

Yes, this is very V8-specific, and not a good encapsulation. Perhaps we can add a util tool to return the raw information of a string. It contains byteLength(1x or 2x string length) and encoding(latin1 or ucs2). With these info serializer could work in the samy way

theweipeng avatar Dec 04 '24 05:12 theweipeng

Returning byte length + encoding SGTM (although I think the one for two-byte should be utf16, and for V8 the endianness is the same as the one from the platform IIRC).

joyeecheung avatar Dec 04 '24 05:12 joyeecheung

Returning byte length + encoding SGTM (although I think the one for two-byte should be utf16, and for V8 the endianness is the same as the one from the platform IIRC).

Yes, it is same, utf16 is better and more common

theweipeng avatar Dec 04 '24 05:12 theweipeng

Returning byte length + encoding SGTM (although I think the one for two-byte should be utf16, and for V8 the endianness is the same as the one from the platform IIRC).

Hi, I submitted a pull request https://github.com/nodejs/node/pull/56147. Could you take a moment to review it. :)

theweipeng avatar Dec 10 '24 02:12 theweipeng

@joyeecheung @juanarbol I'm sorry to bring this up, but I still want to ask if there's any plan for this or if there are any problems with this PR. :)

theweipeng avatar Dec 11 '24 14:12 theweipeng