Use TextDecoder for toString('utf8')

Open mischnic opened this issue 4 years ago • 4 comments

Closes #268

This uses TextDecoder for toString('utf8') and toString().
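As a rough illustration of the approach (not the actual diff), a UTF-8 slice helper built on TextDecoder might look like the sketch below. The helper name `utf8Slice` is hypothetical; `decoderUTF8` follows the name used later in this thread. A plain `Uint8Array` with `subarray` stands in for the library's Buffer so the range is decoded without copying.

```javascript
// One shared decoder instance; 'utf-8' is also the default label.
const decoderUTF8 = new TextDecoder('utf-8')

// Hypothetical helper: decode bytes[start, end) as UTF-8.
function utf8Slice (bytes, start = 0, end = bytes.length) {
  // subarray() creates a view on the same memory, so nothing is copied.
  return decoderUTF8.decode(bytes.subarray(start, end))
}

const sample = Uint8Array.from([0x68, 0x69, 0xc3, 0x96]) // 'hi' followed by 'Ö'
console.log(utf8Slice(sample))       // hiÖ
console.log(utf8Slice(sample, 2, 4)) // Ö
```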

I needed to update some tests so that they match Node's native Buffer (which also makes them pass with TextDecoder). I hope this was correct?


Technically, TextDecoder also supports latin1 and utf-16le, but the conversion differs from Node's for strings that aren't representable in these encodings:

latin1:

Buffer content: 'Ö' = <UTF8 Buffer c3 96>
Output:
 TextDecoder latin1: 'Ö' <Buffer c3 13>
 Node Buffer latin1: 'Ö' <Buffer c3 96>

utf16: (TextDecoder adds an "�" at the end. https://www.compart.com/en/unicode/U+FFFD)

Buffer content: 'abc' = <UTF8 Buffer 61 62 63>
Output:
 TextDecoder utf-16le: e6 89 a1 ef bf bd
 Node utf16le:         e6 89 a1
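The utf-16le case can be reproduced with a short script along these lines. This is a sketch that assumes it runs in Node, where both a global `TextDecoder` and the native `Buffer` are available; the `hex` helper is only there for printing.

```javascript
// 'abc' as UTF-8 bytes: an odd number of bytes, so not a whole number of UTF-16 code units.
const bytes = Uint8Array.from([0x61, 0x62, 0x63])

// TextDecoder turns the dangling final byte into U+FFFD.
const viaTextDecoder = new TextDecoder('utf-16le').decode(bytes)

// Node's native Buffer drops the dangling final byte instead.
const viaBuffer = Buffer.from(bytes).toString('utf16le')

// Print the UTF-8 bytes of each result as hex for comparison.
const hex = (s) => Buffer.from(s, 'utf8').toString('hex')
console.log(hex(viaTextDecoder)) // e689a1efbfbd
console.log(hex(viaBuffer))      // e689a1
```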

mischnic avatar Jan 07 '21 10:01 mischnic

Great initiative @mischnic. I am using Buffer as a drop-in replacement for Node's version. Changing the tests wouldn't work for me, because Buffer then couldn't be used that way. Would it be possible to post-process the output of decoderUTF8.decode(buf.slice(start, end)) to cover these cases? Maybe remove the "�" for utf16le and replace 0x13 with 0x96 for the latin1 encoding?
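One possible shape for the utf16le part of that suggestion, as a sketch only: rather than stripping the "�" afterwards, decode just the even-length prefix so that a dangling final byte never reaches TextDecoder. The helper name `utf16leSlice` is hypothetical, and this covers only the odd-trailing-byte case; any other differences between TextDecoder and Node are not addressed here.

```javascript
const decoderUTF16 = new TextDecoder('utf-16le')

// Hypothetical helper: ignore a dangling final byte before decoding.
function utf16leSlice (bytes) {
  // Round the length down to a whole number of 2-byte code units.
  const evenLength = bytes.length - (bytes.length % 2)
  return decoderUTF16.decode(bytes.subarray(0, evenLength))
}

const odd = Uint8Array.from([0x61, 0x62, 0x63])
// Compare against Node's native Buffer for the same input.
console.log(utf16leSlice(odd) === Buffer.from(odd).toString('utf16le')) // true
```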

martinheidegger avatar Jan 11 '21 05:01 martinheidegger

I only adjusted those tests that were apparently deviating from Node's Buffer. For example, try running this:

> new Buffer([0xF4, 0x8F, 0x80]).toString().length
1

So this test was apparently wrong:

  t.equal(
    new B([0xF4, 0x8F, 0x80]).toString(),
    '\uFFFD\uFFFD\uFFFD'
  )

I only used TextDecoder for utf8 because it seems to align with Buffer.toString("utf8"). The handling of the other encodings (utf16, latin1) is still the same.
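The alignment for the truncated sequence above can be checked directly. The snippet is a sketch that assumes a Node environment with a global `TextDecoder`; the variable names are illustrative.

```javascript
// A 4-byte UTF-8 sequence cut off after its third byte.
const truncated = Uint8Array.from([0xf4, 0x8f, 0x80])

// TextDecoder replaces the whole incomplete sequence with a single U+FFFD.
const decoded = new TextDecoder('utf-8').decode(truncated)
console.log(decoded.length)       // 1
console.log(decoded === '\uFFFD') // true

// Node's native Buffer, for comparison with the REPL output quoted above.
const nodeDecoded = Buffer.from(truncated).toString('utf8')
console.log(nodeDecoded.length)   // 1
```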

mischnic avatar Jan 11 '21 09:01 mischnic

@feross ?

mischnic avatar Jun 05 '21 22:06 mischnic

There is apparently some break-even point where TextDecoder becomes faster than the existing implementation.

Using node perf/readUtf8.js, testing 256-byte buffers, and new Buffer('7c'.repeat(5e7), 'hex') for the "big" variants:

master:
	BrowserBuffer#readUtf8 x 414,259 ops/sec ±2.91% (85 runs sampled)
	NodeBuffer#readUtf8 x 486,114 ops/sec ±3.01% (84 runs sampled)
	BrowserBuffer#readUtf8 big x 0.98 ops/sec ±5.58% (7 runs sampled)
	NodeBuffer#readUtf8 big x 34.31 ops/sec ±1.56% (58 runs sampled)

this:
	BrowserBuffer#readUtf8 x 195,525 ops/sec ±2.06% (86 runs sampled)
	NodeBuffer#readUtf8 x 486,587 ops/sec ±2.24% (79 runs sampled)
	BrowserBuffer#readUtf8 big x 18.61 ops/sec ±11.77% (38 runs sampled)
	NodeBuffer#readUtf8 big x 35.19 ops/sec ±1.76% (61 runs sampled)

mischnic avatar Dec 08 '21 10:12 mischnic