nodejs-bigtable

Reading from BigTable adds unicode replacement characters to stored bytearray

Open avosirenfal opened this issue 3 years ago • 0 comments

  • OS: Windows 10
  • Node.js version: 16.13.2
  • npm version: 8.3.2
  • @google-cloud/bigtable version: 4.0.0

Certain Uint8Array values are stored correctly in Bigtable, but when the column is read back the data is corrupted, apparently by incorrect UTF-8 decoding. Reading the same column from Python returns the expected data.

Test case:

const {Bigtable} = require('@google-cloud/bigtable');

const bigtable = new Bigtable();
const instance = bigtable.instance("myinstance");
const table = instance.table("mytable");

const COLUMN_FAMILY = "myfamily";
const COLUMN = "mycolumn";

async function bigtable_test() {
	const data = Uint8Array.from([0x1f, 0x8b, 0x8, 0x0, 0x34, 0x30, 0x8, 0x63, 0x0, 0x3, 0xa5, 0x92, 0x3b, 0x6e, 0x54, 0x41, 0x10, 0x45, 0x7, 0x82,
		0x21, 0x61, 0x25, 0x44, 0x55, 0xdd, 0xd5, 0x3f, 0x42, 0x42, 0xc4, 0x26, 0xea, 0xd7, 0xf6, 0x13, 0x9e, 0x8f, 0xe6, 0x3d, 0x4b, 0xb3, 0x5, 0x44, 0xc4,
		0x1e, 0x0, 0xdb, 0x72, 0x0, 0xbb, 0x70, 0xe6, 0xdc, 0x88, 0xc5, 0xd0, 0x58, 0xb2, 0xe4, 0xc4, 0x92, 0x25, 0x57, 0xd8, 0x5d, 0x75, 0xeb, 0xea, 0xdc,
		0x14, 0xb2, 0x90, 0xf6, 0xc6, 0x35, 0x16, 0xea, 0xa5, 0x68, 0x48, 0xa2, 0x2a, 0x9, 0x1, 0x8c, 0x5a, 0xd4, 0xd4, 0xea, 0xef, 0x47, 0x9a, 0xf7, 0x5b,
		0xd6, 0x57, 0xf7, 0xc6, 0x1f, 0xf4, 0xad, 0xc9, 0x0, 0x9c, 0x20, 0x55, 0xcc, 0x99, 0x72, 0xe8, 0x2d, 0x68, 0x97, 0x6, 0x98, 0x23, 0x5a, 0x34, 0xe9,
		0x2, 0x0, 0x0])

	await table.insert([
		{
			key: `some_key`,
			data: {
				[COLUMN_FAMILY]: {
					[COLUMN]: {
						timestamp: new Date(),
						value: data
					},
				}
			}
		}
	])

	const [row] = await table.row('some_key').get();
	console.log(typeof(row.data[COLUMN_FAMILY][COLUMN][0].value))
	console.log(data.length + " versus " + row.data[COLUMN_FAMILY][COLUMN][0].value.length)
	console.log(Buffer.from(row.data[COLUMN_FAMILY][COLUMN][0].value))
}

bigtable_test().catch(console.error);

Run that first, and note that:

  • The returned value is a string.
  • The string's length differs from the original byte array's length.
  • Notably, once the returned string is converted back with Buffer.from, the second, third, and fourth bytes are 0xef, 0xbf, 0xbd, which is the UTF-8 encoding of the Unicode replacement character (U+FFFD).

My results:

string 
123 versus 119                                                                                                                                                                    
<Buffer 1f ef bf bd 08 00 34 30 08 63 00 03 ef bf bd ef bf bd 3b 6e 54 41 10 45 07 ef bf bd 21 61 25 44 55 ef bf bd ef bf bd 3f 42 42 ef bf bd 26 ef bf bd ef ... 172 more bytes> 

Then, in Python:

>>> from google.cloud import bigtable
>>> client = bigtable.Client()

>>> instance = client.instance("myinstance")
>>> table = instance.table("mytable")

>>> COLUMN_FAMILY = "myfamily"
>>> COLUMN = b"mycolumn"

>>> row = table.read_row('some_key'.encode('utf-8'))

>>> len(row.cells[COLUMN_FAMILY][COLUMN][0].value)
123

>>> row.cells[COLUMN_FAMILY][COLUMN][0].value
b'\x1f\x8b\x08\x0040\x08c\x00\x03\xa5\x92;nTA\x10E\x07\x82!a%DU\xdd\xd5?BB\xc4&\xea\xd7\xf6\x13\x9e\x8f\xe6=K\xb3\x05D\xc4\x1e\x00\xdbr\x00\xbbp\xe6\xdc\x88\xc5\xd0X\xb2\xe4\xc4\x92%W\xd8]u\xeb\xea\xdc\x14\xb2\x90\xf6\xc65\x16\xea\xa5hH\xa2*\t\x01\x8cZ\xd4\xd4\xea\xefG\x9a\xf7[\xd6W\xf7\xc6\x1f\xf4\xad\xc9\x00\x9c U\xcc\x99r\xe8-h\x97\x06\x98#Z4\xe9\x02\x00\x00'

We see that:

  • The returned data is the correct length.
  • The data matches the original byte array written from JavaScript.

So this appears to be a type-handling issue in the Node.js Bigtable library: it affects both createReadStream and table.row, and the cell value is decoded as UTF-8 even though the column does not hold a string.

The same corruption can be reproduced without Bigtable by decoding the original bytes as UTF-8:

console.log(Buffer.from(new TextDecoder("utf-8").decode(data)));

This gives the same incorrect result as the Node.js library:

<Buffer 1f ef bf bd 08 00 34 30 08 63 00 03 ef bf bd ef bf bd 3b 6e 54 41 10 45 07 ef bf bd 21 61 25 44 55 ef bf bd ef bf bd 3f 42 42 ef bf bd 26 ef bf bd ef ... 172 more bytes> 
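To narrow down which values are affected, here is a minimal sketch of a round-trip check. It assumes the corruption is exactly a lossy UTF-8 decode followed by a re-encode, which is what the output above suggests but is not confirmed from the library source. The helper name survivesUtf8RoundTrip is hypothetical and is not part of @google-cloud/bigtable.

```javascript
// Hypothetical helper (not part of @google-cloud/bigtable): reports whether a
// byte sequence is unchanged after being decoded as UTF-8 and encoded back.
function survivesUtf8RoundTrip(bytes) {
  const original = Buffer.from(bytes);
  // Byte sequences that are not valid UTF-8 are replaced with U+FFFD here.
  const decoded = original.toString('utf8');
  const reencoded = Buffer.from(decoded, 'utf8');
  return original.equals(reencoded);
}

// First four bytes of the test payload; 0x8b on its own is not valid UTF-8.
const payloadPrefix = Uint8Array.from([0x1f, 0x8b, 0x08, 0x00]);
// Plain ASCII is valid UTF-8, so it should round-trip unchanged.
const ascii = Uint8Array.from([0x68, 0x65, 0x6c, 0x6c, 0x6f]);

console.log(survivesUtf8RoundTrip(payloadPrefix)); // false
console.log(survivesUtf8RoundTrip(ascii)); // true

// Re-encoding the decoded prefix shows the same "ef bf bd" run as in the results.
console.log(Buffer.from(Buffer.from(payloadPrefix).toString('utf8')));
// <Buffer 1f ef bf bd 08 00>
```

If that assumption holds, any value for which this check returns false would come back corrupted, while values that happen to be valid UTF-8 would look fine. That would explain why only some byte arrays show the problem.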

avosirenfal, Aug 26 '22 07:08