`ethers.utils.defaultAbiCoder.decode` can not decode if a string has certain chars in it

Open joshstevens19 opened this issue 2 years ago • 3 comments

Ethers Version

@ethersproject/abi::5.6.4

Search Terms

abi, decoder

Describe the Problem

Hey, first off ethers is the best thanks for all you do @ricmoo!

Issue

When you try to decode a log that has certain chars in the string like � the decoder fails with an error:

thrown: [null: invalid codepoint at offset 97; unexpected continuation byte (argument="bytes", value=Uint8Array....

This then causes a bailout if you are using ethers to decode anything which may contain this. You can see the successful decoding from tenderly here

Here are two example unit tests: one with normal chars in the data (tx here) and one with invalid chars in the data (tx here).

Working test

Here we decode the unindexed log info, you can see the tx here PostCreated event.

import { ethers } from 'ethers';

it('example working', () => {
  const unindexedData = [
    { indexed: false, internalType: 'string', name: 'contentURI', type: 'string' },
    { indexed: false, internalType: 'address', name: 'collectModule', type: 'address' },
    { indexed: false, internalType: 'bytes', name: 'collectModuleReturnData', type: 'bytes' },
    { indexed: false, internalType: 'address', name: 'referenceModule', type: 'address' },
    { indexed: false, internalType: 'bytes', name: 'referenceModuleReturnData', type: 'bytes' },
    { indexed: false, internalType: 'uint256', name: 'timestamp', type: 'uint256' },
  ];
  const workingData =
    '0x00000000000000000000000000000000000000000000000000000000000000c000000000000000000000000023b9467334beb345aaa6fd1545538f3d54436e960000000000000000000000000000000000000000000000000000000000000140000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001800000000000000000000000000000000000000000000000000000000062e6cbac000000000000000000000000000000000000000000000000000000000000005068747470733a2f2f646174612e6c656e732e7068617665722e636f6d2f6170692f6c656e732f706f7374732f65323936633839662d353364652d346332632d623237372d37306163653361643632336100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000';

  const result = ethers.utils.defaultAbiCoder.decode(
    unindexedData.map((i) => i.type),
    workingData
  );
  expect(result).toEqual([
    'https://data.lens.phaver.com/api/lens/posts/e296c89f-53de-4c2c-b277-70ace3ad623a',
    '0x23b9467334bEb345aAa6fd1545538F3d54436e96',
    '0x0000000000000000000000000000000000000000000000000000000000000001',
    '0x0000000000000000000000000000000000000000',
    '0x',
    { _hex: '0x62e6cbac', _isBigNumber: true },
  ]);
});

Broken test

Here we decode the unindexed log info, you can see the tx here PostCreated event.

it('example not working', () => {
  const unindexedData = [
    { indexed: false, internalType: 'string', name: 'contentURI', type: 'string' },
    { indexed: false, internalType: 'address', name: 'collectModule', type: 'address' },
    { indexed: false, internalType: 'bytes', name: 'collectModuleReturnData', type: 'bytes' },
    { indexed: false, internalType: 'address', name: 'referenceModule', type: 'address' },
    { indexed: false, internalType: 'bytes', name: 'referenceModuleReturnData', type: 'bytes' },
    { indexed: false, internalType: 'uint256', name: 'timestamp', type: 'uint256' },
  ];

  const badData =
    '0x00000000000000000000000000000000000000000000000000000000000000c000000000000000000000000023b9467334beb345aaa6fd1545538f3d54436e960000000000000000000000000000000000000000000000000000000000000220000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002600000000000000000000000000000000000000000000000000000000062e6cbbe0000000000000000000000000000000000000000000000000000000000000131646174613a2c7b2276657273696f6e223a22312e302e30222c226d657461646174615f6964223a2235623433383734632d393831392d343637652d396638652d386133326631653430356663222c226465736372697074696f6e223a22676d2028bf8cf09f2c20bf8cf09f29222c22636f6e74656e74223a22676d2028bf8cf09f2c20bf8cf09f29222c2265787465726e616c5f75726c223a6e756c6c2c22696d616765223a6e756c6c2c22696d6167654d696d6554797065223a6e756c6c2c226e616d65223a22506f73742062792040646f6e6f736f6e61756d637a756b222c2261747472696275746573223a5b7b22747261697454797065223a2274797065222c2276616c7565223a22706f7374227d5d2c226d65646961223a5b5d2c226170704964223a224c656e73746572227d000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000';

  const result = ethers.utils.defaultAbiCoder.decode(
    unindexedData.map((i) => i.type),
    badData
  );
  expect(result).toEqual([
    '"data:,{"version":"1.0.0","metadata_id":"5b43874c-9819-467e-9f8e-8a32f1e405fc","description":"gm (����, ����)","content":"gm (����, ����)","external_url":null,"image":null,"imageMimeType":null,"name":"Post by @donosonaumczuk","attributes":[{"traitType":"type","value":"post"}],"media":[],"appId":"Lenster"}"',
    '0x23b9467334bEb345aAa6fd1545538F3d54436e96',
    '0x0000000000000000000000000000000000000000000000000000000000000001',
    '0x0000000000000000000000000000000000000000',
    '0x',
    { _hex: '0x62e6cbbe', _isBigNumber: true },
  ]);
});

As you can see, this throws the `null: invalid codepoint at offset 97` error!

It would be great to understand why this happens, and also how Tenderly's and Etherscan's decoders seem to manage to work it out. Are they doing something differently from how it is done here?

Bailing out on cases like this forced us to write bespoke code for events that emit a string which could include these chars: we catch the failure from the ethers decoder and run our own logic so our indexer can carry on. Even when a value cannot be decoded, it would be nice to have something assigned to the array index that failed, so you can still access the other decoded information.

Let me know if you need any more info.

Thanks

Errors

`thrown: [null: invalid codepoint at offset 97; unexpected continuation byte (argument="bytes", value=Uint8Array....`

Environment

node.js (v12 or newer)

joshstevens19 • Aug 01 '22 10:08

any idea @ricmoo

joshstevens19 • Aug 02 '22 21:08

That error means the data is not valid UTF8 data.

You can use the recoverable error API to access it with a different strategy (such as ignore or replace), but I’m not at a computer to type in demo code right now; you basically can get the bytes from the error and use the toUtf8String function, passing in the strategy callback for errors.

Keep in mind when processing invalid UTF8 data, changing things using non-error strategies can result in exploits. It allows multiple different strings to have the same hash, for example.
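To make the failure concrete, here is a minimal sketch that does not use ethers at all. It feeds the `description` bytes from the failing log above (`gm (` followed by `bf 8c f0 9f`) to Node's built-in `TextDecoder`, once in strict mode and once in its default lossy mode. The helper names `decodeStrict` and `decodeLossy` are made up for illustration, and the lossy mode is only an assumption about what a more forgiving decoder might do, not a statement about how Tenderly or Etherscan actually work.

```javascript
// Bytes copied from the failing log: "gm (" + bf 8c f0 9f + ", " + bf 8c f0 9f + ")"
const bytes = Buffer.from("676d2028bf8cf09f2c20bf8cf09f29", "hex");

// Strict decoding: any invalid UTF-8 sequence raises an error.
function decodeStrict(buf) {
  return new TextDecoder("utf-8", { fatal: true }).decode(buf);
}

// Lossy decoding: invalid sequences become U+FFFD (the replacement character).
function decodeLossy(buf) {
  return new TextDecoder("utf-8").decode(buf);
}

let strictFailed = false;
try {
  decodeStrict(bytes);
} catch (err) {
  // 0xbf is a continuation byte with no lead byte in front of it
  strictFailed = true;
}
console.log(strictFailed); // true

const lossy = decodeLossy(bytes);
console.log(lossy.includes("\ufffd")); // true
```

The strict path corresponds to the default behaviour described in this thread, and the lossy path to the "replace" style of strategy. The caveat about distinct inputs collapsing into the same string applies to the lossy path.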

ricmoo • Aug 02 '22 21:08

Would love to see some code showing what you mean here. Do you think that's how Etherscan and Tenderly still manage to decode the log?

joshstevens19 • Aug 04 '22 22:08

This came back up as part of some tech debt. Is there an elegant way to do this without losing the rest of the valid data you can decode? I tried a few things and nothing worked every time, so I wondered if you guys have solved it.

joshstevens19 • Oct 28 '23 16:10

Ethers fully supports decoding data as long as the structure is correct and can be parsed.

In the case of an invalid string, only accessing that value within the result will throw, using the "deferred error API".

Here is an example of how to use alternate string decoding mechanisms if your data contains bad strings, but please keep in mind that care should be taken when using invalid strings, as they can be used for a variety of attacks:

data = '0x00000000000000000000000000000000000000000000000000000000000000220000000000000000000000000000000000000000000000000000000000000040000000000000000000000000000000000000000000000000000000000000000b48656c6c6f20576f6c72ff000000000000000000000000000000000000000000';

// ethers v6
result = ethers.AbiCoder.defaultAbiCoder().decode([ "uint", "string" ], data)

// This is fine, since the uint at index 0 is perfectly fine
console.log(result[0])
// 34n

// This will throw however, since the string is invalid and you are *accessing* it
console.log(result[1])
// throws

// Instead, to access the invalid data, perhaps to attempt decoding using an alternate error strategy, capture the error:
try {
  result[1];
} catch (err) {
  // The offending bytes, extracted from the error
  const badBytes = err.error.value;

  // Using the ignore strategy, invalid UTF-8 code points are discarded:
  console.log(ethers.toUtf8String(badBytes, ethers.Utf8ErrorFuncs.ignore));
  // "Hello Wolr"

  // Using the replacement strategy, invalid UTF-8 code points are replaced with the Unicode replacement character (U+FFFD):
  console.log(ethers.toUtf8String(badBytes, ethers.Utf8ErrorFuncs.replace));
  // "Hello Wolr�"
}
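For the request earlier in the thread (keep the fields that decode and put a placeholder at the index that fails), the same try/catch can be applied per index. The following is only a sketch under assumptions: `readAllFields` is a hypothetical helper, not part of ethers, and `fakeResult` is a stand-in object that mimics a decoded result whose index 1 throws on access with the `err.error.value` shape used above. With ethers you would pass the real result instead.

```javascript
// Hypothetical helper: read every index of a decoded result, substituting a
// fallback value for any entry whose access throws.
function readAllFields(result, length, onBadField) {
  const values = [];
  for (let i = 0; i < length; i++) {
    try {
      values.push(result[i]);
    } catch (err) {
      values.push(onBadField(i, err));
    }
  }
  return values;
}

// Stand-in for a decoded result: index 0 is fine, index 1 throws on access.
const fakeResult = {
  0: 34n,
  get 1() {
    const wrapped = new Error("deferred error during ABI decoding");
    wrapped.error = {
      value: Uint8Array.from(Buffer.from("48656c6c6f20576f6c72ff", "hex")),
    };
    throw wrapped;
  },
};

// Keep the raw bytes as hex so the caller decides later how to interpret them.
const fields = readAllFields(fakeResult, 2, (index, err) => ({
  index,
  invalidUtf8: true,
  rawHex: "0x" + Buffer.from(err.error.value).toString("hex"),
}));

console.log(fields[0]);        // 34n
console.log(fields[1].rawHex); // 0x48656c6c6f20576f6c72ff
```

Storing the raw hex rather than a lossy string means an indexer can carry on without committing to an ignore or replace strategy at decode time.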

The danger of these strategies is that the errors fold: distinct invalid byte sequences collapse into the same string, so you can have the following problem:

bytesA = "0x48656c6c6f20576f6c72ff"
bytesB = "0x48656c6c6f20576f6c72fe"

console.log(bytesA == bytesB)
// false

strA = ethers.toUtf8String(bytesA, ethers.Utf8ErrorFuncs.ignore);
strB = ethers.toUtf8String(bytesB, ethers.Utf8ErrorFuncs.ignore);

console.log(strA == strB)
// true

Does that make sense?
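One way to limit that risk, sketched here as an assumption rather than as an ethers feature, is to treat the lossy string as display-only and keep the exact bytes next to it for any equality or hashing decision. `decodeForDisplay` is a hypothetical helper that uses Node's built-in `TextDecoder` in place of the ethers strategies.

```javascript
// Hypothetical helper: decode for display, but keep the exact bytes for identity checks.
function decodeForDisplay(hex) {
  const raw = Buffer.from(hex.replace(/^0x/, ""), "hex");
  let text;
  let valid = true;
  try {
    // Strict pass first, so valid strings are never altered.
    text = new TextDecoder("utf-8", { fatal: true }).decode(raw);
  } catch {
    valid = false;
    // Lossy pass: invalid sequences become U+FFFD.
    text = new TextDecoder("utf-8").decode(raw);
  }
  return { text, valid, rawHex: "0x" + raw.toString("hex") };
}

const a = decodeForDisplay("0x48656c6c6f20576f6c72ff");
const b = decodeForDisplay("0x48656c6c6f20576f6c72fe");

console.log(a.text === b.text);     // true  (the display strings collide)
console.log(a.rawHex === b.rawHex); // false (the underlying bytes still differ)
console.log(a.valid, b.valid);      // false false
```

Comparing or hashing `rawHex` instead of `text` avoids treating the two inputs as the same value.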

ricmoo • Oct 28 '23 18:10