
Random "invalid utf-16: lone surrogate found" errors (indexing issues on GCP)

Open juanmardefago opened this issue 4 years ago • 7 comments

As of late, while working on the Loopring subgraph, I've found that many times my subgraph would completely fail, and a non-change + redeploy would fix it (by non-change I mean adding a log, or some other code change that would only affect the hash so the subgraph can be re-deployed properly).

I just redeployed Loopring for a small change, and it failed with that same error again. Here's the handler error:

Handler skipped due to execution failure, error: invalid utf-16: lone surrogate found
wasm backtrace:
    0: 0x2396 - <unknown>!src/utils/helpers/account/getOrCreateAccountTokenBalance
    1: 0x2559 - <unknown>!src/utils/helpers/transactionProcessors/deposit/processDeposit
    2: 0x3e6e - <unknown>!src/utils/helpers/transaction/processTransactionData
    3: 0x3fbe - <unknown>!src/utils/helpers/block/processBlockData
    4: 0x4139 - <unknown>!src/mappings/exchangev3/handleSubmitBlocks
    5: 0x4166 - <unknown>!src/mappings/exchangev3/handleSubmitBlocksV1
handler: handleSubmitBlocksV1, block_hash: 0x1cb01ff512906207f614179185bdf4a8c76db53f7ad0ab59f2d0b3e13523f11c, block_number: 11560131


Since this issue appears randomly and is randomly fixed by re-deploying, my guess is that it's tied to particular versions or instances of graph-node within the hosted service: re-deploying assigns the subgraph to a different instance, running a different graph-node version, that eventually gets past the error. I can't be 100% sure, though, since I have no visibility into which instance it's running on or which graph-node version that instance runs.

Deployment ID where it reproduced: QmVDjnVVfKDqn9nA7Jyqq1JrCxcPH8LHmZYS9zy5N3x4sr
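
For readers unfamiliar with the error text: "invalid utf-16: lone surrogate found" is the message Rust's standard String::from_utf16 produces when a UTF-16 sequence contains a surrogate code unit without its partner. Presumably graph-node hits it while decoding a string handed over from the AssemblyScript mapping, though the thread does not confirm the exact call site. The sketch below is plain TypeScript, not graph-ts code, and findLoneSurrogate is a hypothetical helper name; it only illustrates what makes a UTF-16 string ill-formed.

```typescript
// Returns the index of the first unpaired surrogate code unit,
// or -1 if the string is well-formed UTF-16.
// Hypothetical helper for illustration; not part of graph-ts or graph-node.
function findLoneSurrogate(s: string): number {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: must be immediately followed by a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // valid pair, skip the low half
      } else {
        return i;
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return i;
    }
  }
  return -1;
}
```

A string built from uninitialized or misread memory can easily contain such code units, which would fit the non-deterministic pattern described above, but that is only a guess.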

juanmardefago avatar May 14 '21 22:05 juanmardefago

Updating here again.

I haven't found out what's causing this, but to better exemplify what I'm experiencing: Here's the commit for the last time it broke: https://github.com/protofire/loopringv3.6-subgraph/commit/66315952f81c4b2d2adf77975af4311f542a9681

And here's the commit for when I managed to deploy a working version again: https://github.com/protofire/loopringv3.6-subgraph/commit/d63ecf112891f799e239a48733e72ff1861b644c

On the fixed commit, all I did was move a definition around in the yaml so that the subgraph ID would change and it would be indexed as a new subgraph instead of recycling the old, failed indexing. There was no code change. In fact, I had to do the exact same thing a few times until it actually indexed correctly (moving the entity list around in the yaml to trigger a re-index).

It's not happening in all subgraphs I've been working with, so there might be something particular on this subgraph that makes it easy to reproduce (it's triggered a few other bugs in the graph-node, since it does massive amounts of entity saves on each handler, because of what this subgraph actually does, which is to unpack L2 rollup data onto L1).

juanmardefago avatar May 18 '21 19:05 juanmardefago

Experiencing something similar with my recent changes: https://github.com/airswap/airswap-protocols/pull/621/files

Deployment ID where it reproduced: QmS2NGYxiva8mZKiUARqvpdyLyQPF2wDgWa52ACJiEhMUV

ejwessel avatar Jul 02 '21 16:07 ejwessel

Something I discovered on my most recent deployment, QmWXsMVVk5kQF6Z3UvwR81oksEbrf6vps3EeMSvqZMaPEV, which may help: the failure happens when trying to use a contract after the .bind()

In something like (with regards to my above changes):

let swapLightContract = SwapContract.bind(event.address)
owner = swapLightContract.owner() // <-- FAILS HERE

It doesn't matter which of the contract's methods I call: it fails with no error and no warnings. However, when querying for fatal errors on the subgraph through the endpoint, it still shows:

{
   "data":{
      "indexingStatusForPendingVersion":{
         "fatalError":{
            "message":"invalid utf-16: lone surrogate found\twasm backtrace:\t    0: 0x19b0 - <unknown>!<wasm function 73>\t    1: 0x1a08 - <unknown>!<wasm function 74>\t    2: 0x1cc4 - <unknown>!<wasm function 75>\t in handler `handleSwap` at block #12057420 (be8c71077a1d4dd8d3da788df9d8e37c21033a499ba7b037fc421c4faf151384)"
         },
         "nonFatalErrors":[
            
         ],
         "subgraph":"QmWXsMVVk5kQF6Z3UvwR81oksEbrf6vps3EeMSvqZMaPEV"
      }
   }
}
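
A minimal sketch of how a response like the one above could be requested and summarized, in plain TypeScript. The field names are taken from the response shown. The subgraph name passed in, the function names, and the choice to split the message on tabs are assumptions for illustration. The request body is meant to be POSTed to whichever index-node GraphQL endpoint you already query.

```typescript
// Shape of the response shown above; only the fields used here are typed.
interface IndexingStatusResponse {
  data: {
    indexingStatusForPendingVersion: {
      fatalError: { message: string } | null;
      nonFatalErrors: unknown[];
      subgraph: string;
    } | null;
  };
}

// Builds the JSON request body for the status query.
// `subgraphName` is a hypothetical "org/name" value, not one from this thread.
function buildStatusQuery(subgraphName: string): string {
  const query = `{
  indexingStatusForPendingVersion(subgraphName: ${JSON.stringify(subgraphName)}) {
    subgraph
    fatalError { message }
    nonFatalErrors { message }
  }
}`;
  return JSON.stringify({ query });
}

// Returns the part of the fatal error before the first tab
// (the wasm backtrace follows it), or null if there is no fatal error.
function fatalErrorSummary(res: IndexingStatusResponse): string | null {
  const status = res.data.indexingStatusForPendingVersion;
  if (status === null || status.fatalError === null) return null;
  return status.fatalError.message.split("\t")[0];
}
```

Applied to the response above, fatalErrorSummary would return only the leading error text, leaving out the backtrace and handler details.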

ejwessel avatar Jul 08 '21 15:07 ejwessel

Looking into the Airswap case, this was determined to be some weirdness with the AS compiler. This was fixed in the AS version update that is currently released as graph-cli/-ts versions 0.22.0-alpha.1, and will soon be released as 0.23.

leoyvens avatar Aug 16 '21 15:08 leoyvens

@juanmardefago I don't know if you are still seeing this with the Loopring subgraph, but it would be great to verify that this fixes the issue (I shared the alpha migration guide with you separately). Will close this issue when the new version is released.

azf20 avatar Aug 19 '21 11:08 azf20

This is resolved in apiVersion 0.0.5, which uses the latest version of AssemblyScript. The migration guide is here: https://thegraph.com/docs/developer/assemblyscript-migration-guide

@ejwessel @juanmardefago please re-open if you see this recur with the new version!

azf20 avatar Sep 29 '21 18:09 azf20

@azf20 I was discussing with @evaporei regarding this issue since it seems it hasn't been resolved actually, but rather just changed the error message due to the assemblyscript upgrade or subsequent graph-node updates.

The behaviour is still the same, it would randomly fail, with an unexpected null on this line

The token that is reported as null can never actually be null: if it were, the failure would be deterministic and it would fail every time. The subgraph is relatively peculiar, though, in that it needs to decode calldata. What we thought could be causing the non-determinism is some difference on the provider side (maybe the calldata is too big, or, given that these are call handlers, maybe the provider doesn't send the correct data to the graph-node, and a retry picks a different graph-node with a different provider, which lets it index correctly).

juanmardefago avatar May 05 '22 16:05 juanmardefago

I hit the same issue in local development with Docker. The graph-node version is:

graph-node 0.23.1 (2021-06-23)

graph-node_1  | Aug 25 10:58:25.036 ERRO Subgraph instance failed to run: invalid utf-16: lone surrogate found	wasm backtrace:	    0: 0x14f2 - <unknown>!<wasm function 64>	    1: 0x2b52 - <unknown>!<wasm function 172>...

Any way to debug the wasm binary for further information?

linxux avatar Aug 25 '22 12:08 linxux

@linxux Please use the latest graph-cli, graph-ts and graph-node and see if it still reproduces.

leoyvens avatar Aug 25 '22 13:08 leoyvens

@leoyvens

Everything works fine after updating to the latest versions.

  • @graphprotocol/graph-cli (0.33.0)
  • @graphprotocol/graph-ts (0.27.0)

docker pull graphprotocol/graph-node:latest

The latest graph-node also shows more specific error traces when run locally, which is very helpful during development.

linxux avatar Aug 29 '22 13:08 linxux