
Subnet-EVM not receiving buildBlock calls

Open lumos42 opened this issue 1 year ago • 9 comments

Describe the bug
We're running into a problem with our Fuji subnet. It's running with a single validator, but BuildBlock is never called. We've tried with multiple validators and some chain-config tweaks as well. I confirmed in block_builder that the signal is sent, but buildBlock is never invoked. How can we debug this?

It's running a custom VM merged with upstream subnet-evm v0.6.1, on AvalancheGo v1.11.1. It had already been running this version for a few days when we added and removed some validators to test a new feature, and block production stopped. getCurrentValidators shows exactly one validator, and it's connected. This validator can also reach the other subnet nodes (checked via the info.peers API). The subnet was created on Oct-3-2023.

To Reproduce
Not sure how to reproduce this exact situation; the same versions run fine in our local and mainnet environments.

Operating System
Ubuntu 22.04.2 LTS

lumos42 avatar Mar 13 '24 07:03 lumos42

Could you confirm that it's not calling BuildBlockWithContext either? After the Durango activation, the engine may call this instead of BuildBlock so that it can provide the P-Chain context for Warp verification.

I see that this is implemented in your VM here: https://github.com/hubble-exchange/hubblenet/blob/main/plugin/evm/vm.go#L820 but just want to confirm.
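For reference, the optional interface looks roughly like this (paraphrased from avalanchego's snow/engine/snowman/block package; check the exact signature in the version you have vendored):

    // A VM implementing this optional interface is handed a P-Chain
    // context when the engine asks it to build a block.
    type BuildBlockWithContextChainVM interface {
        // blockCtx.PChainHeight is the P-Chain height the new block
        // should be built at, needed for Warp message verification.
        BuildBlockWithContext(ctx context.Context, blockCtx *Context) (snowman.Block, error)
    }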

Would it be possible to provide steps to reproduce on your setup? That would make it much easier for us to assist with a custom VM, since we are not as familiar with your code as we are with ours.

A failing e2e test showing the following flow would be ideal (step 2 is sketched after the list):

  1. Transaction issuance
  2. VM sending commonEng.PendingTxs to the engine
  3. Resulting delay or absence of a call from the engine to either BuildBlock or BuildBlockWithContext
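
For step 2, the notification is a single message on the toEngine channel handed to the VM in Initialize. A minimal sketch (commonEng aliases avalanchego's snow/engine/common package):

    // Non-blocking send: if a notification is already pending,
    // there is no need to queue another one.
    select {
    case toEngine <- commonEng.PendingTxs:
        // the engine should eventually respond by calling
        // BuildBlock or BuildBlockWithContext
    default:
    }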

aaronbuchwald avatar Mar 13 '24 19:03 aaronbuchwald

@aaronbuchwald a unit test won't fail, since these same versions are working fine on mainnet and on local subnets, so I'm adding some logs from the validator node instead. Our validator checks the orderbook for matches and calls blockBuilder.signalTxsReady when potential matches are found, but buildBlockWithContext is never called.

t=2024-03-14T07:45:03.046361+0000 lvl=info msg="matching pipeline produced transactions; calling signalTxsReady" type=hubble caller=limit_order.go:183
t=2024-03-14T07:45:03.046420+0000 lvl=info msg="in signalTxsReady; calling markBuilding" type=system caller=block_builder.go:138
t=2024-03-14T07:45:03.046435+0000 lvl=info msg="in markBuilding"                     buildSent=false type=system caller=block_builder.go:111
t=2024-03-14T07:45:04.169600+0000 lvl=info msg="Setting new local account"           address=0x713DAAE39D84BD226b1971F109E2f5Dde2a734F6 type=system caller=txpool.go:982
t=2024-03-14T07:45:04.169816+0000 lvl=info msg="Submitted transaction"               hash=0x11f19b17cfcb5068606d3f2ec98ad87d8f1fec6e37a3a6f499b8376d34ae13f6 from=0x713DAAE39D84BD226b1971F109E2f5Dde2a734F6 nonce=159559 recipient=0x522879A79412299C0319bFD44494f0FE00B630c5 value=0 type=2      gasFeeCap=11000000000 gasTipCap=1000000000 gasPrice=11000000000 type=system caller=api.go:1899
t=2024-03-14T07:45:04.169996+0000 lvl=info msg="in signalTxsReady; calling markBuilding" type=system caller=block_builder.go:138
t=2024-03-14T07:45:04.170042+0000 lvl=info msg="in markBuilding"                     buildSent=true  type=system caller=block_builder.go:111
t=2024-03-14T07:45:07.979148+0000 lvl=info msg="matching pipeline produced transactions; calling signalTxsReady" type=hubble caller=limit_order.go:183
t=2024-03-14T07:45:07.979199+0000 lvl=info msg="in signalTxsReady; calling markBuilding" type=system caller=block_builder.go:138
t=2024-03-14T07:45:07.979226+0000 lvl=info msg="in markBuilding"                     buildSent=true  type=system caller=block_builder.go:111
t=2024-03-14T07:45:10.607884+0000 lvl=info msg="Submitted transaction"               hash=0x17866f5c480bf4d43e104f5e5989095d8a7e928c6a8f0fba11d0bfa792e0fc24 from=0x713DAAE39D84BD226b1971F109E2f5Dde2a734F6 nonce=159559 recipient=0x522879A79412299C0319bFD44494f0FE00B630c5 value=0 type=2      gasFeeCap=12375000000 gasTipCap=1125000000 gasPrice=12375000000 type=system caller=api.go:1899
t=2024-03-14T07:45:10.607978+0000 lvl=info msg="in signalTxsReady; calling markBuilding" type=system caller=block_builder.go:138
t=2024-03-14T07:45:10.608033+0000 lvl=info msg="in markBuilding"                     buildSent=true  type=system caller=block_builder.go:111
t=2024-03-14T07:45:12.979518+0000 lvl=info msg="matching pipeline produced transactions; calling signalTxsReady" type=hubble caller=limit_order.go:183
t=2024-03-14T07:45:12.979565+0000 lvl=info msg="in signalTxsReady; calling markBuilding" type=system caller=block_builder.go:138
t=2024-03-14T07:45:12.979586+0000 lvl=info msg="in markBuilding"                     buildSent=true  type=system caller=block_builder.go:111
t=2024-03-14T07:45:17.125724+0000 lvl=info msg="Submitted transaction"               hash=0xee158d4e826ec876dcf7ede6b8f708f390de3b2915e57c5e50264df24ebb6026 from=0x713DAAE39D84BD226b1971F109E2f5Dde2a734F6 nonce=159559 recipient=0x522879A79412299C0319bFD44494f0FE00B630c5 value=0 type=2      gasFeeCap=13921875000 gasTipCap=1265625000 gasPrice=13921875000 type=system caller=api.go:1899
t=2024-03-14T07:45:17.125807+0000 lvl=info msg="in signalTxsReady; calling markBuilding" type=system caller=block_builder.go:138
t=2024-03-14T07:45:17.125915+0000 lvl=info msg="in markBuilding"                     buildSent=true  type=system caller=block_builder.go:111
t=2024-03-14T07:45:17.982261+0000 lvl=info msg="matching pipeline produced transactions; calling signalTxsReady" type=hubble caller=limit_order.go:183
t=2024-03-14T07:45:17.982324+0000 lvl=info msg="in signalTxsReady; calling markBuilding" type=system caller=block_builder.go:138
t=2024-03-14T07:45:17.982345+0000 lvl=info msg="in markBuilding"                     buildSent=true  type=system caller=block_builder.go:111

As you can see in the logs, buildSent=false the first time markBuilding is called, but it is true in every subsequent call. Normally "#### Building block with context" would appear in the logs after signalTxsReady is called, but that isn't happening. buildBlock is not called even when regular transactions are submitted. This validator is running a debug branch.
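For context, the gating in our block_builder looks roughly like this (simplified; field names are from our debug branch and may differ from upstream subnet-evm):

    // markBuilding forwards the signal to the engine at most once;
    // buildSent stays true until the engine actually calls
    // BuildBlock, which is why every later call is a no-op here.
    func (b *blockBuilder) markBuilding() {
        b.buildBlockLock.Lock()
        defer b.buildBlockLock.Unlock()
        if b.buildSent {
            return
        }
        select {
        case b.notifyBuildBlockChan <- commonEng.PendingTxs:
            b.buildSent = true
        default:
        }
    }

    // Runs when the engine calls BuildBlock and clears the flag so
    // the next signal can go through. Never reached in the logs above.
    func (b *blockBuilder) handleGenerateBlock() {
        b.buildBlockLock.Lock()
        defer b.buildBlockLock.Unlock()
        b.buildSent = false
    }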

lumos42 avatar Mar 14 '24 07:03 lumos42

This is the response of getCurrentValidators when called locally:

{
   "id" : 1,
   "jsonrpc" : "2.0",
   "result" : {
      "validators" : [
         {
            "connected" : true,
            "endTime" : "1712819977",
            "nodeID" : "NodeID-Hu8J35p6G59WuYEsNtKDMYqruxRXnqBGX",
            "stakeAmount" : "20",
            "startTime" : "1710142077",
            "txID" : "5yWX6HdWjNpT6AGfEmooBBv42xuUwbb2oAoFwtGri3xBRHz5k",
            "uptime" : "100.0000",
            "weight" : "20"
         }
      ]
   }
}

Let me know if I can provide more data that could help diagnose this.

lumos42 avatar Mar 14 '24 07:03 lumos42

@aaronbuchwald I added some logs to avalanchego v1.11.1 on the Fuji testnet to investigate this. From the logs, it seems that avalanchego never receives the notification when subnet-evm sends a PendingTxs message on the channel; it only receives gossip_request messages in this function.

I tried the same on a local network and on 2 different Fuji nodes (each was the only subnet node at the time of testing) with the same versions. On the Fuji nodes, handleChanMsg never received the notification. On the local network, avalanchego received a PendingTxs message immediately when the EVM's markBuilding was called.

On the Fuji nodes I even tried changing the avalanchego version (and the corresponding subnet-evm version) to v1.10.18, but got the same results.

What could be the reason for this? What should I do to investigate further?

Subnet ID on Fuji: t2WSjSsoE3geV9ARu5r7gzTc5UayePy3NxDrSTx7hadLYvqbg

lumos42 avatar Mar 19 '24 06:03 lumos42

Hmm, the VM is also wrapped by the ProposerVM, which passes the notification through only when it is that node's turn to propose a block.

This ensures the VM only needs to worry about when it has transactions and could propose a block; the ProposerVM passes the notification on to the engine when it is in fact its turn to propose a block according to Snowman++.

May be worth looking here: https://github.com/ava-labs/avalanchego/blob/master/vms/proposervm/README.md
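Conceptually, the ProposerVM's scheduler sits between the inner VM and the engine, something like this (a sketch only; the real code is in avalanchego's vms/proposervm/scheduler and handles de-duplication and cancellation, and minDelayForThisProposer is a hypothetical helper standing in for the Snowman++ selection):

    msg := <-fromInnerVM               // inner VM signaled PendingTxs
    delay := minDelayForThisProposer() // Snowman++ proposer selection
    time.Sleep(delay)                  // hold the message until this node's turn
    toEngine <- msg                    // only now does the engine see it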

aaronbuchwald avatar Mar 19 '24 14:03 aaronbuchwald

Thank you for the prompt reply @aaronbuchwald.

Investigating further in the proposervm, I found that our Fuji node returns a duration of 242h from MinDelayForProposer. That's because no blocks have been produced for more than 11 days. Also, the way I understand getPostDurangoSlotTime, if no blocks have been produced for a long time, the validator has to wait out that duration before proposing.
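For reference, my back-of-envelope reading of the slot math (illustrative only; WindowDuration is 5s in vms/proposervm/proposer):

    elapsed := 11 * 24 * time.Hour             // time since the parent block
    slot := int64(elapsed / (5 * time.Second)) // ≈ 190,080 slots behind
    // As I read it, MinDelayForProposer walks forward from the
    // current slot to find one where this node is the expected
    // proposer, returning a delay relative to the parent timestamp.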

Could this be the issue?

lumos42 avatar Mar 21 '24 11:03 lumos42

Hmm, MinDelayForProposer should not be returning a duration of 242h on a node that is the only validator of the network.

MinDelayForProposer should return the minimum amount of time the node needs to wait before proposing a block, as outlined here: https://github.com/ava-labs/avalanchego/blob/master/vms/proposervm/README.md#snowman-proposers-selection-mechanism. The current ProposerVM implementation starts with a single selected proposer and, if that proposer does not produce a block within 5s, opens the slot up to more of the validator set.

Could you confirm that those logs are coming from the one node that is a validator (NodeID-Hu8J35p6G59WuYEsNtKDMYqruxRXnqBGX, per the API response above)?

aaronbuchwald avatar Mar 26 '24 20:03 aaronbuchwald

Could you share a zip of the validator's log directory? I'd be interested to see the debug logs from the ProposerVM to see the path that it takes to returning 242h.

aaronbuchwald avatar Mar 26 '24 20:03 aaronbuchwald

In MinDelayForProposer, what P-Chain height is being referenced? The parent block's P-Chain height determines the validator set that can produce the next block, so if the validator set has changed and that validator is no longer active, that may be the issue.

If you run platform.getValidatorsAt (https://docs.avax.network/reference/avalanchego/p-chain/api#platformgetvalidatorsat) at that P-Chain height and see that the validator set is either empty, or that the validators included at that height are no longer validating the subnet, that would explain the issue.
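For example, a request along these lines POSTed to the P-Chain endpoint /ext/bc/P (the height here is a placeholder; substitute the parent block's actual P-Chain height):

    {
       "jsonrpc" : "2.0",
       "id" : 1,
       "method" : "platform.getValidatorsAt",
       "params" : {
          "height" : 1000000,
          "subnetID" : "t2WSjSsoE3geV9ARu5r7gzTc5UayePy3NxDrSTx7hadLYvqbg"
       }
    }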

aaronbuchwald avatar Mar 26 '24 22:03 aaronbuchwald

Hi @aaronbuchwald, on checking the validators at that P-Chain height, I found that there was only one validator at that height and that its node had been stopped. After starting that node, the chain began building blocks again.

Does this mean that if there's ever a single validator and we need to switch validators, we must keep the old validator running until the new validator has started producing blocks?

lumos42 avatar Apr 02 '24 08:04 lumos42

Yup, as of the DUpgrade (Durango), only validators are allowed to propose blocks via the ProposerVM.

There's already an assumption that your validator set should never be empty because this makes it impossible to perform consensus (nobody has a vote).

Since the P-Chain height of the parent block determines the validator set used by the ProposerVM for the next block, if your network does not produce any blocks while it switches from a single validator (old) to a different single validator (new), then only the old validator is allowed to propose the next block. TL;DR: yes, you need to ensure there is never a point at which no validator is online from the perspective of the Subnet itself (which depends on the last accepted block's P-Chain height).

This is a bit of an edge case introduced with Durango for networks that produce few or no blocks, and it's probably a good reason to require VMs to produce blocks at a regular cadence instead of being fully quiescent.
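For example, a VM could nudge the engine on a timer even when the mempool is quiet. A rough sketch (buildInterval and shutdownChan are illustrative names, not existing subnet-evm fields):

    go func() {
        ticker := time.NewTicker(buildInterval)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                // non-blocking: skip if a notification is already pending
                select {
                case toEngine <- commonEng.PendingTxs:
                default:
                }
            case <-shutdownChan:
                return
            }
        }
    }()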

aaronbuchwald avatar Apr 02 '24 13:04 aaronbuchwald