
Not enough public IPs in farm despite having free public IPs

Open · A-Harby opened this issue 1 year ago

Describe the bug

Mainnet, 2.4.2.

The Dashboard gives the error "Not enough public IPs in farm" despite there being free public IPs on the farm.

To Reproduce

Steps to reproduce the behavior: Just deploy any instance on the dashboard using IPv4.

Screenshots

(screenshots of the Dashboard error)

A-Harby · May 22 '24 14:05

Upon further investigation, these two IPs on farm ID 1 look free on GraphQL.


However, they are reserved on the chain.

    {
      ip: 185.69.166.147/24
      gateway: 185.69.166.1
      contractId: 15946
    },
    {
      ip: 185.69.166.157/24
      gateway: 185.69.166.1
      contractId: 16178
    }

However, @sameh-farouk found that these contracts were deleted, yet they are still present on the farm object and still reserving these IPs.

AhmedHanafy725 · May 22 '24 14:05

Update: I attempted to debug this issue today. I found that the calls that created and canceled these contracts were executed with runtime 116. I inspected the codebase tagged v2.1.4 to see if it contained any bugs, and I couldn't find anything obviously wrong with the logic of reserving and unreserving IPs. Digging more into what happened versus how it happened, I found that:

  • These contracts were created with 0 public IPs, so they shouldn't have triggered the reserve IP logic.
  • There were no events releasing these IPs when the contracts were canceled, which makes sense, since no public IPs were reserved for these contracts.

https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Ftfchain.grid.tf%2Fws#/explorer/query/5844419

What could be causing these incorrect states? It's still not clear to me, but it's probably either:

  • Likely, an old migration that interfered with some of the public IPs' states
  • Less likely, a hidden bug in runtime 116

Now, how to fix this:

  • The council can call force_reset_farm_ip to force unreserve these IPs

It's important to conduct checks to ensure that there are no similar cases with other IPs on different farms as well. I'll follow up.

sameh-farouk · May 29 '24 20:05

@LeeSmet @renauter Can we approve the suggested solution here so I can open tickets for ops to prepare and execute the related proposals on mainnet and free these IPs? Otherwise, they will remain reserved forever.

sameh-farouk · Jun 02 '24 10:06

I'm personally a bit uneasy with the force reset if the root cause is not identified. Since the working theory seems to be that something messed with the state in an unexpected way, removing the IPs without knowing the full extent could lead to some more dangling state.

LeeSmet · Jun 03 '24 14:06

  • The council can call force_reset_farm_ip to force unreserve these IPs

I think I found the origin of the force_reset_farm_ip() creation. See here: https://github.com/threefoldtech/tfchain/pull/308 https://github.com/threefoldtech/tfchain/commit/072bdc89b3b92ac71f65b594befb5e696fcbebca

My guess is that when remove_contract() was called in some context, _free_ip() didn't work as expected.

renauter · Jun 04 '24 04:06

I'm personally a bit uneasy with the force reset if the root cause is not identified. Since the working theory seems to be that something messed with the state in an unexpected way, removing the IPs without knowing the full extent could lead to some more dangling state.

@LeeSmet I am currently in the process of debugging to identify the root cause of the issue. I agree with you that understanding the root cause is crucial. However, I would like to present a counter-argument to the concern that resetting these IPs could lead to more dangling state: these contracts have already been removed from the chain, and it doesn't seem logical to keep these IPs reserved for contracts that no longer exist.

Please note that the effect of the call is limited to the provided IP, and it only resets its contract id to 0. The code is straightforward, and I don’t foresee any harm in proceeding in this manner. We can certainly wait until the root cause is identified, but regardless, we will need to execute these calls eventually to free up these IPs, in addition to any other fixes.
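For illustration only, here is a minimal, self-contained Rust sketch of what such a reset amounts to. The real extrinsic lives in pallet-tfgrid and operates on chain storage; the struct and field names below are stand-ins invented for the example, not the actual runtime types.

    // Illustrative stand-ins for the on-chain farm/IP records; field names are
    // assumptions for this example, not the real pallet-tfgrid types.
    #[derive(Debug)]
    struct PublicIp {
        ip: String,
        gateway: String,
        contract_id: u64, // 0 means "not reserved"
    }

    #[derive(Debug)]
    struct Farm {
        id: u32,
        public_ips: Vec<PublicIp>,
    }

    /// Reset the reservation of a single IP on a farm: find the matching entry
    /// and set its contract_id back to 0, leaving every other IP untouched.
    fn force_reset_farm_ip(farm: &mut Farm, ip: &str) -> Result<(), &'static str> {
        let entry = farm
            .public_ips
            .iter_mut()
            .find(|p| p.ip == ip)
            .ok_or("IP not found on farm")?;
        entry.contract_id = 0;
        Ok(())
    }

    fn main() {
        let mut farm = Farm {
            id: 1,
            public_ips: vec![PublicIp {
                ip: "185.69.166.157/24".into(),
                gateway: "185.69.166.1".into(),
                contract_id: 16178, // reserved by a contract that no longer exists
            }],
        };
        force_reset_farm_ip(&mut farm, "185.69.166.157/24").unwrap();
        assert_eq!(farm.public_ips[0].contract_id, 0);
        println!("{farm:?}");
    }

The on-chain call is gated behind a privileged (council) origin, per the discussion above, but the net effect on the targeted IP is the same: its reservation is cleared while the rest of the farm is left alone.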

Lastly, I would like to point out that freeing these IPs will not hinder any investigations since all activities are recorded on the blockchain. Therefore, I propose that we work in parallel by freeing these IPs while continuing to investigate this incident. This approach is particularly important as the issue partially affects deploying on mainnet.

sameh-farouk · Jun 04 '24 11:06

Alright, I’ve managed to find the root cause of this erroneous state. Into the details: we begin with the instance at IP 185.69.166.157 and contract 16178, which is reserving that IP on-chain.

The creation of contract 16178 occurred at block 5844419. As previously mentioned, it was created with 0 public IPs, so it shouldn’t have reserved any IP.


However, an examination of the state of farm 1 at the parent block shows that the IP was already reserved for contract 16178 before its creation.

I conducted a backward search to determine when exactly the IP was reserved for this specific contract and found that this occurred at block 5844377. I then began to scrutinize the events and calls at this block to understand how we arrived at this state after the block was finalized.
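As a side note on the search step: given access to historical state, one way to pin down the exact block is to bisect over block heights rather than walking backwards one block at a time. Below is a toy sketch of that idea, with a closure standing in for the "query the farm's public IPs at block N" lookup; the block numbers are simply the ones from this thread.

    /// Find the first block height in [lo, hi] at which `reserved_at(block)`
    /// returns true, assuming it is false up to some point and true afterwards.
    /// `reserved_at` stands in for "fetch the farm's state at this block and
    /// check whether the IP is reserved for the contract in question".
    fn first_reserved_block(mut lo: u32, mut hi: u32, reserved_at: impl Fn(u32) -> bool) -> Option<u32> {
        if !reserved_at(hi) {
            return None; // never became reserved in this range
        }
        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            if reserved_at(mid) {
                hi = mid; // already reserved here: the flip is at mid or earlier
            } else {
                lo = mid + 1; // still free here: the flip is later
            }
        }
        Some(lo)
    }

    fn main() {
        // Toy stand-in: pretend the IP became reserved at block 5_844_377.
        let reserved_at = |block: u32| block >= 5_844_377;
        let found = first_reserved_block(5_800_000, 5_844_419, reserved_at);
        assert_eq!(found, Some(5_844_377));
        println!("IP first appears reserved at block {found:?}");
    }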

At this block's state, the last contract ID was 16177. However, there was a failed transaction. This transaction attempted to create a node contract, which, if successful, would have been assigned the next available ID of 16178.


Upon examining the call arguments and decoding the error details, we can ascertain two things: the optional solution provider provided by the user was 2, and the error that caused the transaction to fail was SolutionProviderNotApproved. To verify this, I checked the solution provider with ID 2 at this block state.

{
  solutionProviderId: 2
  providers: [
    {
      who: 5HYbojcq9jBrxtuW3wfgMSuEq4G6BTMN2vW2ZDiYTKJXVEmF
      take: 50
    }
  ]
  description: zonaris is a decentralised blockchain node hosting solution built on the ThreeFold Grid. 
  link: https://www.zonaris.io/
  approved: false
}

Now, let’s piece together the final part of the puzzle: why did this failed transaction alter the state?

In older versions of Substrate, if a function within the runtime modified storage during an extrinsic, it was impossible to undo those specific changes once they were made, even if the extrinsic later failed.

This led to a crucial “best practice” in runtime development: “verify first, write last”. This principle states that you should never return an Err from an extrinsic after you have modified storage, as those changes will persist, potentially leading to an inconsistent state.

Looking at create_node_contract: initially, it validates a few conditions, followed by a call to the _create_contract function. This function starts to modify the state and reserves the IP, but then attempts to validate the solution provider. If this validation fails, it results in an error, leaving the IP reserved for a contract that was never created.

https://github.com/threefoldtech/tfchain/blob/c242557048e5ef59a14bc47cb1cbb0f24538cce1/substrate-node/pallets/pallet-smart-contract/src/lib.rs#L648-L652
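To make the failure mode concrete, here is a small, self-contained Rust simulation of that ordering. Plain in-memory values stand in for chain storage, and the function below is not the actual pallet code, only the shape of the bug: the write happens before the check, so when the check fails the write survives.

    // Plain in-memory stand-in for chain storage.
    struct State {
        ip_reservation: u64, // contract id reserving the IP; 0 means free
        solution_provider_approved: bool,
        next_contract_id: u64,
    }

    #[derive(Debug)]
    enum Error {
        SolutionProviderNotApproved,
    }

    // Mirrors the problematic ordering: write first ("reserve the IP"), then
    // validate ("is the solution provider approved?"). On failure the function
    // returns Err, but the write has already happened and is never rolled back.
    fn create_node_contract(state: &mut State, solution_provider: Option<u64>) -> Result<u64, Error> {
        let contract_id = state.next_contract_id;
        // 1. The state is modified: the IP gets reserved for the contract-to-be.
        state.ip_reservation = contract_id;
        // 2. Only afterwards is the solution provider validated.
        if solution_provider.is_some() && !state.solution_provider_approved {
            return Err(Error::SolutionProviderNotApproved);
        }
        // 3. The id counter only advances on success, so the next successful
        //    creation is assigned the very same id the failed one would have had.
        state.next_contract_id += 1;
        Ok(contract_id)
    }

    fn main() {
        let mut state = State {
            ip_reservation: 0,
            solution_provider_approved: false, // provider 2 was not approved
            next_contract_id: 16178,
        };
        let result = create_node_contract(&mut state, Some(2));
        // The call failed, yet the reservation stuck: without a transactional
        // rollback of storage, the IP stays reserved for contract id 16178.
        assert!(result.is_err());
        assert_eq!(state.ip_reservation, 16178);
        println!("{result:?}; IP now reserved by contract {}", state.ip_reservation);
    }

This also matches the observed sequence: the failed call left the IP reserved for id 16178, and the next successful create_node_contract (the one with 0 public IPs) was then assigned that same id, which is why the IP appears reserved for a contract that never asked for it.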

To the best of my knowledge, this issue is no longer pertinent, as the behavior was changed in more recent releases of Substrate. Currently, the extrinsic execution logic discards any storage changes if an error occurs.

So in the most recent version of the TFChain runtime, this issue can't occur. It's also straightforward to try to reproduce and verify.
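That newer behavior can be imitated in a self-contained way by committing a scratch copy of the state only on success. In the real runtime this is done with storage transaction layers around dispatchables rather than by cloning, so the snippet below only illustrates the semantics.

    /// Minimal analogue of transactional execution: run `f` against a scratch
    /// copy of the state and commit the copy only if `f` succeeds.
    fn with_rollback<S: Clone, T, E>(
        state: &mut S,
        f: impl FnOnce(&mut S) -> Result<T, E>,
    ) -> Result<T, E> {
        let mut scratch = state.clone();
        let out = f(&mut scratch)?;
        *state = scratch; // commit only on success
        Ok(out)
    }

    fn main() {
        let mut balance: u64 = 100;
        let result: Result<(), &str> = with_rollback(&mut balance, |b| {
            *b -= 30;                // a write made before the check
            Err("validation failed") // the call errors afterwards
        });
        assert!(result.is_err());
        assert_eq!(balance, 100); // the earlier write was discarded
        println!("balance after failed call: {balance}");
    }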

sameh-farouk · Jun 05 '24 13:06

Great investigation! Thank you.

I am OK with the solution since the code is already in place (probably for the same reason) and has probably already been used to restore coherency of the storage state.

To evaluate the number of IPs which are reserved but shouldn't be, an additional check could be added to the pallet-tfgrid live checking (https://github.com/threefoldtech/tfchain/pull/817). Then we could have a list of IPs to "free".
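As a rough illustration of what such a check would look for: it boils down to listing every public IP whose contract_id points at a contract that no longer exists. The sketch below uses simplified stand-in types rather than the actual pallet-tfgrid and pallet-smart-contract storage, and the field names are assumptions for the example.

    use std::collections::HashSet;

    // Simplified stand-ins for the on-chain records; field names are illustrative.
    struct PublicIp {
        ip: String,
        contract_id: u64, // 0 means free
    }

    struct Farm {
        id: u32,
        public_ips: Vec<PublicIp>,
    }

    /// Return (farm id, ip) pairs whose reserving contract id is not in the set
    /// of contracts that still exist: these are the IPs that should be freed.
    fn dangling_ips<'a>(farms: &'a [Farm], existing_contracts: &HashSet<u64>) -> Vec<(u32, &'a str)> {
        let mut out = Vec::new();
        for farm in farms {
            for ip in &farm.public_ips {
                // Reserved (non-zero) but the reserving contract no longer exists.
                if ip.contract_id != 0 && !existing_contracts.contains(&ip.contract_id) {
                    out.push((farm.id, ip.ip.as_str()));
                }
            }
        }
        out
    }

    fn main() {
        let farms = vec![Farm {
            id: 1,
            public_ips: vec![
                PublicIp { ip: "185.69.166.147/24".into(), contract_id: 15946 },
                PublicIp { ip: "185.69.166.157/24".into(), contract_id: 16178 },
            ],
        }];
        // Neither reserving contract exists on chain any more.
        let existing: HashSet<u64> = HashSet::new();
        for (farm_id, ip) in dangling_ips(&farms, &existing) {
            println!("farm {farm_id}: {ip} is reserved by a deleted contract");
        }
    }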

renauter · Jun 05 '24 14:06

Update: A ticket was submitted for ops to handle freeing these two public IPs.

sameh-farouk · Jun 06 '24 09:06

Verified!


sameh-farouk · Jun 06 '24 10:06