Not enough public IPs in farm despite having free public IPs
Describe the bug
Mainnet, 2.4.2.
The dashboard gives the error "Not enough public IPs in farm" despite there being free public IPs on the farm.
To Reproduce
Steps to reproduce the behavior: Just deploy any instance on the dashboard using IPv4.
Screenshots
On further investigation, these 2 IPs on farm ID 1 look free on GraphQL.
However, they are reserved on the chain.
{
ip: 185.69.166.147/24
gateway: 185.69.166.1
contractId: 15946
},
{
ip: 185.69.166.157/24
gateway: 185.69.166.1
contractId: 16178
}
However, @sameh-farouk found that these contracts were deleted, yet they are still on the farm object, reserving these IPs.
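For context, the farm object on chain keeps a list of public IP entries shaped roughly like the GraphQL output above, and an entry whose contractId is non-zero counts as reserved. A minimal Rust sketch of that idea (simplified types, not the actual pallet-tfgrid definitions):

```rust
// Simplified model of a farm's public IP entry. The real pallet-tfgrid types
// use bounded byte vectors and richer wrappers; the point here is only that a
// non-zero contract_id marks the IP as reserved.
#[derive(Debug, Clone)]
struct PublicIp {
    ip: String,       // e.g. "185.69.166.157/24"
    gateway: String,  // e.g. "185.69.166.1"
    contract_id: u64, // 0 = free, otherwise reserved by that contract
}

/// An IP can only be handed to a new deployment when no contract holds it.
fn is_free(entry: &PublicIp) -> bool {
    entry.contract_id == 0
}

fn main() {
    let entry = PublicIp {
        ip: "185.69.166.157/24".into(),
        gateway: "185.69.166.1".into(),
        contract_id: 16178, // still set even though contract 16178 was deleted
    };
    // The chain counts this IP as taken, hence "Not enough public IPs in farm".
    println!("free: {}", is_free(&entry));
}
```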
Update: I attempted to debug this issue today. I found that the calls that created and canceled these contracts were executed with runtime 116. I inspected the codebase tagged with v2.1.4 to see if it contained any bugs, and I couldn't find anything obvious in the logic of reserving and unreserving IPs. Digging more into what happened vs. how it happened, I found that:
- These contracts were created with 0 public IPs, so this shouldn't have triggered the reserve IP logic.
- There were no events releasing these IPs when these contracts were canceled, which makes sense, since these contracts had no public IPs reserved.
https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Ftfchain.grid.tf%2Fws#/explorer/query/5844419
What could be causing these incorrect states? It's still not clear to me, but it's probably either:
- Likely, an old migration that interfered with some of the public IPs' states
- Less likely, a hidden bug in the runtime 116
Now, how to fix this:
- The council can call force_reset_farm_ip to force-unreserve these IPs
It's important to conduct checks to ensure that there are no similar cases with other IPs on different farms as well. I'll follow up.
@LeeSmet @renauter Can we approve the suggested solution here so I can open tickets for ops to prepare and execute the related proposals on mainnet and free these IPs? Otherwise, they will remain reserved forever.
I'm personally a bit uneasy with the force reset if the root cause is not identified. Since the working theory seems to be that something messed with the state in an unexpected way, removing the IPs without knowing the full extent could lead to some more dangling state.
- The council can call force_reset_farm_ip to force-unreserve these IPs
I think I found the origin of the force_reset_farm_ip() creation.
See here:
https://github.com/threefoldtech/tfchain/pull/308
https://github.com/threefoldtech/tfchain/commit/072bdc89b3b92ac71f65b594befb5e696fcbebca
My guess is that when remove_contract() was called in some context, _free_ip() didn't work as expected.
I'm personally a bit uneasy with the force reset if the root cause is not identified. Since the working theory seems to be that something messed with the state in an unexpected way, removing the IPs without knowing the full extent could lead to some more dangling state.
@LeeSmet I am currently in the process of debugging to identify the root cause of the issue. I agree with you that understanding the root cause is crucial. However, I would like to offer a counter-argument to the concern that resetting these IPs could lead to more dangling states: these contracts have been removed from the chain, and it doesn't seem logical to keep these IPs reserved for contracts that no longer exist.
Please note that the effect of the call is limited to the provided IP, and it only resets its contract id to 0. The code is straightforward, and I don’t foresee any harm in proceeding in this manner. We can certainly wait until the root cause is identified, but regardless, we will need to execute these calls eventually to free up these IPs, in addition to any other fixes.
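To make that limited effect concrete, here is a rough sketch of the reset logic as described (simplified standalone Rust, not the actual pallet-tfgrid extrinsic; the real call is council/root-gated and operates on the stored farm object):

```rust
#[derive(Debug)]
struct PublicIp {
    ip: String,
    gateway: String,
    contract_id: u64, // 0 = free
}

#[derive(Debug)]
struct Farm {
    id: u32,
    public_ips: Vec<PublicIp>,
}

/// Reset the contract id of a single IP on a farm back to 0 (free), leaving
/// every other entry untouched. This mirrors the effect described above for
/// force_reset_farm_ip; the error naming here is illustrative only.
fn force_reset_farm_ip(farm: &mut Farm, ip: &str) -> Result<(), &'static str> {
    let entry = farm
        .public_ips
        .iter_mut()
        .find(|entry| entry.ip == ip)
        .ok_or("IP not found on farm")?;
    entry.contract_id = 0;
    Ok(())
}

fn main() {
    let mut farm = Farm {
        id: 1,
        public_ips: vec![PublicIp {
            ip: "185.69.166.157/24".into(),
            gateway: "185.69.166.1".into(),
            contract_id: 16178, // dangling reservation
        }],
    };
    force_reset_farm_ip(&mut farm, "185.69.166.157/24").unwrap();
    assert_eq!(farm.public_ips[0].contract_id, 0); // IP is free again
}
```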
Lastly, I would like to point out that freeing these IPs will not hinder any investigations since all activities are recorded on the blockchain. Therefore, I propose that we work in parallel by freeing these IPs while continuing to investigate this incident. This approach is particularly important as the issue partially affects deploying on mainnet.
Alright, I’ve managed to find the root cause of this erroneous state. Let’s get into the details. We begin with the IP 185.69.166.157 and contract 16178, which is reserving it on-chain.
The creation of contract 16178 occurred at block 5844419. As previously mentioned, it was created with 0 public IPs, so it shouldn’t have reserved any IP.
However, an examination of the state of farm 1 at the parent block shows that the IP was already reserved for contract 16178 before its creation.
I conducted a backward search to determine when exactly the IP was reserved for this specific contract and found that this occurred at block 5844377. I then began to scrutinize the events and calls at this block to understand how we arrived at this state after the block was finalized.
At this block's state, the last contract ID was 16177. However, there was a failed transaction in this block. It attempted to create a node contract which, if successful, would have been assigned the next available ID, 16178.
Upon examining the call arguments and decoding the error details, we can ascertain two things: the optional solution provider provided by the user was 2, and the error that caused the transaction to fail was SolutionProviderNotApproved. To verify this, I checked the solution provider with ID 2 at this block state.
{
solutionProviderId: 2
providers: [
{
who: 5HYbojcq9jBrxtuW3wfgMSuEq4G6BTMN2vW2ZDiYTKJXVEmF
take: 50
}
]
description: zonaris is a decentralised blockchain node hosting solution built on the ThreeFold Grid.
link: https://www.zonaris.io/
approved: false
}
Now, let’s piece together the final part of the puzzle: why did this failed transaction alter the state?
In older versions of Substrate, if a runtime function modified storage during an extrinsic, there was no automatic way to undo those specific changes when the extrinsic later failed.
This led to a crucial “best practice” in runtime development: “verify first, write last”. This principle states that you should never return an Err from an extrinsic after you have modified storage, as those changes will persist, potentially leading to an inconsistent state.
Looking at create_node_contract: initially, it validates a few conditions, then calls the _create_contract function. This function starts to modify the state and reserves the IP, but then attempts to validate the solution provider. If this validation fails, it returns an error, leaving the IP reserved for a contract that was never created.
https://github.com/threefoldtech/tfchain/blob/c242557048e5ef59a14bc47cb1cbb0f24538cce1/substrate-node/pallets/pallet-smart-contract/src/lib.rs#L648-L652
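As a simplified illustration of that ordering (hypothetical standalone Rust, not the actual _create_contract code): the IP reservation is written first and the solution provider is validated afterwards, so under the old non-transactional semantics the early write survives the later error.

```rust
use std::collections::HashMap;

// Stand-in for on-chain storage. Under the old Substrate behaviour, writes made
// before an extrinsic returns Err are NOT rolled back, which is what this
// sketch demonstrates.
#[derive(Default)]
struct Storage {
    reserved_ips: HashMap<String, u64>, // ip -> reserving contract id
    approved_solution_providers: HashMap<u64, bool>,
}

fn create_node_contract(
    storage: &mut Storage,
    next_contract_id: u64,
    public_ip: Option<&str>,
    solution_provider_id: Option<u64>,
) -> Result<u64, &'static str> {
    // Step 1: storage is modified first -- the IP gets reserved for the
    // contract id that *would* be assigned.
    if let Some(ip) = public_ip {
        storage.reserved_ips.insert(ip.to_string(), next_contract_id);
    }

    // Step 2: only then is the solution provider validated. Failing here
    // returns Err, but the reservation from step 1 is already in storage.
    if let Some(sp) = solution_provider_id {
        let approved = storage
            .approved_solution_providers
            .get(&sp)
            .copied()
            .unwrap_or(false);
        if !approved {
            return Err("SolutionProviderNotApproved");
        }
    }

    Ok(next_contract_id)
}

fn main() {
    let mut storage = Storage::default();
    storage.approved_solution_providers.insert(2, false); // provider 2 not approved

    // This failed call mirrors the transaction at block 5844377.
    let result = create_node_contract(&mut storage, 16178, Some("185.69.166.157/24"), Some(2));
    assert!(result.is_err());

    // Without a rollback, the IP stays reserved for a contract that was never created.
    assert_eq!(storage.reserved_ips.get("185.69.166.157/24"), Some(&16178));
}
```

In this model, the failed call leaves the IP mapped to contract id 16178 even though that contract was never created, which matches the dangling reservation found on chain.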
To the best of my knowledge, this issue is no longer pertinent, as the behavior was changed in more recent releases of Substrate. Currently, the extrinsic execution logic discards any storage changes if an error occurs.
So in the most recent version of the TFChain runtime, this issue can't occur. It's also straightforward to reproduce and verify.
Great investigation! Thank you.
I am OK with the solution since the code is already in place (probably for the same reason) and has probably already been used to restore coherency of the storage state.
To evaluate the number of IPs which are reserved but should not be, an additional check could be added to the pallet-tfgrid live checks: https://github.com/threefoldtech/tfchain/pull/817. Then we could have a list of IPs to "free".
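A rough sketch of what such a check could look like (hypothetical helper in standalone Rust, reusing the simplified types from the earlier sketches; the real version would live alongside the existing live checks from the linked PR and read the actual pallet storage):

```rust
use std::collections::HashSet;

#[derive(Debug)]
struct PublicIp {
    ip: String,
    contract_id: u64, // 0 = free
}

#[derive(Debug)]
struct Farm {
    id: u32,
    public_ips: Vec<PublicIp>,
}

/// Collect (farm id, ip) pairs whose reservation points at a contract that no
/// longer exists. These would be the candidates for force_reset_farm_ip.
fn dangling_public_ips(farms: &[Farm], existing_contracts: &HashSet<u64>) -> Vec<(u32, String)> {
    let mut dangling = Vec::new();
    for farm in farms {
        for entry in &farm.public_ips {
            if entry.contract_id != 0 && !existing_contracts.contains(&entry.contract_id) {
                dangling.push((farm.id, entry.ip.clone()));
            }
        }
    }
    dangling
}

fn main() {
    let farms = vec![Farm {
        id: 1,
        public_ips: vec![
            PublicIp { ip: "185.69.166.147/24".into(), contract_id: 15946 },
            PublicIp { ip: "185.69.166.157/24".into(), contract_id: 16178 },
        ],
    }];
    // Neither contract exists any more, so both IPs show up as dangling.
    let existing: HashSet<u64> = HashSet::new();
    println!("{:?}", dangling_public_ips(&farms, &existing));
}
```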
Update: A ticket was submitted for ops to handle freeing these two public IPs.
Verified!