[validators] confirmation of correct flags and procedures

Open coesensbert opened this issue 1 year ago • 9 comments

It's been a long time since we added or removed validators on tfchain, for any net. Our docs and procedures are probably outdated. That was definitely the case for validator keys, but this is now resolved here: https://docs.grid.tf/threefold/itenv_threefold_main/src/branch/master/grid_operations/grid_tfchain#re-inserting-re-setting-session-aura-gran-keys-to-same-as-controller-account

These are some of our old docs on adding/removing validators:

  • https://docs.grid.tf/threefold/itenv_threefold_main/src/branch/master/kubernetes_clusters/hagrid-prod2/applications/tfchainmainnet/Adding-validators.md
  • https://docs.grid.tf/threefold/itenv_threefold_main/src/branch/master/kubernetes_clusters/hagrid-prod2/applications/tfchainmainnet/Removing-validators.md

This is related to:

  • https://github.com/threefoldtech/grid_deployment/issues/52
  • https://github.com/threefoldtech/grid_deployment/tree/development/tfchain-validator/mainnet

Can dev confirm:

  • are these procedures still valid? If not we should make new ones and test
  • are these flags correct for a validator node? -> https://github.com/threefoldtech/grid_deployment/blob/development/tfchain-validator/mainnet/docker-compose.yml#L25-L44

coesensbert avatar Jun 13 '24 12:06 coesensbert

@sameh-farouk any news on this? Do you need more info from @coesensbert?

Thanks!

mik-tf avatar Aug 16 '24 15:08 mik-tf

are these procedures still valid? If not we should make new ones and test

The procedures for adding a new validator remain unchanged. However, the referenced documentation is inaccurate. Using the author_rotateKeys RPC call is a simpler alternative to generating the key with subkey generate and inserting it into the node’s keystore with key insert. Executing both sequentially is incorrect.
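For illustration, the author_rotateKeys route could look like the minimal Python sketch below. It assumes the validator's RPC endpoint is reachable at http://127.0.0.1:9944 (the port is an assumption, adjust it to your setup) and that unsafe RPC methods are allowed from localhost. The helper names are hypothetical. The call makes the node generate fresh session keys in its own keystore and returns the hex-encoded public keys, which is why a separate subkey generate plus key insert step is not needed.

```python
import json
import urllib.request

# Assumption: the node's RPC endpoint listens on localhost, port 9944.
RPC_URL = "http://127.0.0.1:9944"


def rotate_keys_request(request_id: int = 1) -> dict:
    """Build the JSON-RPC body for author_rotateKeys (it takes no params)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "author_rotateKeys",
        "params": [],
    }


def rotate_keys(url: str = RPC_URL) -> str:
    """Ask the node for new session keys; return the hex-encoded public keys."""
    body = json.dumps(rotate_keys_request()).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        reply = json.load(resp)
    if "error" in reply:
        raise RuntimeError(reply["error"])
    return reply["result"]
```

Calling rotate_keys() on the validator host would return the string to use when setting the session keys on chain.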

Also, adjustments are needed where the documentation states that the sudo module is required. The Council module should be used instead.

I will review the docs here and test the flow. I'll ensure it's revised and simplified, so you can update ops documentation accordingly.

are these flags correct for a validator node? -> https://github.com/threefoldtech/grid_deployment/blob/development/tfchain-validator/mainnet/docker-compose.yml#L25-L44

Here are my comments regarding the mentioned flags:

  • Regarding the flags, as I previously mentioned here, there’s no need to use archive mode. Instead, use --state-pruning 1000 --blocks-pruning archive for optimal storage usage.

  • --rpc-cors all is unnecessary since RPC is only listening on localhost and should be omitted.

  • Starting with TFchain 2.8.0, --bootnodes xxx can also be omitted as all bootnodes are embedded in the chain-spec.
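The three suggestions above can be sketched as a small Python helper that rewrites an existing argument list. This is only an illustration: the function name and the sample arguments are hypothetical, and it assumes the old setup selected archive mode with a --pruning or --state-pruning flag.

```python
# Flags that are removed together with any values that follow them.
# Assumption: archive mode was previously selected via one of the pruning flags.
DROPPED = {"--rpc-cors", "--bootnodes", "--pruning", "--state-pruning", "--blocks-pruning"}


def apply_recommendations(args: list[str]) -> list[str]:
    """Return a copy of the node args with the suggested flag changes applied."""
    out = []
    dropping = False  # True while skipping the values of a dropped flag
    for arg in args:
        if arg.startswith("-"):
            name = arg.split("=", 1)[0]  # handles both "--flag value" and "--flag=value"
            dropping = name in DROPPED
            if dropping:
                continue
        elif dropping:
            continue  # a value that belongs to a dropped flag
        out.append(arg)
    # Keep all blocks, but only the state of the most recent 1000 blocks.
    return out + ["--state-pruning", "1000", "--blocks-pruning", "archive"]
```

For example, passing `["--validator", "--pruning", "archive", "--rpc-cors", "all"]` yields `["--validator", "--state-pruning", "1000", "--blocks-pruning", "archive"]`.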

sameh-farouk avatar Sep 01 '24 11:09 sameh-farouk

Great, once the flow is tested and the docs are updated I can continue finishing the validator for the guardian stack. Thanks for the flag suggestions, resolved: https://github.com/threefoldtech/grid_deployment/commit/e4de06b7c9ece11da35c57805e9e843995174485

  • Regarding the flags, as I previously mentioned here, there’s no need to use archive mode. Instead, use --state-pruning 1000 --blocks-pruning archive for optimal storage usage.

We use the tfchain public RPC snapshot data to speed up a validator syncing with the chain. This snapshot is generated with a node with these flags: https://github.com/threefoldtech/grid_deployment/blob/development/grid-snapshots/devnet/docker-compose.yml#L10-L45 Can we still use these snapshots if we apply the different pruning flags? https://bknd.snapshot.grid.tf/

coesensbert avatar Sep 03 '24 14:09 coesensbert

Can we still use these snapshots if we apply the different pruning flags?

No, they won't be compatible. This is why I recommended building two types of snapshots: one containing the entire chain, and another that keeps state for only the most recent 1000 blocks. Please note that changing state-pruning requires purging the database and syncing from scratch.

sameh-farouk avatar Sep 03 '24 14:09 sameh-farouk

Can we still use these snapshots if we apply the different pruning flags?

No, they won't be compatible. This is why I recommended building two types of snapshots: one containing the entire chain, and another that keeps state for only the most recent 1000 blocks. Please note that changing state-pruning requires purging the database and syncing from scratch.

Successfully synced a devnet node from block 0 with the new pruning flags. It took about 17h on an i5-12500 with NVMe SSDs. Stored data size is around 13GB, while a public RPC node has 110GB. So that's good, we can lower the storage requirements by a lot. I need to do the same for mainnet to get the size there.

While it does seem obvious to have both snapshot types, this would require 4 new nodes to create the snapshots and more maintenance for ops. Since validators will only be added for mainnet, does it make sense to only have a snapshot creator for mainnet in this case?

coesensbert avatar Sep 10 '24 15:09 coesensbert

Took about 17h

Nice work! What is the bandwidth of the machine? Curious to know. Is the bottleneck at the network or the disk speed?

Since validators will only be added for mainnet, does it make sense to only have a snapshot creator for mainnet in this case?

Excellent question. IMO, I agree with you here: we can go with only a mainnet snapshot for now. It can be discussed with the team in the following days. I will let you know if I have more info on my end.

mik-tf avatar Sep 11 '24 03:09 mik-tf

does it make sense to only have a snapshot creator for mainnet in this case?

As you already know, snapshots are primarily used to speed up syncing new nodes when necessary, whether for adding new validators or migrating them to another machine. This procedure is not mandatory and is actually advised against in some cases due to security considerations.

From a development perspective, I have no advice here. It's better to check with team leads regarding the trade-offs you want to make. Time could be more precious in some instances.

But I have a question: why do we need an extra node for snapshot creation? Couldn't we just utilize one of the boot nodes for that as well?

sameh-farouk avatar Sep 11 '24 14:09 sameh-farouk

@sabrinasadik is checking this with @coesensbert in a couple of days (after September 19). This issue will then be updated.

mik-tf avatar Sep 12 '24 14:09 mik-tf

Update: See the PR with the updated docs https://github.com/threefoldtech/tfchain/pull/1007

sameh-farouk avatar Sep 19 '24 21:09 sameh-farouk

can be closed, thx

coesensbert avatar Oct 18 '24 09:10 coesensbert