pocket-core LeanPOKT

For transparency sakes:

This linked PR outlines the comments and conversations during the time in which LeanPokt was closed source. To read more about the reason behind closed source, please read Core team's comments on why it was closed source.

Tendermint Branch

As part of this PR, we've made modifications to the forked tendermint, you can view it here

Disclaimer:

This PR has only been peer reviewed and is not Q/A tested yet. It is based off the V0.9 integration branch, and contains a lot of untested features that are not leanpokt related (PIP-22, max block size changes, etc). It should not be used in a production fleet. This should be only be used for testing/documentation references purposes until it is properly Q/A tested. If you do run in production, do know you are assuming the risks of:

Missing rewards
Causing chain halt
Disrupting servicing quality

If you are looking for LeanPOKT that is based off v0.8.2 for testing, please visit: https://github.com/fiagdao/leanpocket/tree/leanpocket-v0-copy

Name: ThunderHead (Pierre Spiegel & Addison Spiegel) | BaaS Pools LLC dba PoktFund (Erick Ho & Tommy Ho)

Design Specification https://github.com/pokt-network/pocket-core/issues/1437

Please explain your change in detail. LeanPocket is a large optimization to the Pocket Core’s Client (PCC) by allowing multiple nodes to utilize one full node. Servicers/Validators (node set) can now leverage the same state cache, and blockchain data, and no longer have to validate as many transactions in a block as the node set grows. This reduces the number of resources needed for “n” nodes to a constant number, O(N) to O(1) for memory, space, io, and networking. Read more here

Functional Goals

Add multiple servicers support under a PCC to handle relays
Add multiple validators' support under a PCC to participate in voting rounds and block proposing.
Add additional Prometheus metrics for the node set

Modifications made to achieve aforementioned goals

Introduction of the PocketNode struct to represent one node under PCC and a Global Map of nodes that maps a address to PocketNode for each incoming each relay
Relay logic changed to chose a specific node based off incoming relay, using the PocketNode struct
Merged in Feather Tendermint code with slight modifications and bug fixes, found here
Prometheus metrics are now being set with a validator_address tag
Add backwards compabaility of legacy Tendermint files when lean pocket is disabled
Add metric to keep track of the average time to generate the necessary data to generate a claim and proof
Added a less expensive relay call by modifying tendermint to expose IsCatchingUp without overhead of loading all validators

Misc changes

Removed some potential lock contention by introducing a RWLock
Fixed #1428 by submitting the correct UPoktEarned value

Design decisions

This proposal aims to be backward compatible with the current latest RC on the mainnet, keeping the same exact functionality if node runners decide to do so. Therefore, instead of completely modifying the relay and validator logic directly, we propose that we add additional functionality to both validators and services that are controlled through a feature flag in Pocket Config called lean_pocket (bool)
We kept a siloed in-memory session cache & evidence cache/DB for each node to prevent any lock contentions or observed behaviors
All additional functionality will be prefixed or suffixed with lean and can later be removed or integrated completely in future RCs.

Quick User Guide

Create a file inside your Pocket data dir (ex: /home/user/.pocket) called lean_nodes_keys.json and format the file to look like a JSON array of private keys as follows: [ { "priv_key": "6f641136790803051ea5b944113434017560d10e10a1b319b6a62146bc83ee361b5d51ada3a55ed716a6e614c4a1459f19ef838cf8e3b20d7c90553d27c9fb61" } ]
Set/Update your validators by running pocket accounts set-validators <path to lean_nodes_keys.json>
Add /modify lean_pocket to be true to your pocket configuration json file
Start Pocket Core 🚀

Suggestions/Tips:

Use --forceSetValidators as part of pocket start argument to ensure that your validators are updated on each pocket start. If you use this start argument, you don't have to use set-validators every-time you update your lean_node_keys
Do not use -- useCache as part of pocket start argument as it can delay restarting your full node

Testing

Added cache unit tests
Added relay with block production unit tests
Added Pocket node unit tests

Linked Issues #1428

Jul 29 '22 00:07 nodiesBlade

Please find the entire thread on the initial review: https://github.com/fiagdao/leanpocket/pull/4

Aug 01 '22 15:08 luyzdeleon

@PoktBlade please rebase

Aug 11 '22 16:08 oten91

Added two changes after rebasing:

Pocket update config cli command now defaults lean pocket to false (just incase)
Instead of implicit handling duplicate validator keys, we throw an error that there is a duplicate. As well, by doing so, we can preserve the original array so it is not shuffled around when setting the first peer. (Previously we used a map, which doesn't guarantee order)

Aug 15 '22 20:08 nodiesBlade

@PoktBlade Please check the comments, I'm not able to execute go build due to some problem on go.mod here and on the tendermint side. Let me know when resolved

Aug 22 '22 17:08 oten91

Hey @PoktBlade saw your new commit, now it builds but the go mod is not replacing with latest version of your tendermint fork (commit 88e7f9cae2721d471a77523ecba24f088466708a from fiagdao:leanpokt-v0.9.1). Let me if you need help with this.

Aug 24 '22 16:08 oten91

@oten91 Sorry, I should've updated. I had to pause and never finished the updates. I am almost sure I cannot have pokt-network as the module name for tendermint given mines lie in a different repository, but if you think otherwise, let me know. I will ping once I make the changes to at least run it

Aug 24 '22 16:08 nodiesBlade

All changes made. I couldn't make the tendermint fork module be named pokt-network though. Might be a fork speciifc thing.

Aug 24 '22 20:08 nodiesBlade

All changes made. I couldn't make the tendermint fork module be named pokt-network though. Might be a fork speciifc thing.

Don't Worry I handled it on an intermediary branch

Aug 24 '22 21:08 oten91

All changes made. I couldn't make the tendermint fork module be named pokt-network though. Might be a fork speciifc thing.

Don't Worry I handled it on an intermediary branch

Btw, I realized my pushes never went through for the resolved changes (it went to the wrong repo). It should be pushed now @oten91

Aug 25 '22 09:08 nodiesBlade

Hey @PoktBlade overall code looks good, passes the initial tests I'm running, but also I have already identified some bottlenecks due to the number of keys used (already started optimizing some on the tendermint side). As these are just performance optimization I'll be passing the torch to the QA team to fully test this build once merged to the integration branch.

@luyzdeleon Before merging, from the original spec I'm checking the Success Criteria and noticing some deliverables are missing, are you going to be delivering them later before the release/ maybe a different PR or whats the strategy?

I'm specially curious about the documentation part as we have some of it, like the quick user guide already on the description of this PR, but that might be forgotten as soon as the merge is done.

Success Criteria

[ ] A detailed report of the QA executed (as you well detailed in Monitoring/Testing strategy)
[x] Demonstrable benchmarks in relay processing speed, cross-checked with resource management indicating sustainable operational levels on both.
[x] End-To-End documentation: a finished version of this spec and guides on how to use the feature. (To be delivered later before the release #1469)

Aug 25 '22 15:08 oten91

@oten91 In terms of performance metrics:

Here's what we've seen with EPYC 7502P with 206 servicers in mainnet. Barely scratching 8% with 206 servicers. Thanks to the redundancy and abdundance of the network servicers (24 sessions), one servicer on mainnet averages around 11-20RPS unknown unknown-1

(we also have baselines for 1:50 - will get back to you on this)

It was important to us that one process would be able to handle as many relays as it can, avoiding lock contentions if it could be done easily. I also did some pprof benchmarking to ensure that relays are performant and even optimized this case to make it ~70% more performant by optimizing PCC to not load all validators into a struct per relay. This was due to the Tendermint status rpc method would return marshaled validators into the response, and the relay method used this function. So Tendermint was modified.

Before optimization unknown

After optimization unknown-1 We've also conducted multiple locusts benchmarking tests to ensure that relays are stays performant and have been able to saturate 1000 rps with 100 servicers with the above optimization (bottlenecked on our machine really) unknown Tying that back to how the mainnet performs, since there is only 10-20RPS per servicer mainnet, PCC in theory should support much more. Though, there are more factors to tie in such as lock contention, open files, etc.

Additionally, at one point I also modified pocket core to accept relays from "fake" apps on whitelisted chains for a brief period of time to ensure that relays per seconds were sustainable. However, I only conducted this a couple of times because it was cumbersome.

A user guide PR is in works on the feature and baseline suggestions. Overall, we don't support the idea of stacking too many servicers onto one node even though PCC supports it, and if they do, it is recommended they define their risk tolerance and suggest architecting their infrastructure in such a way that it has failover mechanisms to be fault tolerant. We also aim to lean on the side of not being granular machine specifications, given how many different combinations of specs (cpu architecture, drive type, etc) can perform and due to network parameters (nodes in session). But rather, we will provide baseline suggestions off our testing and let the community take it from there, just like how Pocket Core has evolved in terms of specifications.

Let me know if this is sufficent @oten91

Aug 25 '22 17:08 nodiesBlade

Hey @PoktBlade i would say it is sufficient for the performance and benchmarking part and I'm creating an issue to track the pending documentation task as part of the release, as we have new commands on the account namespace that need to be documented, so will say thats 2 / 3. For the QA part, more evidence is required, after looking at the functional tests one of the first things I tried to do was execute set-validators and starting with a malformed key file and got an unhandled panic.

This is a simple fix and I can fix it after merging but also uncovers that scenarios like this one might not be fully explored and tested, hence the need of some evidence of basic QA execution to share with the QA team, so they can have a good glimpse of the testing strategy from the developers.

More on the edge case found:

using a lean_nodes_key.json malformed like this :

[]

I got no error on the set-validators command but a panic on start.

pocket-core-testnet0_1  | 2022/08/25 17:51:07 Initializing Pocket Datadir
pocket-core-testnet0_1  | 2022/08/25 17:51:07 datadir = /home/app/.pocket
pocket-core-testnet0_1  | panic: runtime error: index out of range [0] with length 0
pocket-core-testnet0_1  |
pocket-core-testnet0_1  | goroutine 1 [running]:
pocket-core-testnet0_1  | github.com/pokt-network/pocket-core/app.SetValidatorsFilesLean({0x0?, 0x26?, 0xf6d598?})
pocket-core-testnet0_1  | 	/go/src/github.com/pokt-network/pocket-core/app/config.go:945 +0x84
pocket-core-testnet0_1  | github.com/pokt-network/pocket-core/app.InitApp({0x0?, 0x94610c?}, {0x0?, 0x9637e4?}, {0x0?, 0x4000c16a80?}, {0xffffd2124ddf?, 0x946588?}, {0x0?, 0x4000585ef0?}, ...)
pocket-core-testnet0_1  | 	/go/src/github.com/pokt-network/pocket-core/app/config.go:103 +0x2a8
pocket-core-testnet0_1  | github.com/pokt-network/pocket-core/app/cmd/cli.start(0xc0ba0b82c2cace6c?, {0x39ba645?, 0x16d4340?, 0x0?})
pocket-core-testnet0_1  | 	/go/src/github.com/pokt-network/pocket-core/app/cmd/cli/root.go:98 +0x160
pocket-core-testnet0_1  | github.com/pokt-network/pocket-core/app/cmd/cli.glob..func56(0x16c23e0?, {0x4000c382a0, 0x0, 0x3})
pocket-core-testnet0_1  | 	/go/src/github.com/pokt-network/pocket-core/app/cmd/cli/root.go:82 +0xf4
pocket-core-testnet0_1  | github.com/spf13/cobra.(*Command).execute(0x16c23e0, {0x4000c38240, 0x3, 0x3})
pocket-core-testnet0_1  | 	/go/pkg/mod/github.com/spf13/[email protected]/command.go:860 +0x4c4
pocket-core-testnet0_1  | github.com/spf13/cobra.(*Command).ExecuteC(0x16c2160)
pocket-core-testnet0_1  | 	/go/pkg/mod/github.com/spf13/[email protected]/command.go:974 +0x360
pocket-core-testnet0_1  | github.com/spf13/cobra.(*Command).Execute(...)
pocket-core-testnet0_1  | 	/go/pkg/mod/github.com/spf13/[email protected]/command.go:902
pocket-core-testnet0_1  | github.com/pokt-network/pocket-core/app/cmd/cli.Execute()
pocket-core-testnet0_1  | 	/go/src/github.com/pokt-network/pocket-core/app/cmd/cli/root.go:45 +0x2c
pocket-core-testnet0_1  | main.main()
pocket-core-testnet0_1  | 	/go/src/github.com/pokt-network/pocket-core/app/cmd/pocket_core/main.go:8 +0x20

We might be relying on the file being correct from the get go after the set-validators and no double checking on start and probably in other spots during execution.

Aug 25 '22 18:08 oten91

To be clear, I'm not blocking the PR due to this simple edge case. I would like to see some basic QA materials and methodology before I can approve and move it forward. While there are not strict requirements for these materials, it's important that we have more evidence than mainnet benchmarking so we can avoid back and forth as much as possible on the QA phase.

Aug 25 '22 18:08 oten91

To be clear, I'm not blocking the PR due to this simple edge case. I would like to see some basic QA materials and methodology before I can approve and move it forward. While there are no strict requirements for these materials, it's important that we have more evidence than mainnet benchmarking so we can avoid back and forth as much as possible on the QA phase.

No problem understood. I think the current Q/A tests could account for what you did. I would like to minimize the amount of ping-ponging too. I would say my expertise is not on the QA side as much. Are you asking that we go through the written functional tests ourselves and run it? I am not sure what the ask is aside from fixing and adding ur scenario (if not encapsulated already)

Aug 25 '22 19:08 nodiesBlade

To be clear, I'm not blocking the PR due to this simple edge case. I would like to see some basic QA materials and methodology before I can approve and move it forward. While there are no strict requirements for these materials, it's important that we have more evidence than mainnet benchmarking so we can avoid back and forth as much as possible on the QA phase.

No problem understood. I think the current Q/A tests could account for what you did. I would like to minimize the amount of ping-ponging too. I would say my expertise is not on the QA side as much. Are you asking that we go through the written functional tests ourselves and run it? I am not sure what the ask is aside from fixing and adding ur scenario (if not encapsulated already)

To keep it simple i think thats a good way to do it, just go over each scenario described in lean-pocket-cli.feature (only this one as the others depend on the consensus/protocol rules and we would like to concentrate on things that can vary due to user input) and try to confirm scenarios have the expected results under different inputs.

Using the previous case as an example the definition of malformed can be ambiguous and that property uncovered a potential edge case / bug for 2 scenarios :

  Scenario: Lean Pocket is enabled but keyfile is malformed
    Given that the user has Pocket Network latest version installed.
    And user wants to import a list of keys
    When typing 'pocket accounts set-validators <path_to_list_of_keys>'
    Then user should see a failure message: "Failed to read validators json file <error>"

and

  Scenario: lean_pocket enabled And --forceSetValidators start arguments used but keyfile is malformed
    Given that the user has Pocket Network latest version installed.
    And user wants to import a list of keys
    When starting pocket core
    Then the node should log a failure message and stop
    Example:
      Failed to read validators json file <error>

Both failed for the edge case as the error was not handled when the file just has an empty array and instead returned a panic for the second one and no error for the first one. Empty Array [ ] should be part of the definition of malformed and properly handled or have its own special scenario and handling to avoid bugs.

On the current workflow the QA team will return this back and this should trigger either a change in code or a change in the scenarios definition or maybe the creation of more specialized scenarios for handling these cases that may arise and also changes in the code, all at the same time.

TLDR: a basic report of QA executed should contains a definition of what you are testing (scenario definition) and a proof of execution of the scenario (can be a screenshot or copy-paste the console input/output) with the expected results under different inputs if needed to validate the scenario. Do this for scenarios described in lean-pocket-cli.feature that depends on user input and we should be good to go.

For example on the edge case reported [ ] was not handled and fails but [ { } ] was handled as it should. A report from the execution of the scenario X returning an screenshot of the expected result ("Then user should see a failure message: "Failed to read validators json file <error>"") for both cases, validates that scenario and shows that the working code and the scenario are aligned to each other.

Aug 25 '22 23:08 oten91

Hey @oten91. Can see this document for the requested CLI QA completion and screenshots.

Aug 29 '22 00:08 addiaddiaddi

This pull request has been mentioned on Pocket Network Forum. There might be relevant details there:

https://forum.pokt.network/t/pup-26-mitigations-for-validator-downtime/3612/1

Oct 21 '22 18:10 POKT-Discourse

This pull request has been mentioned on Pocket Network Forum. There might be relevant details there:

https://forum.pokt.network/t/preproposal-poktfund-2022-leanpokt-and-security-vulnerability-reimbursement/3933/1

Feb 07 '23 09:02 POKT-Discourse

This pull request has been mentioned on Pocket Network Forum. There might be relevant details there:

https://forum.pokt.network/t/preproposal-poktfund-2022-leanpokt-and-security-vulnerability-reimbursement/3933/4

Feb 07 '23 17:02 POKT-Discourse

This pull request has been mentioned on Pocket Network Forum. There might be relevant details there:

https://forum.pokt.network/t/thunderhead-poktfund-leanpokt-proposal-reimbursement/4069/1

Mar 15 '23 19:03 POKT-Discourse

pocket-core pocket-core copied to clipboard

LeanPOKT

For transparency sakes:

Tendermint Branch

Disclaimer:

Name: ThunderHead (Pierre Spiegel & Addison Spiegel) | BaaS Pools LLC dba PoktFund (Erick Ho & Tommy Ho)

pocket-core
pocket-core copied to clipboard