F3 Network Metrics for Performance and Resilience
Open Grant Proposal: F3 Network Metrics for Performance and Resilience
Project Name: F3 Network Metrics for Performance and Resilience
Proposal Category: Developer and data tooling / Research & protocols
Individual or Entity Name: ProbeLab Analytics OÜ (https://probelab.io/)
Proposer: @yiannisbot
Project Repo(s)
- GH Org: https://github.com/orgs/probe-lab
- Other tools used:
- https://github.com/probe-lab/hermes
- https://github.com/probe-lab/tracecatcher
- https://github.com/probe-lab/ants-watch
- https://github.com/probe-lab/caracol
(Optional) Filecoin ecosystem affiliations:
The ProbeLab team was part of Protocol Labs until January 2024.
(Optional) Technical Sponsor:
@masih (FilOz) @smagdali (FF)
Do you agree to open source all work you do on behalf of this RFP under the MIT/Apache-2 dual-license?: Yes.
Project Summary
After years in development, Filecoin Fast Finality (F3) has been merged and is live in the Filecoin network. An improvement of this scale calls for a detailed set of metrics to make sure that the new finality mechanism works as expected, does not cause abnormal behaviour either in steady state or during upgrades, and works well alongside the traditional Expected Consensus (EC) finality.
The ProbeLab team has extensive experience in building performance monitoring tools for Web3 systems and is already producing a base set of metrics for Filecoin, namely Weekly Network Health Reports, [Bootstrapper Uptime Monitoring](https://probelab.io/filecoin/bootstrappers/) and DHT Keyspace Density, as well as more detailed metrics for Gossipsub, in particular Bandwidth Usage, Control Messages and Message Duplicates.
This project proposes the development of monitoring tooling for F3. The target is to develop tooling to monitor and publish a set of metrics for the most important concepts behind F3. In addition and in parallel to the development of these metrics, the team will study in detail the protocol configuration and make recommendations for performance improvements, where possible.
Impact
F3 is the biggest upgrade in Filecoin’s history and is a long-lived upgrade. It will always be part of the Filecoin network, and businesses building on fast finality will heavily rely on it. Therefore, there is a need for long-term monitoring of F3 itself as part of the overall network health.
The FilOz team is keeping track of the chain’s basic metrics, but a lot more is needed to guarantee the long-term healthy operation of F3 and the Filecoin chain. F3’s performance heavily relies on the participation of nodes in the network. Therefore, we need to proactively monitor node liveness and participation and take decisive action to rectify it, when needed. For this, accurate and continuous measurements are required. We are going to start from the following metrics that will give us an accurate view of the network and will help assess the chain health:
- Time it takes for an F3 instance to finalize.
- Distance of the latest F3 finalized tipset from the current chain head
- Nodes participating in F3 relative to the network size, broken down by agent version. F3 only functions if at least 2/3 of the network participates. The committee in F3 consensus is the entire network. Therefore, it is crucial to proactively monitor the participation breakdown by agent, so that action can be taken to restore F3 progress should participation drop, e.g. due to misaligned incentives, high bandwidth usage, or effect on block rewards. This project’s results (i.e., continuous measurement and alerting) will inform F3’s engineering teams to take action, e.g., produce agent-specific patch releases for popular clients.
- Finality certificate exchange metrics:
- On average how caught up are nodes in the network relative to the latest available finality cert
- How responsive are nodes in pulling certificates from each other.
- This can be measured by randomly sampling nodes on the network and hitting cert exchange APIs.
- Gossipsub efficiency in terms of bandwidth consumption, starting from number of duplicate messages and extending to other metrics. These metrics are needed for a number of reasons:
- they will not only be valuable for the operation of F3 itself, but also for all other applications that are using Gossipsub in the context of Filecoin.
- With F3 live on the Filecoin mainnet, Gossipsub is being stretched as it is used for a far larger volume than it has ever handled since Filecoin mainnet launch. As things stand (i.e., no visibility into F3’s use of Gossipsub), we do not fully understand the long-term implications of this usage of Gossipsub and neither can we apply optimisations to relieve the system.
A lack of visibility into these metrics risks increased “forkiness” and, as a result, painful reorgs and a reputation hit for the Filecoin chain. For instance, F3 instances will not finish until at least 2/3 of the network participates. The participation and liveness of nodes is, therefore, a critical metric that has to be monitored continuously in real time and accompanied by the corresponding alerts. The project helps core devs monitor the health of the chain and spot potential imminent forks very early on.
Ultimately, the metrics that will be developed during this project will be important for any application building bridges or other DeFi products, as these metrics touch on several financial and legal implications. It is expected that the new wave of services on Filecoin will heavily rely on fast finality. Although there are graceful fallbacks to EC should F3 not perform, UX/DX would be affected significantly if fast finality suddenly stops performing or performs only sporadically.
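As an illustration of the participation check described above, the following is a minimal sketch and not the team's tooling. It assumes a hypothetical crawl result given as `(agent, participating)` pairs and counts nodes equally; the agent labels and function names are invented for the example, and a real measurement might weight nodes differently.

```python
from collections import Counter
from fractions import Fraction

# The 2/3 figure comes from the proposal text; everything else here
# (input shape, function names) is a hypothetical illustration.
THRESHOLD = Fraction(2, 3)


def participation_by_agent(nodes):
    """Return {agent: (participating, total)} for (agent, bool) pairs."""
    total = Counter()
    active = Counter()
    for agent, participating in nodes:
        total[agent] += 1
        if participating:
            active[agent] += 1
    return {agent: (active[agent], total[agent]) for agent in total}


def below_threshold(nodes, threshold=THRESHOLD):
    """True when the overall participating share is under the threshold."""
    nodes = list(nodes)
    if not nodes:
        # No observations at all is treated as an alert condition.
        return True
    participating = sum(1 for _, p in nodes if p)
    return Fraction(participating, len(nodes)) < threshold
```

An alerting job could call `below_threshold` on every crawl and use `participation_by_agent` to show which client is responsible for a drop.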
Outcomes
- Continuous production and publication of the above metrics at the desired frequency, ranging from near-real time to daily, but also keeping track of longer term trends (e.g., week-on-week), as needed.
- Long-lived view of the distribution of distance from head for F3 finalized tipsets. This will make it possible for an external party to see how fast F3 really is/was, going back 6 months.
- Relevant issue at: https://github.com/filecoin-project/go-f3/issues/948
- Review and auditing of the Gossipsub parameters used for F3. Recommendations for more appropriate settings, where needed.
- Compatibility between F3 and EC metrics and development of metrics to assess “chain forkiness”.
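The distance-from-head distribution mentioned in the outcomes can be summarised with a few percentiles. The sketch below is only an illustration under stated assumptions: the input is a hypothetical list of `(head_epoch, finalized_epoch)` samples, the percentile method is nearest-rank, and the conversion to seconds uses Filecoin's 30-second epoch.

```python
EPOCH_SECONDS = 30  # Filecoin epoch duration


def distance_summary(samples):
    """Summarise head-to-finalized distance for (head_epoch, finalized_epoch) pairs."""
    distances = sorted(head - finalized for head, finalized in samples)
    if not distances:
        raise ValueError("no samples")

    def percentile(p):
        # Nearest-rank percentile, computed with integer arithmetic.
        rank = max(1, -(-p * len(distances) // 100))
        return distances[rank - 1]

    median = percentile(50)
    return {
        "median_epochs": median,
        "p95_epochs": percentile(95),
        "max_epochs": distances[-1],
        "median_seconds": median * EPOCH_SECONDS,
    }
```

Running the same summary per day or per week would give the longer-term trend view, with the bucketing left to whatever storage the dashboards use.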
Adoption, Reach, and Growth Strategies
As mentioned previously, the metrics that will be developed during this project will be important for any application building bridges or other DeFi products, as these metrics touch on several financial and legal implications.
Development Roadmap
Milestone 1: Tooling adaptation for F3 compatibility and deployment of monitoring infrastructure
Effort: 1.5PM
Funding: $45k
Staff: 1.5 engineers
ETA: 1 month from project kickoff (max)
Description: We will adapt our monitoring tooling to be F3-compatible so that we can extract the right data and traces from the network. As part of this Milestone, we will also set up the corresponding infrastructure and scripts to have the adapted tools run continuously, as needed. The tools we envision will be useful are:
- Hermes: a lightweight node that maintains connections to network nodes, subscribes to topics and collects valuable traces.
- Nebula: a network crawler to identify F3 node population and statistics.
- ants-watch: a DHT client monitoring tool. It is able to log the activity of all nodes in a DHT network by carefully placing ants in the DHT keyspace.
- Akai: a generic data availability sampler - we will likely use it for measuring certificate exchange metrics
Milestone 2: Metrics for F3
Effort: 2PM
Funding: $60k
Staff: 2 engineers
ETA: 1-1.5 months after completion of Milestone 1
Description: We will use the monitoring tools adapted in Milestone 1 to track the metrics listed under the “Impact” section (listed again below). We will use a variety of techniques for this purpose (e.g., listening in to DECIDE messages on gpbft pubsub topic, listening in to the stream of finality certs and counting the proportion of nodes that have signed a cert), and we’ll adapt the methodology as needed.
- Time it takes for an F3 instance to finish.
- Distance of the latest finalized tipset from the chain head.
- Nodes participating in F3 relative to the network size.
- Finality certificate exchange metrics:
- On average how caught up are nodes in the network relative to the latest available finality cert.
- How responsive nodes are in pulling certificates from each other.
- This can be measured by randomly sampling nodes on the network and hitting cert exchange APIs.
- Gossipsub efficiency in terms of bandwidth consumption, starting from number of duplicate messages.
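For the certificate exchange metrics, one possible aggregation of a sampling round is sketched below. It assumes a hypothetical mapping from peer ID to the highest finality certificate instance that peer returned, with `None` for peers that did not answer; how the sample is collected (e.g. which cert exchange API is queried) is outside the sketch.

```python
def cert_lag_report(latest_instance, sampled):
    """Summarise how far sampled peers trail the latest known certificate.

    sampled maps a peer ID to the highest instance that peer served,
    or None when the peer did not respond (hypothetical input shape).
    """
    # Clamp at zero in case a peer is ahead of the observer's own view.
    lags = [
        max(0, latest_instance - instance)
        for instance in sampled.values()
        if instance is not None
    ]
    unresponsive = sum(1 for instance in sampled.values() if instance is None)
    return {
        "responding": len(lags),
        "unresponsive": unresponsive,
        "mean_lag": sum(lags) / len(lags) if lags else None,
        "caught_up_share": (
            sum(1 for lag in lags if lag == 0) / len(lags) if lags else None
        ),
    }
```

Here `mean_lag` addresses how caught up nodes are on average, while the `unresponsive` count gives a rough signal for responsiveness.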
Milestone 3: Audit of Gossipsub Parameters & Final Report
Effort: 0.5PM
Funding: $15k
Staff: 1 engineer
ETA: 2 weeks after completion of Milestone 2
Description: Gossipsub is a complex protocol. It includes several interconnected functions that, when configured properly, can boost the performance of the network and protect it from undesired behaviour. However, the optimal settings and parameters of the protocol differ depending on the characteristics of the network.
In this Milestone, we’ll dive into the setup of the F3 network. We will review and audit the hardcoded Gossipsub parameters for the particular Lotus F3 setup and monitor its performance with regard to the number of duplicate messages, the bandwidth requirement of F3 nodes and their behaviour as “citizens” in the Gossipsub mesh. We will make recommendations regarding the optimal settings, based on the results we observe. Optionally, we will set up the right infrastructure to continuously monitor and publish metrics on the performance of Gossipsub in the F3 network.
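The duplicate-message metric referred to above can be expressed as the average number of duplicates received per delivered message. The following sketch assumes a hypothetical trace format of `(message_id, kind)` tuples, where `kind` is `"deliver"` or `"duplicate"`; the actual traces collected by the monitoring tools may be shaped differently.

```python
from collections import Counter


def duplicates_per_message(events):
    """Average duplicates per delivered message for (message_id, kind) events."""
    delivered = set()
    duplicates = Counter()
    for message_id, kind in events:
        if kind == "deliver":
            delivered.add(message_id)
        elif kind == "duplicate":
            duplicates[message_id] += 1
    if not delivered:
        return 0.0
    # Duplicates of messages never seen as delivered are ignored here.
    return sum(duplicates[m] for m in delivered) / len(delivered)
```

Comparing this figure before and after a parameter change is one simple way to judge whether a recommended setting reduces redundant traffic.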
Milestone 4: Maintenance, Infra costs and metrics for 1 year
Effort: 0.5PM
Funding: $15k
Staff: N/A - see description
ETA: N/A - see description
Description: This Milestone includes the continuous monitoring and production of plots and dashboards for 1 year. Work includes the maintenance and updating of tooling, but also the infrastructure used to run the tools, together with the costs involved. With funding from this Milestone, we commit to maintain and produce results for 1 year after the completion of the project. A separate contract will be needed after the end of that period in order to continue producing results for a second year (and beyond).
Total Budget Requested
| Milestone # | Description | Deliverables | Completion Date | Funding |
| --- | --- | --- | --- | --- |
| Milestone 1 | Tooling adaptation | Repository | 1 month (max) after kick off | $45k |
| Milestone 2 | Metrics for F3 | Plots/Dashboards | 2-2.5 months after kick off | $60k |
| Milestone 3 | Gossipsub Audit | Report | 3 months after kick off | $15k |
| Milestone 4 | Metrics for 1 year | Plots/Dashboards | 1 year | $15k |
Total budget requested: $135k
Maintenance and Upgrade Plans
Maintenance, upgrade and infrastructure costs are included in Milestone 4 above.
Team
Team Members
- Team Member 1: Yiannis Psaras, @yiannisbot (Team Lead)
- Team Member 2: Dennis Trautwein, @dennis-tra (Software Engineer)
- Team Member 3: Mikel Cortes, @cortze (Software Engineer)
- Team Member 4: Steph Samson, @kasteph (Infrastructure Engineer)
Team Member LinkedIn Profiles
- Yiannis Psaras LinkedIn profile
- Dennis Trautwein LinkedIn profile
- Mikel Cortes LinkedIn profile
Team Website
- https://probelab.io: Results, plots and dashboards
- https://probelab.network: Team & Services
Relevant Experience
The ProbeLab team was part of Protocol Labs for multiple years (until January 2024) and has focused on monitoring and measurement studies for IPFS and libp2p-based networks throughout. The team has extensive experience in building tooling for monitoring and measurement, as well as the relevant infrastructure. Apart from the metrics and tools that the team maintains, which can be found at https://probelab.io/, the team has carried out detailed studies for both IPFS and libp2p. These studies can be found at: https://github.com/probe-lab/network-measurements/tree/master/results.
The team's expertise, set of clients and projects completed over the years put ProbeLab in a unique position with regard to the skillset and quality of results it can deliver. By way of illustration, the team has worked on the following networks: Ethereum, Gnosis, Celestia, Polkadot, Filecoin, Avail, IPFS, Base and Optimism.
Most recently, the team has run multiple projects to monitor the operation of Gossipsub in the Ethereum network. Here are a few sample reports that resulted from those projects:
- https://ethresear.ch/t/behind-the-scenes-of-ethereums-pectra-upgrade-a-data-driven-analysis/22665
- https://ethresear.ch/t/bandwidth-availability-in-ethereum-regional-differences-and-network-impacts/21138
- https://ethresear.ch/t/impact-of-idontwant-in-the-number-of-duplicates/22652/1
- https://ethresear.ch/t/gossipsub-network-dynamicity-through-grafts-and-prunes/19750
- https://ethresear.ch/t/gossip-iwant-ihave-effectiveness-in-ethereums-gossipsusb-network/19686
Team code repositories
- GH Org: https://github.com/orgs/probe-lab
- Other tools used:
- https://github.com/probe-lab/hermes
- https://github.com/probe-lab/tracecatcher
- https://github.com/probe-lab/ants-watch
- https://github.com/probe-lab/caracol
Additional Information
Contact email: [email protected]
For any further coordination/support from FilOz, please reach out to @BigLep.
Hi @yiannisbot, thank you for your proposal! We will be in touch with any questions or updates.
@FF-FOIT: this is a solid proposal from a team with a very solid track record. I believe this kind of monitoring, especially during its first year after activation, is critical for understanding the impacts of F3 on the network, and for giving groups confidence to rely on F3. Let me or @kubuxu know if any input is needed to the foundation about this work from the perspective of "F3 implementers at FilOz".
@yiannisbot Since we are connecting in other threads, I have closed this item. Looking forward to proceeding with your work for F3 Network Metrics!