F3 Network Metrics for Performance and Resilience

Open yiannisbot opened this issue 6 months ago • 3 comments

Open Grant Proposal: F3 Network Metrics for Performance and Resilience

Project Name: F3 Network Metrics for Performance and Resilience

Proposal Category: Developer and data tooling / Research & protocols

Individual or Entity Name: ProbeLab Analytics OÜ (https://probelab.io/)

Proposer: @yiannisbot

Project Repo(s)

  • GH Org: https://github.com/orgs/probe-lab
  • Other tools used:
    • https://github.com/probe-lab/hermes
    • https://github.com/probe-lab/tracecatcher
    • https://github.com/probe-lab/ants-watch
    • https://github.com/probe-lab/caracol

(Optional) Filecoin ecosystem affiliations:

The ProbeLab team was part of Protocol Labs until January 2024.

(Optional) Technical Sponsor:

@masih (FilOz) @smagdali (FF)

Do you agree to open source all work you do on behalf of this RFP under the MIT/Apache-2 dual-license?: Yes.

Project Summary

After years in development, Filecoin Fast Finality (F3) has been merged and is live in the Filecoin network. An improvement of this scale calls for a detailed set of metrics to verify that the new finality mechanism works as expected, does not cause abnormal behaviour either in steady state or during upgrades, and works well alongside the traditional Expected Consensus (EC) finality.

The ProbeLab team has extensive experience in building performance monitoring tools for Web3 systems and already produces a base set of metrics for Filecoin: Weekly Network Health Reports, Bootstrapper Uptime Monitoring (https://probelab.io/filecoin/bootstrappers/), and DHT Keyspace Density, as well as more detailed metrics for Gossipsub, in particular Bandwidth Usage, Control Messages, and Message Duplicates.

This project proposes the development of monitoring tooling for F3. The target is to develop tooling to monitor and publish a set of metrics for the most important concepts behind F3. In addition and in parallel to the development of these metrics, the team will study in detail the protocol configuration and make recommendations for performance improvements, where possible.

Impact

F3 is the biggest upgrade in Filecoin’s history and is a long-lived upgrade. It will always be part of the Filecoin network, and businesses building on fast finality will heavily rely on it. Therefore, there is a need for long-term monitoring of F3 itself as part of the overall network health.

The FilOz team is keeping track of the chain’s basic metrics, but a lot more is needed to guarantee the long-term healthy operation of F3 and the Filecoin chain. F3’s performance heavily relies on the participation of nodes in the network. Therefore, we need to proactively monitor node liveness and participation and take decisive action to rectify it, when needed. For this, accurate and continuous measurements are required. We are going to start from the following metrics that will give us an accurate view of the network and will help assess the chain health:

  • Time it takes for an F3 instance to finalize.
  • Distance of the latest F3 finalized tipset from the current chain head.
  • Nodes participating in F3 relative to the network size, broken down by agent version. F3 only functions if at least 2/3 of the network participate. The committee in F3 consensus is the entire network. Therefore, it is crucial to proactively monitor the participation breakdown by agent, so that action can be taken to restore F3 progress should there be a lack of participation, e.g. due to misaligned incentives, high bandwidth usage, or effect on block rewards. This project’s results (i.e., continuous measurement and alerting) will inform F3’s engineering teams so that they can take action, e.g., produce agent-specific patch releases for popular clients.
  • Finality certificate exchange metrics:
    1. How caught up nodes in the network are, on average, relative to the latest available finality cert.
    2. How responsive nodes are in pulling certificates from each other.
    3. This can be measured by randomly sampling nodes on the network and hitting cert exchange APIs.
  • Gossipsub efficiency in terms of bandwidth consumption, starting from number of duplicate messages and extending to other metrics. These metrics are needed for a number of reasons:
    1. They will be valuable not only for the operation of F3 itself, but also for all other applications that use Gossipsub in the context of Filecoin.
    2. With F3 live on the Filecoin mainnet, Gossipsub is being stretched, as it now carries a far larger volume than it has ever handled since Filecoin mainnet launch. As things stand (i.e., no visibility into F3’s use of Gossipsub), we do not fully understand the long-term implications of this usage of Gossipsub, nor can we apply optimisations to relieve the system.
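
The participation metric in the list above could be computed along the following lines. This is a minimal sketch, not ProbeLab's tooling: the `Node` record, its field names, and the idea of weighting each node by a `power` value are assumptions made for illustration. The proposal itself only states that at least 2/3 of the network must participate.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical crawl record; field names are illustrative only."""
    agent_version: str   # agent string reported by the node
    power: int           # assumed participation weight (use 1 for a head count)
    participating: bool  # whether the node was observed taking part in F3


def meets_two_thirds(participating: int, total: int) -> bool:
    """True if participating / total >= 2/3, using integer arithmetic only."""
    return total > 0 and 3 * participating >= 2 * total


def participation_by_agent(nodes):
    """Return {agent_version: (participating_weight, total_weight)}."""
    totals = defaultdict(lambda: [0, 0])
    for node in nodes:
        totals[node.agent_version][1] += node.power
        if node.participating:
            totals[node.agent_version][0] += node.power
    return {agent: (p, t) for agent, (p, t) in totals.items()}


def network_participation(nodes):
    """Return (participating_weight, total_weight) across all agents."""
    breakdown = participation_by_agent(nodes)
    participating = sum(p for p, _ in breakdown.values())
    total = sum(t for _, t in breakdown.values())
    return participating, total
```

The threshold check compares `3 * participating` with `2 * total` rather than dividing, so a network sitting exactly at 2/3 is not misclassified by floating-point rounding. An alert would fire when `meets_two_thirds(*network_participation(nodes))` is false, and the per-agent breakdown shows which client is responsible for the shortfall.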

Absence of visibility into these metrics risks increased “forkiness” and, as a result, painful reorgs and a reputation hit for the Filecoin chain. For instance, F3 instances will not finish until at least 2/3 of the network participates. The participation and liveness of nodes is, therefore, a critical metric that has to be monitored continuously in real time and accompanied by the corresponding alerts. The project helps core devs monitor the health of the chain and spot potential imminent forks very early on.

Ultimately, the metrics that will be developed during this project will be important for any application building bridges or other DeFi products, as these metrics touch on several financial and legal implications. It is expected that the new wave of services on Filecoin will heavily rely on fast finality. Although there are graceful fallbacks to EC should F3 not perform, UX/DX would be affected significantly if fast finality suddenly stops performing or performs only sporadically.

Outcomes

  • Continuous production and publication of the above metrics at the desired frequency, ranging from near-real time to daily, but also keeping track of longer term trends (e.g., week-on-week), as needed.
    • Long-lived view of the distribution of distance from head for F3 finalized tipsets. This will make it possible for an external party to see how fast F3 really is, or was, going back 6 months.
    • Relevant issue at: https://github.com/filecoin-project/go-f3/issues/948
  • Review and auditing of the Gossipsub parameters used for F3. Recommendations for more appropriate settings, where needed.
  • Compatibility between F3 and EC metrics and development of metrics to assess “chain forkiness”.
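
For the first outcome, one way to summarise the distribution of distance from head is with percentiles over a window of samples. The sketch below assumes each sample is a hypothetical `(head_epoch, finalized_epoch)` pair collected at a fixed interval; the sampling method and the choice of percentiles are illustrative, not part of the proposal.

```python
import statistics


def finality_distances(samples):
    """Distance in epochs between the chain head and the latest F3-finalized tipset."""
    distances = []
    for head_epoch, finalized_epoch in samples:
        if finalized_epoch > head_epoch:
            raise ValueError("finalized epoch cannot be ahead of the chain head")
        distances.append(head_epoch - finalized_epoch)
    return distances


def summarize(distances):
    """Return the median, 90th and 99th percentiles, and maximum distance."""
    if len(distances) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(distances, n=100, method="inclusive")
    return {
        "p50": statistics.median(distances),
        "p90": cuts[89],
        "p99": cuts[98],
        "max": max(distances),
    }
```

Storing such a summary per day or per week would give the long-lived view without keeping every raw sample.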

Adoption, Reach, and Growth Strategies

As mentioned previously, the metrics that will be developed during this project will be important for any application building bridges or other DeFi products, as these metrics touch on several financial and legal implications.

Development Roadmap

Milestone 1: Tooling adaptation for F3 compatibility and deployment of monitoring infrastructure

Effort: 1.5PM

Funding: $45k

Staff: 1.5 engineers

ETA: 1 month from project kickoff (max)

Description: We will adapt our monitoring tooling to be F3-compatible so that we can extract the right data and traces from the network. As part of this Milestone, we will also set up the corresponding infrastructure and scripts to have the adapted tools run continuously, as needed. The tools we envision will be useful are:

  • Hermes: a lightweight node that maintains connections to network nodes, subscribes to topics and collects valuable traces.
  • Nebula: a network crawler to identify F3 node population and statistics.
  • ants-watch: a DHT client monitoring tool. It is able to log the activity of all nodes in a DHT network by carefully placing ants in the DHT keyspace.
  • Akai: a generic data availability sampler - we will likely use it for measuring certificate exchange metrics

Milestone 2: Metrics for F3

Effort: 2PM

Funding: $60k

Staff: 2 engineers

ETA: 1-1.5 months after completion of Milestone 1

Description: We will use the monitoring tools adapted in Milestone 1 to track the metrics listed under the “Impact” section (listed again below). We will use a variety of techniques for this purpose (e.g., listening in to DECIDE messages on the gpbft pubsub topic, listening in to the stream of finality certs and counting the proportion of nodes that have signed a cert), and we’ll adapt the methodology as needed.

  • Time it takes for an F3 instance to finish.
  • Distance of the latest finalized tipset from the chain head.
  • Nodes participating in F3 relative to the network size.
  • Finality certificate exchange metrics:
    1. On average how caught up are nodes in the network relative to the latest available finality cert.
    2. How responsive nodes are in pulling certificates from each other.
    3. This can be measured by randomly sampling nodes on the network and hitting cert exchange APIs.
  • Gossipsub efficiency in terms of bandwidth consumption, starting from number of duplicate messages.
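
The random-sampling approach to the certificate exchange metrics can be sketched as follows. `fetch_latest_instance` is a hypothetical callable standing in for whatever cert exchange query the tooling ends up using; no real API is implied, and the result fields are illustrative.

```python
import random


def sample_cert_lag(peers, fetch_latest_instance, network_latest,
                    sample_size, rng=None):
    """Sample peers at random and report how far behind the latest cert they are.

    peers: list of peer identifiers.
    fetch_latest_instance: hypothetical callable mapping a peer to the instance
        number of its latest finality cert, or None if the peer did not respond.
    network_latest: instance number of the latest known finality cert.
    """
    rng = rng or random.Random()
    chosen = rng.sample(peers, min(sample_size, len(peers)))
    lags, unresponsive = [], 0
    for peer in chosen:
        latest = fetch_latest_instance(peer)
        if latest is None:
            unresponsive += 1
            continue
        # A peer may already know a newer cert than the reference; clamp at 0.
        lags.append(max(0, network_latest - latest))
    return {
        "sampled": len(chosen),
        "unresponsive": unresponsive,
        "mean_lag": sum(lags) / len(lags) if lags else None,
        "max_lag": max(lags) if lags else None,
    }
```

Counting unresponsive peers separately keeps the "how caught up" figure from being skewed by nodes that never answered, while still exposing responsiveness as its own number.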

Milestone 3: Audit of Gossipsub Parameters & Final Report

Effort: 0.5PM

Funding: $15k

Staff: 1 engineer

ETA: 2 weeks after completion of Milestone 2

Description: Gossipsub is a complex protocol. It includes several interconnected functions that, when configured properly, can boost the performance of the network and protect it from undesired behaviour. However, the optimal settings and parameters of the protocol differ depending on the characteristics of the network.

In this Milestone, we’ll dive into the setup of the F3 network. We will review and audit the hardcoded Gossipsub parameters of the particular Lotus F3 setup and monitor its performance with regard to the number of duplicate messages, the bandwidth requirement of F3 nodes and their behaviour as “citizens” in the Gossipsub mesh. We will make recommendations regarding the optimal settings, depending on the results we observe. Optionally, we will set up the right infrastructure to continuously monitor and publish metrics on the performance of Gossipsub in the F3 network.
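
The duplicate-message metric could be derived from Gossipsub trace events along these lines. This is a sketch only: the flattened `(event_type, topic)` tuples are a hypothetical input format. The event names mirror the `DELIVER_MESSAGE` and `DUPLICATE_MESSAGE` trace event types of libp2p pubsub tracing, but how the tooling will actually record traces is not specified in this proposal.

```python
from collections import Counter


def duplicates_per_delivery(events):
    """Return {topic: duplicates / deliveries} for topics with at least one delivery."""
    delivered, duplicated = Counter(), Counter()
    for event_type, topic in events:
        if event_type == "DELIVER_MESSAGE":
            delivered[topic] += 1
        elif event_type == "DUPLICATE_MESSAGE":
            duplicated[topic] += 1
        # All other event types (e.g. GRAFT, PRUNE) are ignored here.
    return {topic: duplicated[topic] / count for topic, count in delivered.items()}
```

A ratio of 2.0 means that, on average, each delivered message on that topic arrived two more times as a duplicate. Tracking this ratio per topic before and after a parameter change is one way to judge whether the change helped.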

Milestone 4: Maintenance, Infra costs and metrics for 1 year

Effort: 0.5PM

Funding: $15k

Staff: N/A - see description

ETA: N/A - see description

Description: This Milestone includes the continuous monitoring and production of plots and dashboards for 1 year. Work includes the maintenance and updating of tooling, but also of the infrastructure used to run the tools, together with the costs involved. With funding from this Milestone, we commit to maintaining the tooling and producing results for 1 year after the completion of the project. A separate contract will be needed after the end of that period in order to continue producing results for a second year (and beyond).

Total Budget Requested

| Milestone # | Description | Deliverables | Completion Date | Funding |
| --- | --- | --- | --- | --- |
| Milestone 1 | Tooling adaptation | Repository | 1 month (max) after kick off | $45k |
| Milestone 2 | Metrics for F3 | Plots/Dashboards | 2-2.5 months after kick off | $60k |
| Milestone 3 | Gossipsub Audit | Report | 3 months after kick off | $15k |
| Milestone 4 | Metrics for 1 year | Plots/Dashboards | 1 year | $15k |

Total budget requested: $135k

Maintenance and Upgrade Plans

Maintenance, upgrade and infrastructure costs are included in Milestone 4 above.

Team

Team Members

  • Team Member 1: Yiannis Psaras, @yiannisbot (Team Lead)
  • Team Member 2: Dennis Trautwein, @dennis-tra (Software Engineer)
  • Team Member 3: Mikel Cortes, @cortze (Software Engineer)
  • Team Member 4: Steph Samson, @kasteph (Infrastructure Engineer)

Team Member LinkedIn Profiles

  • Yiannis Psaras LinkedIn profile
  • Dennis Trautwein LinkedIn profile
  • Mikel Cortes LinkedIn profile

Team Website

  • https://probelab.io: Results, plots and dashboards
  • https://probelab.network: Team & Services

Relevant Experience

The ProbeLab team was part of Protocol Labs for multiple years (until January 2024) and has focused on monitoring and measurement studies for IPFS and libp2p-based networks for several years. The team has extensive experience in building monitoring and measurement tooling, as well as the relevant infrastructure. Apart from the metrics and tools that the team maintains, which can be found at https://probelab.io/, the team has carried out detailed studies for both IPFS and libp2p. These studies can be found at: https://github.com/probe-lab/network-measurements/tree/master/results.

The team's expertise, client base, and record of completed projects put ProbeLab in a unique position with regard to the skillset and quality of results it can deliver. By way of illustration, the team has worked on the following networks over the years: Ethereum, Gnosis, Celestia, Polkadot, Filecoin, Avail, IPFS, Base and Optimism.

Most recently, the team has run multiple projects to monitor the operation of Gossipsub in the Ethereum network. Here are a few sample reports that resulted from those projects:

  • https://ethresear.ch/t/behind-the-scenes-of-ethereums-pectra-upgrade-a-data-driven-analysis/22665
  • https://ethresear.ch/t/bandwidth-availability-in-ethereum-regional-differences-and-network-impacts/21138
  • https://ethresear.ch/t/impact-of-idontwant-in-the-number-of-duplicates/22652/1
  • https://ethresear.ch/t/gossipsub-network-dynamicity-through-grafts-and-prunes/19750
  • https://ethresear.ch/t/gossip-iwant-ihave-effectiveness-in-ethereums-gossipsusb-network/19686

Team code repositories

  • GH Org: https://github.com/orgs/probe-lab
  • Other tools used:
    • https://github.com/probe-lab/hermes
    • https://github.com/probe-lab/tracecatcher
    • https://github.com/probe-lab/ants-watch
    • https://github.com/probe-lab/caracol

Additional Information

Contact email: [email protected]

yiannisbot, Jun 25 '25 10:06

For any further coordination/support from FilOz, please reach out to @BigLep.

masih, Jun 25 '25 10:06

Hi @yiannisbot, thank you for your proposal! We will be in touch with any questions or updates.

FF-FOIT, Jun 25 '25 18:06

@FF-FOIT: this is a solid proposal from a team with a very solid track record. I believe this kind of monitoring, especially during its first year after activation, is critical for understanding the impacts of F3 on the network, and for giving groups confidence to rely on F3. Let me or @kubuxu know if the foundation needs any input about this work from the perspective of "F3 implementers at FilOz".

BigLep, Jun 26 '25 20:06

@yiannisbot Since we are connecting in other threads, I have closed this item. Looking forward to proceeding with your work for F3 Network Metrics!

FF-FOIT, Oct 28 '25 14:10