Node metrics for speeds
As part of our ongoing effort to improve upload and download speeds in the network, we need better visibility into network performance across different regions. Currently, our data comes mainly from nodes running in Europe, which gives us a limited perspective. To optimize performance globally, we need input from node operators across different locations (e.g., China, Americas, etc.).
We are asking Swarm node operators to voluntarily share key metrics related to upload/download speeds. This will help us:
- Identify regional performance bottlenecks
- Understand the impact of node distribution on retrieval times
- Improve Swarm’s global data availability
First, we need to define the key values we want to collect from the nodes. Operators who choose to participate should provide information in the following areas:
- Geographical region (continent or country)
- Upload speed (average, max, min)
- Download speed (average, max, min)
- Latency to other nodes (if measurable)
- Time taken to retrieve a specific file (X MB in size)

These values should be gathered over a period of at least a few days to account for variability in network conditions.
Node operators can opt in to report the data and can opt out at any time if they no longer wish to share it.
This needs to be part of the 2.5 release in order to better assess the improvements.
To achieve the collection of metrics, we need the following:
- Currently, metrics from the Bee client are exposed on a Bee endpoint, meaning the metrics are pulled from the client. We need a separate mechanism that pushes the metrics to a metrics collector endpoint, similar to how the Beekeeper runtime pushes metrics to a central collector.
- The push mechanism can be enabled by a feature flag among the configuration options. The endpoint to which metrics will be pushed should also be behind a config option, with the default pointing to a collector run by the Swarm org (see the sketch after this list).
- The Prometheus collector system needs to be set up by the infra team, and exposed via a single endpoint.
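A minimal sketch of what the push mechanism could look like, assuming a Prometheus Pushgateway-style collector; the flag names and default endpoint are illustrative, not existing Bee options:

```go
// Sketch of an opt-in metrics push loop. The config values
// (pushEnabled, pushEndpoint) are hypothetical stand-ins for
// Bee configuration options that do not exist yet.
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// Hypothetical config values; in Bee these would come from
	// the configuration file / CLI flags.
	pushEnabled := true
	pushEndpoint := "https://metrics-collector.example.org" // assumed default

	if !pushEnabled {
		return // feature is opt-in; do nothing unless enabled
	}

	// Reusing the same registry that backs the /metrics endpoint
	// would keep pushed and scraped data consistent.
	registry := prometheus.NewRegistry()
	pusher := push.New(pushEndpoint, "bee_node_metrics").Gatherer(registry)

	for range time.Tick(5 * time.Minute) {
		if err := pusher.Push(); err != nil {
			// Retry on the next tick; pushing must never
			// interfere with normal node operation.
			continue
		}
	}
}
```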
For the specific metrics, we need an investigation to see whether the required metrics are already captured by the Bee client or whether we need to introduce new ones.
We also need to investigate how to transmit geographical region data as part of the metrics and how to display the data in a presentable format. One possible solution is a middleware layer running as part of the collector system that translates the IP of incoming requests to a geographic region and transforms the pushed metric to also include the region (a sketch follows below). Other solutions may prove more viable.
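A sketch of such a middleware, assuming the collector is an HTTP service; lookupRegion stands in for a GeoIP database lookup and is not an existing API:

```go
// Collector-side middleware that tags incoming pushes with a region
// derived from the client IP before the metrics are stored.
package collector

import (
	"net"
	"net/http"
)

// lookupRegion is hypothetical; a real implementation would query a
// GeoIP database and return a continent or country code.
func lookupRegion(ip string) string { return "EU" }

func regionMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		host, _, err := net.SplitHostPort(r.RemoteAddr)
		if err != nil {
			host = r.RemoteAddr
		}
		// Attach the resolved region so a later stage can add it
		// as a label on the pushed samples.
		r.Header.Set("X-Swarm-Region", lookupRegion(host))
		next.ServeHTTP(w, r)
	})
}
```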
As an opt-in feature, I support it. The downside of the opt-in approach is that the received data may be biased, since operators who are willing to participate may share other common characteristics. The opt-out approach yields more diverse data, but I would not advocate for it, as it violates privacy in a non-transparent way. Many projects choose their approach based on their core values; for Swarm I think opt-in is better, though the best would be to not have it at all.
Metrics
The metrics as defined also require a time period over which the statistical calculations (min, max, average) are taken. Alternatively, every upload and download speed and latency can be counted into carefully chosen buckets in Prometheus histograms, or another histogram-like structure if Prometheus is not used. That would provide more flexibility for interpretation, at the cost of precision.
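For illustration, a histogram with exponential buckets, assuming the Prometheus Go client is used; the metric name and bucket boundaries are placeholders to be tuned:

```go
// Histogram-based alternative to explicit min/max/average.
package collector

import "github.com/prometheus/client_golang/prometheus"

var downloadSpeed = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "bee",
	Name:      "download_speed_bytes_per_second",
	Help:      "Observed download speeds, bucketed for later analysis.",
	// Carefully chosen buckets replace explicit min/max/average;
	// the trade-off is interpretive flexibility vs. precision.
	// 12 exponential buckets from 64 KiB/s up to 128 MiB/s.
	Buckets: prometheus.ExponentialBuckets(64*1024, 2, 12),
})

func init() {
	prometheus.MustRegister(downloadSpeed)
}

// ObserveDownload records a single measured transfer.
func ObserveDownload(bytes int, seconds float64) {
	downloadSpeed.Observe(float64(bytes) / seconds)
}
```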
Geographical region
Geographical region can be determined from the IP address that connects to the metrics collector endpoint. This should be reliable enough if we assume that not many nodes connect through cross-region network proxies.
Prometheus
The easiest way to get metrics is, as Esad described, a Prometheus metrics collector that is open to the internet. This requires an appropriate level of hardening and security for that endpoint.
This also comes with one caveat, and that is the question of trust in the received metrics. Anyone can send bogus Prometheus metrics, either to pollute the data we collect or to attack the endpoint. I am not sure how other open-source projects have solved that issue. I know that Go has opt-in telemetry, but I am not familiar with its technical aspects. VS Code has it as well, and others.
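One minimal hardening option, sketched below under the assumption that each participating operator holds a shared bearer token; this raises the bar against bogus submissions but does not fully solve the trust problem, since any token holder can still push arbitrary values:

```go
// Require a bearer token on pushes to the collector endpoint.
package collector

import (
	"crypto/subtle"
	"net/http"
)

func authMiddleware(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got := r.Header.Get("Authorization")
		want := "Bearer " + token
		// Constant-time comparison to make token guessing harder.
		if subtle.ConstantTimeCompare([]byte(got), []byte(want)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```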
Alternative metrics gathering approaches
One way to get metrics without exposing Prometheus to the internet is to utilize Swarm's p2p protocols, similar to how the status protocol exposes some of the Bee information. In that way, a network scanner similar to swarmscan would crawl the nodes to get the data. The downside is that we would not have data from unreachable nodes. This is probably a more complex approach, and since other projects use the public endpoint approach, that is probably enough.
I am just raising awareness of the pros and cons of the different approaches; probably the simplest solution (Prometheus) is the best in our case, at least for a start.
A completely different approach (as opposed to building this within Bee somehow) would be to have workloads run on cloud providers that spin up ephemeral nodes, collect the required metrics, file their reports, and then power down.
The question is also "what is upload" and "what is download". Is this based on overall bandwidth use of libp2p, or on retrieval / pushsync / pullsync?
In general, I don't think that stats collection / injection into the client itself is optimal. My opinion would be to let the client concentrate on pushing / pulling chunks around the place and to measure from a targeted position (it also seems idiomatic to measure "what a new client / user of the swarm could expect").
@mfw78 while your proposal would provide baseline metrics, it wouldn't capture the diversity of real environments where nodes operate: the varying hardware, ISPs, NATs, etc.
I support the Prometheus approach at least for a start.
Just to mention another tool for the job https://opentelemetry.io/.
Since we are using the OpenTracing library, which is an archived project, we might switch to OpenTelemetry, as they also suggest on the main OpenTracing webpage.
My suggestion would be to put metrics collection into a separate app, or maybe something like bee-dashboard. This could scrape any of the existing metrics from the Bee /metrics endpoint and push them to a Prometheus Pushgateway server, where they are then collected by a Prometheus instance. In addition to the Bee metrics, we could add some info about the configuration, like CPU, RAM, and the RPC endpoint. On top of that, the app could run a 10 MB file download test to measure failure rates and speed (a sketch follows the metrics list below); we should probably use one of the reference files that are used for SLA metrics. We should probably not collect all the metrics from the /metrics endpoint but just the performance- or error-related ones, since multiplied by many Bees on the network the storage of such data would be quite significant. The biggest challenge here would be how such a separate app would discover all of an operator's Bees, or how to make it as easy as possible for operators who run many nodes to collect from all of them. On a containerization platform such as Docker or Kubernetes it should be fairly straightforward, although such an automated method would probably require elevated permissions.
Suggested metrics:
- bee_localstore_method_calls_duration_bucket
- bee_retrieval_request_attempts_bucket
- bee_retrieval_retrieve_chunk_time_bucket
- bee_libp2p_headers_exchange_duration_bucket
- bee_transaction_method_duration_bucket
- bee_retrieval_request_duration_time_bucket
- bee_retrieval_request_duration_time_count
- bee_retrieval_request_duration_time_sum
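A minimal sketch of the 10 MB download test mentioned above, assuming a fixed reference file fetched through a local Bee gateway; the reference hash is a placeholder, not a real value:

```go
// Timed download of a known reference file to measure speed and
// failure rate. The URL below is illustrative only.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Placeholder: a reference file also used for SLA metrics.
	url := "http://localhost:1633/bzz/<reference>/"

	start := time.Now()
	resp, err := http.Get(url)
	if err != nil {
		fmt.Println("download failed:", err) // counts toward the failure rate
		return
	}
	defer resp.Body.Close()

	n, err := io.Copy(io.Discard, resp.Body)
	if err != nil {
		fmt.Println("download aborted:", err)
		return
	}
	elapsed := time.Since(start).Seconds()
	fmt.Printf("downloaded %d bytes in %.2fs (%.2f MB/s)\n",
		n, elapsed, float64(n)/elapsed/1e6)
}
```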
Who has the next step on this issue? Are we waiting for anything from DevOps? @nikipapadatou ?