Is Prometheus suitable for per-stream statistics of SRS media streams?
Our scenario when using Prometheus is: for each video stream, we collect bitrate, fps, and other metrics in real time. Our Prometheus is self-hosted and limited by the storage capacity of a single machine, and we use Grafana for visualization. In practice we found the data volume is too large and Prometheus easily hits performance bottlenecks. We would like to discuss whether Prometheus is only suitable for collecting aggregate, cluster-wide information, and not suitable for monitoring the state of each individual stream.
Take a look at the example in the Prometheus best-practices documentation, Use labels:
To give you a better idea of the underlying numbers, let's look at node_exporter. node_exporter exposes
metrics for every mounted filesystem. Every node will have in the tens of timeseries for, say,
node_filesystem_avail. If you have 10,000 nodes, you will end up with roughly 100,000 timeseries for
node_filesystem_avail, which is fine for Prometheus to handle.
If you were to now add quota per user, you would quickly reach a double digit number of millions with
10,000 users on 10,000 nodes. This is too much for the current implementation of Prometheus. Even
with smaller numbers, there's an opportunity cost as you can't have other, potentially more useful
metrics on this machine any more.
The documentation says that for the metric node_filesystem_avail, with ten thousand machines and roughly ten series per machine, there would be about one hundred thousand time series in total, which Prometheus handles perfectly well. But if you also wanted to collect a per-user quota for ten thousand users on those ten thousand machines, you would end up with a hundred million time series, which is beyond what the current implementation of Prometheus can handle.
What this means, of course, is to use labels rather than creating a separate metric for each stream; typically there should only be a few dozen metrics in total, not hundreds or thousands.
As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed
that, aim to limit them to a handful across your whole system. The vast majority of your metrics should
have no labels.
For example, rather than http_responses_500_total and http_responses_403_total, create a single metric
called http_responses_total with a code label for the HTTP response code. You can then process the entire
metric as one in rules and graphs.
In other words, if you want to break a metric down into categories, use labels. For example, instead of defining two metrics, http_responses_500_total and http_responses_403_total, define a single metric http_responses_total with an additional code label.
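For instance, with the Go client library prometheus/client_golang, that single labeled metric could be defined roughly as in the sketch below (names follow the example above; this is an illustration only, not code from SRS):

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// One counter family for all HTTP responses, partitioned by status code,
// instead of a separate metric per code.
var httpResponses = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_responses_total",
		Help: "Total HTTP responses, partitioned by status code.",
	},
	[]string{"code"},
)

func onServerError() {
	// "500", "403", "200", ... all share the same metric name,
	// so rules and graphs can process them as one.
	httpResponses.WithLabelValues("500").Inc()
}
```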
I am not sure what performance issues you are hitting with Prometheus. How many machines do you have? How many streams? How are the metrics defined, and how are the labels defined?
This scenario:
- Thousands of streams, collecting data every 10 seconds.
- Collecting metrics for each playing stream, while adding several different labels for each stream.
- Label 1: Stream ID
- Label 2: Start time of the stream
- Label 3: End time of the stream
- Label 4: Other metrics of the stream...
- Prometheus deployed on a single machine, storing data for 15 days, and displayed using Grafana.
In this scenario, aggregating data in Grafana may run into performance problems when matching and filtering on the different labels: for example, querying all data for a specific stream via the stream ID label, querying all streams within a certain time window, querying all streams with poor network conditions, or querying streams whose publisher reconnects frequently.
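To make the cardinality concern concrete, here is a rough Go sketch with prometheus/client_golang of what such a per-stream gauge might look like; the metric and label names are made up for illustration and are not actual SRS exporter metrics:

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical per-stream bitrate gauge.
var streamBitrate = promauto.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "srs_stream_bitrate_kbps",
		Help: "Current bitrate of each active stream in kbps.",
	},
	// A bounded identifier like the stream ID yields one series per active
	// stream. Labels carrying ever-changing values (start time, end time)
	// would spawn a brand-new series for every value and explode cardinality.
	[]string{"stream_id"},
)

func reportBitrate(streamID string, kbps float64) {
	streamBitrate.WithLabelValues(streamID).Set(kbps)
}
```

With thousands of streams scraped every 10 seconds, even this single gauge already produces thousands of active series, and every extra label multiplies that number.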
Prometheus metrics are generally meant for aggregation. Values such as start time and end time are not suitable to store in Prometheus; they belong in a log system like ELK or in an APM/Trace system. After those systems process and filter the data, the results can also be displayed through Grafana. For more details, see the article Metrics, tracing, and logging.
Generally speaking, Prometheus falls into the Metrics category: it is used for alerting and aggregates many measurements, so the data stored in Prometheus should be relatively small. For example, to alert on stream problems, you would collect a per-stream error count metric and aggregate it into normal versus abnormal streams across the whole network.
Querying streams within a specific time window, or analyzing streams with poor network conditions, is more of a job for data-analysis tools such as ELK or APM. Those tools are also part of the operations system; you should not rely solely on alerting, Prometheus, and metrics for everything, because overusing them puts a heavy load on the system and makes queries slow.
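To illustrate that split, a minimal Go sketch with prometheus/client_golang might look like the following; the metric name srs_stream_errors_total and the error kinds are hypothetical, not actual SRS metrics:

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical error counter: the label holds a small, bounded set of error
// kinds rather than individual stream IDs, so cardinality stays low.
var streamErrors = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "srs_stream_errors_total",
		Help: "Total stream errors, partitioned by error kind.",
	},
	[]string{"kind"},
)

func onPublishTimeout() {
	streamErrors.WithLabelValues("publish_timeout").Inc()
}
```

An alerting or recording rule can then aggregate this across the whole network, for example sum by (kind) (rate(srs_stream_errors_total[5m])), while per-stream forensics such as "which stream reconnected most often" stays in the log or trace system.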
Want to add me on WeChat to chat? We are currently designing the official SRS exporter and would welcome your participation.
In general, unless you are dealing with something like a hundred thousand streams or a million plays, Prometheus is entirely capable.
SRS now supports a Prometheus Exporter, and we will keep adding new metrics. See #2899.
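For reference, the exporter pattern itself is small: a process exposes an HTTP /metrics endpoint in the Prometheus text format and Prometheus scrapes it. Below is a minimal Go sketch of that pattern (an illustration only; the real SRS exporter is built into SRS itself, and the port here is arbitrary):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Expose every metric registered with the default registry at /metrics;
	// Prometheus scrapes this endpoint on its configured interval.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":2112", nil)
}
```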
Is there a conclusion yet? https://github.com/bluenviron/mediamtx#metrics exposes statistics for each stream; I don't know how many streams it can handle.
Update: for about 99% of use cases, which is to say virtually all scenarios, Prometheus can support stream-level monitoring data. SRS will gradually improve this going forward.