[RFC] Migrating Metrics From Performance Analyzer to OpenTelemetry Framework
Introduction
This RFC proposes migrating metrics from the OpenSearch Performance Analyzer (a plugin designed to gather system- and application-level metrics) to the OpenTelemetry framework, in light of the recent integration of OpenTelemetry as a trace/metrics collector within OpenSearch, and eventually deprecating the Performance Analyzer plugin.
Background
OpenSearch Performance Analyzer has been a valuable plugin within OpenSearch, offering insights into system- and application-level performance. With the advancement of observability frameworks and the community's move towards standardization, OpenSearch has integrated OpenTelemetry as a metrics collector. This presents an opportunity to streamline metrics collection onto a single framework and to improve the maintainability and performance of the metrics collection workflow.
Motivation
- Unified Metrics Collection: The integration of OpenTelemetry provides us a comprehensive metrics collection framework that can potentially replace the functionality of Performance Analyzer. Consolidating our metrics collection tools will simplify the architecture and reduce the complexity of our system.
- Reduce Maintenance Overhead: Maintaining two metrics collection tools is resource-intensive. PA relies on a home-grown metrics collection framework that is not an industry standard. By focusing our efforts on a single framework (OpenTelemetry), we can ensure that we provide the best possible support and updates.
- Community Adoption: OpenTelemetry has gained significant traction in the community, leading to more integrations, tools, and extensions that our users can benefit from.
- Performance: OpenTelemetry is a widely-adopted project with optimizations and improvements being made continuously. Leveraging its capabilities can potentially offer better performance and resource utilization compared to maintaining our custom solution (PA/RCA).
Proposal
- Deprecation Notice: We can begin by adding a deprecation notice on the Performance Analyzer's README and documentation. Inform users about the planned deprecation and the timeline for discontinuing support.
- Migration Plan: Come up with a detailed migration plan which covers:
- What are the different types of metrics we collect in Performance Analyzer
- For each category, how to obtain the exact same metrics previously gathered by Performance Analyzer using OpenTelemetry.
- For the downstream components that consume PA metrics, how to maintain consistency.
- Run the new metrics system in shadow mode for some time (?), i.e. emit through both the existing PA pipeline and OpenTelemetry in parallel and compare the results (see the sketch after this list).
- Deprecation: Once we are confident in the new metrics collection workflow, officially deprecate the Performance Analyzer.
- Stopping active development and support.
- Archiving the repository or clearly marking it as deprecated.
- Removal: In a subsequent major release of OpenSearch, completely remove the Performance Analyzer from the codebase and documentation.
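To make the shadow-mode step concrete, below is a minimal sketch (not part of the current PA code) of emitting one metric through an OpenTelemetry asynchronous gauge while the existing PA collector keeps writing to shared memory, so both outputs can be compared. The `PAWriter` interface and the metric name `jvm.memory.heap.used` are hypothetical placeholders; only the `java.lang.management` and OpenTelemetry API calls are standard.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.Meter;

public class ShadowModeHeapMetrics {

    /** Hypothetical stand-in for the existing PA shared-memory writer. */
    interface PAWriter {
        void write(String metricName, double value);
    }

    public static void register(PAWriter paWriter) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        Meter meter = GlobalOpenTelemetry.getMeter("org.opensearch.pa-migration");

        // New path: OpenTelemetry asynchronous gauge, sampled on every export cycle.
        meter.gaugeBuilder("jvm.memory.heap.used")
             .setUnit("By")
             .setDescription("Heap memory currently used (shadow of the PA HeapMetricsCollector)")
             .buildWithCallback(measurement ->
                 measurement.record(memory.getHeapMemoryUsage().getUsed()));

        // Old path: the existing PA collector keeps writing its JSON file in shared
        // memory on its own schedule; shown here as a single call for illustration.
        paWriter.write("Heap_Used", memory.getHeapMemoryUsage().getUsed());
    }
}
```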
Appendix
Categories of PA (the plugin) Collectors
- Host level metrics: collected by directly reading host/node level system statistics (e.g. from /proc files).
- Service level metrics: collected directly from the OpenSearch application; it uses the OpenSearchResource object which is created when the PA plugin is loaded and contains OpenSearch related data like threadPool, environment, indicesService, etc.
- Metrics with reflection: collected by using Java reflection to get metrics from a library.
- JVM level metrics: collected from the JVM directly by using GarbageCollectorMXBean etc. (see the sketch after this list).
- Service level metrics with API: collected by calling an API.
- PA internal metrics: Collects internal metrics from PA/RCA framework, not related to OpenSearch Core.
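As an illustration of the "JVM level metrics" category, the sketch below exposes data from GarbageCollectorMXBean (the same source the HeapMetricsCollector and GCInfoCollector use today) through OpenTelemetry asynchronous counters. This is a rough sketch rather than the proposed implementation; the metric names are invented, and only the `java.lang.management` and OpenTelemetry API calls are standard.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.Meter;

public class JvmGcOtelMetrics {

    private static final AttributeKey<String> GC_NAME = AttributeKey.stringKey("gc.name");

    public static void register() {
        Meter meter = GlobalOpenTelemetry.getMeter("org.opensearch.pa-migration");

        // Cumulative collection count per garbage collector, reported as an
        // asynchronous (observable) counter. getCollectionCount() may return -1
        // when the JVM does not expose it, so such beans are skipped.
        meter.counterBuilder("jvm.gc.collection.count")
             .setDescription("Total number of collections per garbage collector")
             .buildWithCallback(measurement -> {
                 for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                     if (gc.getCollectionCount() >= 0) {
                         measurement.record(gc.getCollectionCount(), Attributes.of(GC_NAME, gc.getName()));
                     }
                 }
             });

        // Cumulative collection time per garbage collector, in milliseconds.
        meter.counterBuilder("jvm.gc.collection.time")
             .setUnit("ms")
             .setDescription("Total time spent in collections per garbage collector")
             .buildWithCallback(measurement -> {
                 for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                     if (gc.getCollectionTime() >= 0) {
                         measurement.record(gc.getCollectionTime(), Attributes.of(GC_NAME, gc.getName()));
                     }
                 }
             });
    }
}
```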
Collector Name | Type | Details: how are the metrics collected | Migrate to ..? | Feasible or not?
---|---|---|---|---
OSMetricsCollector | Host level metrics | Customized data generators gather CPU, disk, and scheduling related metrics by reading "/proc" files in a blocking way. The metrics are then gathered by the OSMetricsCollector and forwarded to the JSON file in shared memory. | Other agent outside of the OpenSearch process / OTel collector | Feasible
DisksCollector | Host level metrics | Customized data generators gather disk related metrics by reading the "/proc/diskstats" file in a blocking way. The metrics are then gathered by the DisksCollector and forwarded to the JSON file in shared memory. | Other agent outside of the OpenSearch process / OTel collector | Feasible
NetworkInterfaceCollector | Host level metrics | Customized data generators gather network related metrics by reading the "/proc/net/snmp", "/proc/net/snmp6", and "/proc/net/dev" files in a blocking way. The metrics are then gathered by the NetworkInterfaceCollector and forwarded to the JSON file in shared memory. | Other agent outside of the OpenSearch process / OTel collector | Feasible
HeapMetricsCollector | JVM level metrics | Uses the GarbageCollectorMXBean and MemoryMXBean in the java.lang.management library to get metrics related to the JVM. | Core | Feasible
GCInfoCollector | JVM level metrics | Gets GC related info from GarbageCollectorMXBeans. | Core | Feasible
CircuitBreakerCollector | Service level metrics | From the circuitBreakerService passed from OpenSearch. | Core | Feasible
NodeDetailsCollector | Service level metrics | From the clusterService passed from OpenSearch. | Core | Feasible
ClusterManagerServiceMetrics | Service level metrics | Gets the pending task stats from clusterService.clusterManagerService. | Core | Feasible
ShardStateCollector | Service level metrics | Gets shard state metrics for each shard in each index using the routingTable data within the clusterService passed from OpenSearch. | Core | Feasible, but need to check the CPU level metrics coming from threads.
ElectionTermCollector | Service level metrics | Gets the election term metric from the clusterService passed from OpenSearch. | Core | Feasible
ThreadPoolMetricsCollector | Service level metrics (with reflection) | Metrics are obtained by calling the stats() function on the threadPool object passed from OpenSearch; Java reflection is used to get the capacity of the thread pool. | Core | Feasible. Migrating to core means we can directly send thread pool level metrics without using reflection (see the sketch after this table).
CacheConfigMetricsCollector | Service level metrics (with reflection) | From the indicesService passed from OpenSearch; uses Java reflection to ensure backward compatibility. The indicesService is provided by DI and the binding is defined here. | Core | Feasible
NodeStatsAllShardsMetricsCollector | Service level metrics (with reflection) | From the indicesService passed from OpenSearch; gets the increment of the high level stats for all shards by calculating the diff against the previous shard stats. | Core | Feasible
NodeStatsFixedShardsMetricsCollector | Service level metrics (with reflection) | Similar to NodeStatsAllShardsMetricsCollector; from the indicesService passed from OpenSearch, gets more detailed metrics for specific shards chosen by the user via shardsPerCollection. | Core | Feasible
ClusterManagerServiceEventMetrics | Service level metrics (with reflection) | Gets cluster manager task event data from the clusterManagerService object passed from OpenSearch. | Core | Feasible
ClusterManagerThrottlingMetricsCollector | Service level metrics (with reflection) | Gets throttling metrics via reflection on org.opensearch.action.support.clustermanager.ClusterManagerThrottlingRetryListener, from the clusterService passed from OpenSearch. | Core | Feasible
ClusterApplierServiceStatsCollector | Service level metrics (with reflection) | "ClusterApplierServiceStats in ES is a tracker for total time taken to apply cluster state and the number of times it has failed." This collector uses the ClusterApplierService from OpenSearch. | Core | Feasible
AdmissionControlMetricsCollector | Service level metrics (with reflection) | Uses the admissionController from com.sonian.opensearch.http.jetty.throttling.JettyAdmissionControlService in OpenSearch to get AdmissionControl related metrics. | Core | Feasible
ShardIndexingPressureMetricsCollector | Service level metrics (with reflection) | Gets indexing pressure related metrics from the clusterService passed from OpenSearch, using classes like org.opensearch.index.ShardIndexingPressureStore, org.opensearch.index.IndexingPressure, and org.opensearch.index.ShardIndexingPressure. | Core | Feasible
FaultDetectionMetricsCollector | PA internal metrics | PA internal queue fault metrics (?). Gets the FaultDetectionHandlerMetricsQueue from org.opensearch.performanceanalyzer.handler.ClusterFaultDetectionStatsHandler and emits metrics based on each entry. | Deprecate | Feasible
StatsCollector | PA internal metrics | PA internal metrics stats collector. | Deprecate | Feasible
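To illustrate the ThreadPoolMetricsCollector note above (migrating to core removes the need for reflection), the rough sketch below reads queue depth and active thread counts through the public ThreadPool.stats() API. It assumes ThreadPoolStats.Stats exposes getName()/getQueue()/getActive() accessors as in current OpenSearch core, and the metric names are invented.

```java
import org.opensearch.threadpool.ThreadPool;
import org.opensearch.threadpool.ThreadPoolStats;

import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.Meter;

public class ThreadPoolOtelMetrics {

    private static final AttributeKey<String> POOL_NAME = AttributeKey.stringKey("thread_pool.name");

    /**
     * Registers asynchronous gauges that read thread pool queue depth and active
     * thread count directly from ThreadPool.stats(), with no reflection involved.
     */
    public static void register(Meter meter, ThreadPool threadPool) {
        meter.gaugeBuilder("threadpool.queue.size")
             .ofLongs()
             .buildWithCallback(measurement -> {
                 for (ThreadPoolStats.Stats stats : threadPool.stats()) {
                     measurement.record(stats.getQueue(), Attributes.of(POOL_NAME, stats.getName()));
                 }
             });

        meter.gaugeBuilder("threadpool.active.threads")
             .ofLongs()
             .buildWithCallback(measurement -> {
                 for (ThreadPoolStats.Stats stats : threadPool.stats()) {
                     measurement.record(stats.getActive(), Attributes.of(POOL_NAME, stats.getName()));
                 }
             });
    }
}
```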
@ansjcy, thanks for putting this up. Utilizing the OpenSearch telemetry framework for emitting these metrics does seem promising. The PA plugin generators are already well-written, making them easily reusable. Since these metrics are ideally part of a plugin rather than being merged directly into the core, migrating them to the OpenSearch telemetry framework within the PA plugin sounds like a sensible approach.
thoughts here @reta @backslasht @msfroh @khushbr @Bukhtawar
Agree with @Gaganjuneja. OpenSearch already collects tons of metrics but exposes them through REST APIs; using the newly developed metric providers, we certainly could unify the approach. Thanks @ansjcy!
+1, I like the idea of migrating the Performance Analyzer plugin metrics into the OpenTelemetry format.
But I would like to understand a bit more about the deprecation of the "Performance Analyzer" plugin part:
- are you suggesting to move the logic into a new plugin which will emit these metrics in OTel format and once that is done deprecate "Performance Analyzer" plugin OR
- are you suggesting to move the metrics collection into core?
Thank you, @reta and @backslasht, for your prompt responses. My suggestion is to retain these metrics within the "Performance Analyzer" plugin for the time being, given its extensive collection of operating system metrics. To facilitate this, we can pass the MetricsRegistry from the core to the Performance Analyzer plugin and initiate the migration of metrics to utilize an OpenTelemetry-based metrics registry for publishing purposes. Eventually, we can deliberate on the feasibility of integrating this plugin entirely into the core, taking into consideration the implications of backporting as well.
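To ground the suggestion above, here is a rough sketch of a PA collector publishing through the MetricsRegistry once it is passed from core into the plugin. It assumes the telemetry-api MetricsRegistry offers createCounter(name, description, unit) returning a Counter with an add(value, tags) method; the collector class and metric name are invented for illustration.

```java
import org.opensearch.telemetry.metrics.Counter;
import org.opensearch.telemetry.metrics.MetricsRegistry;
import org.opensearch.telemetry.metrics.tags.Tags;

/**
 * Hypothetical PA collector that publishes through the core-provided MetricsRegistry
 * instead of writing to the shared-memory JSON files.
 */
public class AdmissionControlOtelCollector {

    private final Counter rejectionCounter;

    public AdmissionControlOtelCollector(MetricsRegistry metricsRegistry) {
        // Assumed telemetry-api surface: createCounter(name, description, unit).
        this.rejectionCounter = metricsRegistry.createCounter(
            "pa.admission_control.rejections",   // invented metric name
            "Requests rejected by admission control",
            "1"
        );
    }

    /** Called from the collector's existing collection loop. */
    public void onRejection(String controller) {
        rejectionCounter.add(1.0, Tags.create().addTag("controller", controller));
    }
}
```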