amazon-cloudwatch-agent
Reduce memory usage for prometheus plugins
Description of the issue
After introducing "save name before relabel" and "save instance, job before relabel", memory consumption and CPU usage have increased, since these features add extra relabel_configs and metric_relabel_configs (roughly 200MB extra in the normal case when saving job and instance before relabel, and roughly 500MB in the normal case without any other metric_relabel_config or relabel_config, at more than 10,000 metrics). Moreover, the process of creating labels for each metric within a scrape loop runs on a single thread.
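For context on where the extra memory goes: the pre-PR behavior effectively injects additional relabel rules that copy label values into "saved" labels so the originals survive user relabeling, and every scraped sample then carries those extra labels. A minimal sketch of that idea is below; the saved_* label names and the helper are hypothetical, not the agent's actual identifiers, and the relabel import path varies by Prometheus version.

```go
package main

import (
	"fmt"

	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/model/relabel"
)

// saveBeforeRelabel builds a rule that copies src into a "saved" label so the
// original value survives user relabeling; every scraped sample then carries
// one extra label per rule, which is where the additional memory goes.
func saveBeforeRelabel(src, saved string) *relabel.Config {
	return &relabel.Config{
		SourceLabels: model.LabelNames{model.LabelName(src)},
		Regex:        relabel.MustNewRegexp("(.*)"),
		TargetLabel:  saved,
		Replacement:  "$1",
		Action:       relabel.Replace,
	}
}

func main() {
	// Hypothetical saved-label names, for illustration only.
	extra := []*relabel.Config{
		saveBeforeRelabel("__name__", "saved_name"),
		saveBeforeRelabel("job", "saved_job"),
		saveBeforeRelabel("instance", "saved_instance"),
	}
	fmt.Println(len(extra), "extra relabel rules applied to every sample")
}
```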
Description of changes
- Drop unknown-type internal metrics to avoid extra metrics processing
- Instead of saving the pre-relabel metrics for each Prometheus metric, look up the metric type when appending metrics into the batch: utilize the metadata in the scrape context, as OpenTelemetry does, for better memory consumption (the context holding the metadata store is created within each scrape loop and is garbage collected afterwards). With the previous behavior, we had to use relabel_config to save the job and instance before relabel in order to look up the metadata type. See the sketch after this list.
- Look up the metadataStore in the context only once during the entire batch
- Only use the relabeled metric name when the target label is __name__
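A minimal sketch of the batch-time lookup described above, assuming the context helpers exposed by the Prometheus scrape package (ContextWithMetricMetadataStore / MetricMetadataStoreFromContext); the function and variable names are illustrative rather than the PR's actual code.

```go
package main

import (
	"context"
	"log"

	"github.com/prometheus/prometheus/scrape"
)

// buildBatch resolves metric types from the metadata store attached to the
// scrape context. The store is fetched once for the whole batch instead of
// once per metric, and nothing is saved via extra relabel_configs.
func buildBatch(ctx context.Context, names []string) map[string]string {
	store, ok := scrape.MetricMetadataStoreFromContext(ctx)
	if !ok {
		log.Println("no metadata store in scrape context")
		return nil
	}

	types := make(map[string]string, len(names))
	for _, name := range names {
		meta, found := store.GetMetadata(name)
		if !found || string(meta.Type) == "unknown" {
			// Drop unknown-typed (e.g. internal) metrics instead of
			// carrying them through extra processing.
			continue
		}
		types[name] = string(meta.Type)
	}
	return types
}
```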
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Note
- Backward compatibility is maintained between the current CWAgent and my PR
- The number of metrics is quite small (<30), so it might not show the whole picture
Tests
For my PR
- Without relabeling name, instance, and job: roughly a 50% decrease in CPU utilization from startup (with lower memory utilization during runtime) and a 5%-10% decrease in memory consumption (2.05% maximum, around 1.95% on average, with the minimum always dropping to 1.86%)
```yaml
global:
  scrape_interval: 1m
  scrape_timeout: 10s
scrape_configs:
  - job_name: cwagent-ecs-file-sd-config
    sample_limit: 10000
    file_sd_configs:
      - files: [ "/tmp/cwagent_ecs_auto_sd.yaml" ]
```
- With relabeling name, instance, and job: mostly the same memory utilization pattern as the current CWAgent (with a lower maximum memory utilization); however, roughly a 20% decrease in CPU utilization during initialization
```yaml
global:
  scrape_interval: 1m
  scrape_timeout: 10s
scrape_configs:
  - job_name: cwagent-ecs-file-sd-config
    sample_limit: 10000
    file_sd_configs:
      - files: [ "/tmp/cwagent_ecs_auto_sd.yaml" ]
    relabel_configs:
      - source_labels: [__address__]
        replacement: job_replacement
        target_label: job
    metric_relabel_configs:
      - source_labels: [ __address__ ]
        replacement: instance_replacement
        target_label: instance
      - source_labels: [__name__]
        regex: "memcached_commands_total(.*)"
        target_label: __name__
        replacement: "memcached_commands"
```
For the current CWAgent performance:
Requirements
Before committing the code, please do the following steps.
- Run `make fmt` and `make fmt-sh`
- Run `make linter`
Codecov Report
Merging #568 (ca61afc) into master (88d45d2) will decrease coverage by 0.31%. The diff coverage is 52.17%.
```diff
@@            Coverage Diff             @@
##           master     #568      +/-   ##
==========================================
- Coverage   56.98%   56.66%   -0.32%
==========================================
  Files         363      365       +2
  Lines       16947    16993      +46
==========================================
- Hits         9657     9629      -28
- Misses       6739     6813      +74
  Partials      551      551
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| plugins/inputs/prometheus_scraper/calculator.go | 0.00% <0.00%> (ø) | |
| ...ugins/inputs/prometheus_scraper/metrics_handler.go | 0.00% <ø> (ø) | |
| ...ns/inputs/prometheus_scraper/prometheus_scraper.go | 2.63% <0.00%> (+0.13%) | :arrow_up: |
| plugins/inputs/prometheus_scraper/start.go | 0.91% <0.00%> (+0.06%) | :arrow_up: |
| ...gins/inputs/prometheus_scraper/metrics_metadata.go | 50.00% <50.00%> (ø) | |
| ...ins/inputs/prometheus_scraper/metric_prometheus.go | 54.54% <54.54%> (ø) | |
| ...gins/inputs/prometheus_scraper/metrics_receiver.go | 73.33% <73.33%> (-16.87%) | :arrow_down: |
| plugins/inputs/prometheus_scraper/util.go | 100.00% <100.00%> (ø) | |
| translator/cmdutil/userutil_darwin.go | 10.52% <0.00%> (ø) | |
| ... and 3 more | | |
I can't tell what the performance difference is from the screenshots, though. Can you explain them more?
Sure, and thanks! As you already know, the OOM is caused by saving the metric name before relabel. Therefore, with my PR, I benchmarked these test cases:
- The customer's case, without relabeling the metric name
- Relabeling the metric name
For the first test case, which is the customer's use case, I noticed the following:
- CPU utilization decreases from 2.15% to 1.11% (approximately a 50% reduction when initializing the CloudWatch Agent, and a small decrease during runtime)
- Memory utilization is within the range of 1.86% to 2.05% (1.95% on average), compared with the current CloudWatch Agent's range of 1.95% to 2.15% (2.05% on average). Note that this is only a small number of metrics, so it might differ in a customer environment.
- Confirmed with the customers that, with my PR, their memory utilization is reduced from 9GB to 2.5GB (about 500MB below their original baseline, since IIRC their original state was 3GB)
For the second test case, which is the reason we introduced the PR that caused the OOM, I noticed the following:
- CPU utilization decreases from 2.15% to 1.70% (with a small decrease during runtime).
- Memory utilization is within the range of 1.95% to 2.15% (1.95% on average, hitting 2.15% less often). Again, note that this is only a small number of metrics, so it might differ in a customer environment.
**Note**: The decrease in CPU utilization might help the most: when troubleshooting with customers, increasing CPU cores made the customer's pod stabilize around 9GB (metric relabeling might be single-threaded, and more CPU would help Prometheus access the metrics in memory faster).
This PR was marked stale due to lack of activity.
@khanhntd I'm wondering if there are any plans to get this PR updated and merged any time soon?