amazon-cloudwatch-agent
Reduce memory usage for prometheus plugins
Description of the issue
After introducing "save name before relabel" and "save instance, job before relabel", memory consumption and CPU usage have increased, since these features add extra relabel_configs and metric_relabel_configs (roughly 200MB extra in the normal case when saving job and instance before relabel, and roughly 500MB in the normal case without any other metric_relabel_config or relabel_config, at more than 10,000 metrics). Moreover, the process of creating labels for each metric within a scrape loop runs on a single thread.
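For context on where the extra memory goes: the pre-PR behavior effectively injects additional relabel rules that copy label values into "saved" labels so the originals survive user relabeling, and every scraped sample then carries those extra labels. A minimal sketch of that idea is below; the saved_* label names and the helper are hypothetical, not the agent's actual identifiers, and the relabel import path varies by Prometheus version.

```go
package main

import (
	"fmt"

	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/model/relabel"
)

// saveBeforeRelabel builds a rule that copies src into a "saved" label so the
// original value survives user relabeling; every scraped sample then carries
// one extra label per rule, which is where the additional memory goes.
func saveBeforeRelabel(src, saved string) *relabel.Config {
	return &relabel.Config{
		SourceLabels: model.LabelNames{model.LabelName(src)},
		Regex:        relabel.MustNewRegexp("(.*)"),
		TargetLabel:  saved,
		Replacement:  "$1",
		Action:       relabel.Replace,
	}
}

func main() {
	// Hypothetical saved-label names, for illustration only.
	extra := []*relabel.Config{
		saveBeforeRelabel("__name__", "saved_name"),
		saveBeforeRelabel("job", "saved_job"),
		saveBeforeRelabel("instance", "saved_instance"),
	}
	fmt.Println(len(extra), "extra relabel rules applied to every sample")
}
```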
Description of changes
- Drop unknown-type internal metrics to avoid extra metrics processing
- Instead of saving the pre-relabel metrics for each Prometheus metric, look up the metric type when appending metrics into the batch: utilize the metadata in the scrape context, as OpenTelemetry does, for better memory consumption (the context holding the metadata store is created within each scrape loop and is garbage collected afterwards). With the previous behavior, we had to use relabel_config to save the job and instance before relabel in order to look up the metadata type. See the sketch after this list.
- Look up the metadataStore in the context only once during the entire batch
- Only use the relabeled metric name when the target label is __name__
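A minimal sketch of the batch-time lookup described above, assuming the context helpers exposed by the Prometheus scrape package (ContextWithMetricMetadataStore / MetricMetadataStoreFromContext); the function and variable names are illustrative rather than the PR's actual code.

```go
package main

import (
	"context"
	"log"

	"github.com/prometheus/prometheus/scrape"
)

// buildBatch resolves metric types from the metadata store attached to the
// scrape context. The store is fetched once for the whole batch instead of
// once per metric, and nothing is saved via extra relabel_configs.
func buildBatch(ctx context.Context, names []string) map[string]string {
	store, ok := scrape.MetricMetadataStoreFromContext(ctx)
	if !ok {
		log.Println("no metadata store in scrape context")
		return nil
	}

	types := make(map[string]string, len(names))
	for _, name := range names {
		meta, found := store.GetMetadata(name)
		if !found || string(meta.Type) == "unknown" {
			// Drop unknown-typed (e.g. internal) metrics instead of
			// carrying them through extra processing.
			continue
		}
		types[name] = string(meta.Type)
	}
	return types
}
```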
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Note
- Backward compatibility is maintained between the current CWAgent and my PR
- The number of metrics is quite small (<30), so it might not show the whole picture
Tests
For my PR
- Without relabeling name, instance, and job: roughly a 50% decrease in CPU utilization from startup (with lower memory utilization during runtime) and a 5%-10% decrease in memory consumption (2.05% maximum, around 1.95% on average, with the minimum always dropping to 1.86%)
```yaml
global:
  scrape_interval: 1m
  scrape_timeout: 10s
scrape_configs:
  - job_name: cwagent-ecs-file-sd-config
    sample_limit: 10000
    file_sd_configs:
      - files: [ "/tmp/cwagent_ecs_auto_sd.yaml" ]
```
- With relabeling name, instance, and job: mostly the same memory utilization pattern as the current CWAgent (with a lower maximum memory utilization); however, roughly a 20% decrease in CPU utilization during initialization
```yaml
global:
  scrape_interval: 1m
  scrape_timeout: 10s
scrape_configs:
  - job_name: cwagent-ecs-file-sd-config
    sample_limit: 10000
    file_sd_configs:
      - files: [ "/tmp/cwagent_ecs_auto_sd.yaml" ]
    relabel_configs:
      - source_labels: [__address__]
        replacement: job_replacement
        target_label: job
    metric_relabel_configs:
      - source_labels: [ __address__ ]
        replacement: instance_replacement
        target_label: instance
      - source_labels: [__name__]
        regex: "memcached_commands_total(.*)"
        target_label: __name__
        replacement: "memcached_commands"
```
For the current CWAgent performance:
Requirements
Before committing the code, please do the following steps.
- Run `make fmt` and `make fmt-sh`
- Run `make linter`
Codecov Report
Merging #568 (ca61afc) into master (88d45d2) will decrease coverage by 0.31%. The diff coverage is 52.17%.
```diff
@@            Coverage Diff             @@
##           master     #568      +/-   ##
==========================================
- Coverage   56.98%   56.66%   -0.32%
==========================================
  Files         363      365       +2
  Lines       16947    16993      +46
==========================================
- Hits         9657     9629      -28
- Misses       6739     6813      +74
  Partials      551      551
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| plugins/inputs/prometheus_scraper/calculator.go | 0.00% <0.00%> (ø) | |
| ...ugins/inputs/prometheus_scraper/metrics_handler.go | 0.00% <ø> (ø) | |
| ...ns/inputs/prometheus_scraper/prometheus_scraper.go | 2.63% <0.00%> (+0.13%) | :arrow_up: |
| plugins/inputs/prometheus_scraper/start.go | 0.91% <0.00%> (+0.06%) | :arrow_up: |
| ...gins/inputs/prometheus_scraper/metrics_metadata.go | 50.00% <50.00%> (ø) | |
| ...ins/inputs/prometheus_scraper/metric_prometheus.go | 54.54% <54.54%> (ø) | |
| ...gins/inputs/prometheus_scraper/metrics_receiver.go | 73.33% <73.33%> (-16.87%) | :arrow_down: |
| plugins/inputs/prometheus_scraper/util.go | 100.00% <100.00%> (ø) | |
| translator/cmdutil/userutil_darwin.go | 10.52% <0.00%> (ø) | |
| ... and 3 more | | |
I can't tell what the performance difference is from the screenshots, though. Can you explain them more?
Sure, and thanks! As you already know, the OOM is caused by saving the metric name before relabel. Therefore, with my PR, I benchmarked these test cases:
- The customer's case, without relabeling the metric name
- Relabeling the metric name
For the first test case, which is the customer's use case, I noticed the following:
- CPU utilization decreases from 2.15% to 1.11% (approximately a 50% reduction when initializing the CloudWatch Agent, and a small decrease during runtime)
- Memory utilization is within the range of 1.86% to 2.05% (1.95% on average), compared with the current CloudWatch Agent's range of 1.95% to 2.15% (2.05% on average). Note that this is only a small number of metrics, so it might differ in a customer environment.
- Confirmed with the customers that, with my PR, their memory utilization is reduced from 9GB to 2.5GB (about 500MB below their original baseline, since IIRC their original state was 3GB)
For the second test case, which is the reason we introduced the PR that caused the OOM, I noticed the following:
- CPU utilization decreases from 2.15% to 1.70% (with a small decrease during runtime).
- Memory utilization is within the range of 1.95% to 2.15% (1.95% on average, hitting 2.15% less often). Again, note that this is only a small number of metrics, so it might differ in a customer environment.
**Note**: The decrease in CPU utilization might help the most: when troubleshooting with customers, increasing CPU cores made the customer's pod stabilize around 9GB (metric relabeling might be single-threaded, and more CPU would help Prometheus access the metrics in memory faster).
This PR was marked stale due to lack of activity.
@khanhntd I'm wondering if there are any plans to get this PR updated and merged any time soon?