
Reduce memory usage for prometheus plugins

Open khanhntd opened this issue 2 years ago • 3 comments

Description of the issue

After introducing "save name before relabel" and "save instance/job before relabel", there has been an increase in memory consumption and CPU usage, since the change adds extra relabel_configs and metric_relabel_configs (roughly 200MB in the normal case when saving job/instance before relabel, and roughly 500MB in the normal case without any other metric_relabel_config or relabel_config, with more than 10,000 metrics). In addition, the process of creating labels for each metric within a scrape loop is single-threaded.
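To make the cost concrete, here is a minimal, self-contained Go sketch (not the agent's actual code; the `__saved_*__` label names are purely hypothetical) of why carrying the pre-relabel name, instance, and job as extra labels on every sample adds up quickly once a scrape returns 10,000+ samples:

```go
package main

import "fmt"

// label is a simplified key/value pair, standing in for a Prometheus label.
type label struct {
	name, value string
}

// approxBytes is a rough per-sample estimate of label payload size
// (string bytes only; struct and slice overhead are ignored).
func approxBytes(lbls []label) int {
	n := 0
	for _, l := range lbls {
		n += len(l.name) + len(l.value)
	}
	return n
}

func main() {
	// A typical sample's label set after scraping.
	base := []label{
		{"__name__", "memcached_commands_total"},
		{"instance", "10.0.12.34:9150"},
		{"job", "cwagent-ecs-file-sd-config"},
		{"command", "get"},
	}

	// Hypothetical extra labels that remember the pre-relabel values.
	// The label names are illustrative only, not the agent's internals.
	saved := append(append([]label{}, base...),
		label{"__saved_name__", "memcached_commands_total"},
		label{"__saved_instance__", "10.0.12.34:9150"},
		label{"__saved_job__", "cwagent-ecs-file-sd-config"},
	)

	perSample := approxBytes(saved) - approxBytes(base)
	fmt.Printf("extra bytes per sample: %d\n", perSample)
	fmt.Printf("extra bytes per scrape of 10,000 samples: ~%d KB\n", perSample*10000/1024)
}
```

Multiplied across targets and the samples held in memory per scrape loop, this per-sample overhead is consistent with the few-hundred-MB increases described above.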

Description of changes

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Note

  • Backward compatibility is maintained between the current CWAgent and my PR
  • The number of metrics in these tests is quite small (<30), so the results might not show the whole picture

Tests

For my PR

  • Without relabeling name, instance, and job: results in a 50% decrease in CPU utilization at startup (with lower memory utilization during runtime) and a 5%-10% decrease in memory consumption (2.05% maximum, ~1.95% average, with a minimum of 1.86%)
        global:
          scrape_interval: 1m
          scrape_timeout: 10s
        scrape_configs:
          - job_name: cwagent-ecs-file-sd-config
            sample_limit: 10000
            file_sd_configs:
              - files: [ "/tmp/cwagent_ecs_auto_sd.yaml" ]


  • With relabeling name, instance, and job: mostly the same memory utilization pattern as the current CWAgent (with a lower maximum), but a 20% decrease in CPU utilization during initialization (see the sketch after this config)
        global:
          scrape_interval: 1m
          scrape_timeout: 10s
        scrape_configs:
          - job_name: cwagent-ecs-file-sd-config
            sample_limit: 10000
            file_sd_configs:
              - files: [ "/tmp/cwagent_ecs_auto_sd.yaml" ]
            relabel_configs:
              - source_labels: [__address__]
                replacement: job_replacement
                target_label: job
            metric_relabel_configs:
              - source_labels: [ __address__ ]
                replacement: instance_replacement
                target_label: instance
              - source_labels: [__name__]
                regex: "memcached_commands_total(.*)"
                target_label: __name__
                replacement: "memcached_commands"


For the current CWAgent performance: (screenshots)

Requirements

Before committing the code, please do the following steps:

  1. Run make fmt and make fmt-sh
  2. Run make linter

khanhntd avatar Aug 19 '22 17:08 khanhntd

Codecov Report

Merging #568 (ca61afc) into master (88d45d2) will decrease coverage by 0.31%. The diff coverage is 52.17%.

@@            Coverage Diff             @@
##           master     #568      +/-   ##
==========================================
- Coverage   56.98%   56.66%   -0.32%     
==========================================
  Files         363      365       +2     
  Lines       16947    16993      +46     
==========================================
- Hits         9657     9629      -28     
- Misses       6739     6813      +74     
  Partials      551      551              
Impacted Files Coverage Δ
plugins/inputs/prometheus_scraper/calculator.go 0.00% <0.00%> (ø)
...ugins/inputs/prometheus_scraper/metrics_handler.go 0.00% <ø> (ø)
...ns/inputs/prometheus_scraper/prometheus_scraper.go 2.63% <0.00%> (+0.13%) :arrow_up:
plugins/inputs/prometheus_scraper/start.go 0.91% <0.00%> (+0.06%) :arrow_up:
...gins/inputs/prometheus_scraper/metrics_metadata.go 50.00% <50.00%> (ø)
...ins/inputs/prometheus_scraper/metric_prometheus.go 54.54% <54.54%> (ø)
...gins/inputs/prometheus_scraper/metrics_receiver.go 73.33% <73.33%> (-16.87%) :arrow_down:
plugins/inputs/prometheus_scraper/util.go 100.00% <100.00%> (ø)
translator/cmdutil/userutil_darwin.go 10.52% <0.00%> (ø)
... and 3 more


codecov-commenter avatar Aug 20 '22 05:08 codecov-commenter

I can't tell what the performance difference is based on the screenshots, though. Can you explain them more?

Sure, and thanks! As you already know, the OOM is caused by saving the metric name before relabel. Therefore, with my PR, I benchmarked these test cases:

  • The customer's case, without relabeling the metric name
  • Relabeling the metric name

For the first test case, which is the customer's use case, I noticed the following:

  • The CPU utilization decreases from 2.15% to 1.11% (approximately a 50% reduction when initializing CloudWatchAgent, and a small decrease during runtime)
  • The memory utilization is within the range of 1.86% to 2.05% (1.95% on average), compared with the current CloudWatchAgent's range of 1.95% to 2.15% (2.05% on average) - note that this covers only a small number of metrics, so it might differ in customer environments
  • Confirmed with the customers that, with my PR, their memory utilization is reduced from 9GB to 2.5GB (within about 500MB of their original state, which IIRC was 3GB)

For the second test case, which is the reason the PR that caused the OOM was introduced in the first place, I noticed the following:

  • The CPU utilization decreases from 2.15% to 1.70% (with a small decrease during runtime).
  • The memory utilization is within the range of 1.95% to 2.15% (1.95% on average, hitting 2.15% less often) - again, note that this covers only a small number of metrics, so it might differ in customer environments.

**Note**: The decrease in CPU utilization might help the most: when troubleshooting with customers, increasing CPU cores made the customer's pod stabilize at around 9GB (relabeling metrics might be single-threaded, and more CPU would help Prometheus access the metrics in memory faster).
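To illustrate that last point, here is a small Go sketch (an illustration only, not what the agent or Prometheus does today) of how the per-metric label work could be fanned out over a worker pool so that extra CPU cores are actually used, assuming each metric's relabel work is independent:

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"sync"
)

// processLabels stands in for the per-metric label/relabel work that the
// scrape loop currently performs on a single goroutine.
func processLabels(name string) string {
	return strings.ToLower(name) // placeholder for the real relabel work
}

// processAll fans the per-metric work out over a small worker pool, assuming
// each metric can be processed independently of the others.
func processAll(names []string) []string {
	out := make([]string, len(names))
	jobs := make(chan int)
	var wg sync.WaitGroup

	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				out[i] = processLabels(names[i]) // each index is written by exactly one worker
			}
		}()
	}
	for i := range names {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}

func main() {
	fmt.Println(processAll([]string{"Memcached_Commands_Total", "Memcached_Up"}))
}
```

With single-threaded relabeling, extra cores only help indirectly (GC, memory access), which would explain why adding CPU stabilized the pod rather than eliminating the pressure.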

khanhntd avatar Aug 24 '22 01:08 khanhntd

This PR was marked stale due to lack of activity.

github-actions[bot] avatar Sep 01 '22 00:09 github-actions[bot]

@khanhntd I'm wondering if there are any plans to get this PR updated and merged any time soon?

ashevtsov-wawa avatar May 05 '23 16:05 ashevtsov-wawa