Memory footprint optimization
Right now the PODs consuming the most resources are ES, Kibana and Prometheus:
Top PODs by memory

```shell
kubectl top pods -A --sort-by='memory'
NAMESPACE            NAME                                                  CPU(cores)   MEMORY(bytes)
k3s-logging          efk-es-default-0                                      165m         1432Mi
k3s-monitoring       prometheus-kube-prometheus-stack-prometheus-0         335m         1168Mi
k3s-logging          efk-kb-66f7cdc8bd-hc4mb                               222m         544Mi
k3s-monitoring       kube-prometheus-stack-grafana-645d765844-m4q5r        55m          207Mi
k3s-logging          fluentd-69bff468c6-f9dgc                              37m          133Mi
linkerd              linkerd-destination-787d745598-jh6pj                  9m           69Mi
longhorn-system      longhorn-manager-wmpx4                                32m          64Mi
longhorn-system      longhorn-manager-wmpj8                                20m          62Mi
longhorn-system      longhorn-manager-r22m9                                27m          57Mi
velero-system        velero-574c749cb6-tlwwl                               4m           52Mi
certmanager-system   certmanager-cert-manager-cainjector-78cd7b475-qc7vm   3m           50Mi
longhorn-system      instance-manager-r-5be28d03                           34m          50Mi
longhorn-system      instance-manager-r-d15d9bab                           33m          49Mi
```
Top PODs by cpu

```shell
kubectl top pods -A --sort-by='cpu'
NAMESPACE            NAME                                                  CPU(cores)   MEMORY(bytes)
k3s-monitoring       prometheus-kube-prometheus-stack-prometheus-0         236m         1199Mi
k3s-logging          efk-kb-66f7cdc8bd-hc4mb                               226m         544Mi
k3s-logging          efk-es-default-0                                      144m         1434Mi
longhorn-system      instance-manager-e-6cc59123                           81m          30Mi
longhorn-system      instance-manager-e-aa8e1208                           67m          28Mi
longhorn-system      instance-manager-r-04a02b6c                           36m          38Mi
longhorn-system      instance-manager-r-d15d9bab                           34m          48Mi
longhorn-system      instance-manager-r-5be28d03                           33m          50Mi
longhorn-system      longhorn-manager-wmpx4                                29m          65Mi
k3s-logging          fluentd-69bff468c6-f9dgc                              28m          133Mi
kube-system          metrics-server-668d979685-rdgcn                       25m          23Mi
k3s-monitoring       kube-prometheus-stack-grafana-645d765844-m4q5r        24m          203Mi
longhorn-system      longhorn-manager-wmpj8                                24m          64Mi
```
Some changes to try
- [x] For ES/Kibana, adjust the amount of JVM heap assigned, following the procedure in the documentation (https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-managing-compute-resources.html).
- [ ] For Prometheus, increase the scraping interval of Linkerd metrics from the current 10 seconds to the default 30 seconds.
- [ ] Review metrics: remove duplicates
About linkerd grafana dashboards
Grafana dashboards use Prometheus' `irate` function with a 30 s range (`irate(<metric>[30s])`). The `irate` function calculates the per-second instant rate of increase of the time series, taking only the last two samples in the range. This is no longer possible if the scrape interval is increased to 30 s. The dashboards need to be modified; Grafana's `$__rate_interval` variable can probably be used here (see the sketch below).
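A sketch of that change, using `rate` with `$__rate_interval` as Grafana recommends; the metric name is only an example, not necessarily one used by the Linkerd dashboards:

```promql
# Before: breaks once the scrape interval is 30s (fewer than two samples in the 30s range)
irate(request_total[30s])

# After: $__rate_interval is always at least four times the data source's scrape interval,
# so rate() keeps getting enough samples if the scrape interval changes
rate(request_total[$__rate_interval])
```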
Applying resource limits to ElasticSearch
According to the ECK documentation, "The heap size of the JVM is automatically calculated based on the node roles and the available memory. The available memory is defined by the value of `resources.limits.memory` set on the elasticsearch container in the Pod template, or the available memory on the Kubernetes node if no limit is set."
By default, the `resources.limits.memory` for ElasticSearch is 2Gi.
Reducing the limit to 1Gi:
```yaml
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: efk
  namespace: k3s-logging
spec:
  version: 8.1.3
  http: # Making elasticsearch service available from outside the cluster
    tls:
      selfSignedCertificate:
        disabled: true
  nodeSets:
    - name: default
      count: 1
      config:
        node.store.allow_mmap: false
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 5Gi
            storageClassName: longhorn
      podTemplate:
        spec:
          # Limiting Resources consumption
          containers:
            - name: elasticsearch
              resources:
                requests:
                  memory: 1Gi
                limits:
                  memory: 1Gi
```
The result is that Elasticsearch memory consumption is reduced by about 500 MB.
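The checklist also mentions Kibana. A similar limit could be set on the ECK-managed Kibana resource through its pod template; the following is only a sketch, and the 500Mi value is an assumption that would need to be validated against actual usage:

```yaml
---
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: efk
  namespace: k3s-logging
spec:
  version: 8.1.3
  count: 1
  elasticsearchRef:
    name: efk
  podTemplate:
    spec:
      containers:
        - name: kibana
          resources:
            requests:
              memory: 500Mi   # assumed value, tune after observing actual usage
            limits:
              memory: 500Mi
```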
About increasing scraping interval for linkerd metrics
Using the default 30 s scrape interval instead of 10 s has no noticeable effect on memory or CPU consumption.
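For reference, with kube-prometheus-stack the scrape interval is set per ServiceMonitor endpoint. A minimal sketch; the object name, labels and port are assumptions and may not match the actual Linkerd ServiceMonitors in this cluster:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: linkerd-proxy          # assumed name, check the real ServiceMonitor objects
  namespace: k3s-monitoring
spec:
  selector:
    matchLabels:
      linkerd.io/control-plane-ns: linkerd   # assumed label selector
  namespaceSelector:
    any: true
  endpoints:
    - port: linkerd-admin      # assumed port name exposing the proxy /metrics endpoint
      path: /metrics
      interval: 30s            # default interval instead of the previous 10s
```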
Analyzing Prometheus memory usage
Prometheus Storage (tsdb)
When Prometheus scrapes a target, it retrieves thousands of metrics, which are compacted into chunks and stored in blocks before being written to disk. Only the head block is writable; all other blocks are immutable. By default, a block contains 2 hours of data.
The head block is the in-memory part of the database, while the remaining blocks are persistent, immutable blocks on disk. To prevent data loss, all incoming data is also written to a Write-Ahead Log (WAL) for durable writes, so the in-memory database can be repopulated on restart in case of failure. An incoming sample first goes into the head block and stays in memory for a while, then it is flushed to disk and memory-mapped. When these memory-mapped chunks or the in-memory chunks get old enough, they are flushed to disk as persistent blocks. Multiple blocks are then merged as they get older (compaction) and finally deleted once they go beyond the retention period.
While the head block is kept in memory, blocks containing older metrics are accessed through mmap(). This system call works like swap: it maps a memory region to a file, so the content of the database can be treated as if it were in memory without occupying physical RAM.
All the blocks present in the TSDB are memory-mapped on startup.
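This layout can be seen in the Prometheus data directory inside the pod: immutable block directories (named with ULIDs), the memory-mapped head chunks and the WAL. A sketch; the block directory names below are made-up examples:

```shell
kubectl exec -n k3s-monitoring prometheus-kube-prometheus-stack-prometheus-0 \
  -c prometheus -- ls /prometheus
01G8ABCDEF0123456789ABCDEF   # immutable 2h (or compacted) block: chunks/, index, meta.json, tombstones
01G8ABCDEF0123456789ABCDEG
chunks_head                  # memory-mapped chunks of the head block
queries.active
wal                          # write-ahead log, replayed on restart
```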
Compactions
The head block is flushed to disk periodically, while at the same time, compactions to merge a few blocks together are performed to avoid needing to scan too many blocks for queries.
The initial two-hour blocks are eventually compacted into longer blocks in the background. Compaction will create larger blocks containing data spanning up to 10% of the retention time, or 31 days, whichever is smaller.
The WAL files are only deleted once the head chunk has been flushed to disk.
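In this cluster Prometheus is deployed with kube-prometheus-stack, so retention, which bounds how large compacted blocks can get, is configured through the Helm values. A sketch with assumed values, not the ones currently deployed:

```yaml
# kube-prometheus-stack Helm values (excerpt)
prometheus:
  prometheusSpec:
    retention: 7d          # assumed retention; compacted blocks span at most 10% of this (or 31d)
    retentionSize: 5GB     # optional size-based limit, also an assumed value
```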
Memory consumed by Prometheus POD
Prometheus exposes Go profiling endpoints (pprof), so we can use the Go profiling tool to see how Prometheus's different goroutines consume memory (Go heap memory).
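The heap profile endpoint can be reached from a workstation by port-forwarding the Prometheus service first (a sketch, assuming the kube-prometheus-stack service name) and then pointing `go tool pprof` at localhost:

```shell
# Forward the Prometheus HTTP port locally (assumed service name created by kube-prometheus-stack)
kubectl port-forward -n k3s-monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Then, in another terminal:
#   go tool pprof -symbolize=remote -inuse_space http://localhost:9090/debug/pprof/heap
```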
```shell
go tool pprof -symbolize=remote -inuse_space https://monitoring.prod.cloud.coveo.com/debug/pprof/heap
File: prometheus
Type: inuse_space
Time: Aug 14, 2022 at 10:10am (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 366.56MB, 77.54% of 472.77MB total
Dropped 238 nodes (cum <= 2.36MB)
Showing top 10 nodes out of 96
      flat  flat%   sum%        cum   cum%
   99.95MB 21.14% 21.14%    99.95MB 21.14%  github.com/prometheus/prometheus/scrape.newScrapePool.func1
   70.55MB 14.92% 36.06%    70.55MB 14.92%  github.com/prometheus/prometheus/model/labels.(*Builder).Labels
   32.01MB  6.77% 42.83%    61.01MB 12.91%  github.com/prometheus/prometheus/tsdb/record.(*Decoder).Series
   29.51MB  6.24% 49.07%    29.51MB  6.24%  github.com/prometheus/prometheus/model/textparse.(*PromParser).Metric
   29.51MB  6.24% 55.32%    39.01MB  8.25%  github.com/prometheus/prometheus/tsdb.newMemSeries
      29MB  6.13% 61.45%       29MB  6.13%  github.com/prometheus/prometheus/tsdb/encoding.(*Decbuf).UvarintStr (inline)
   23.71MB  5.02% 66.47%    23.71MB  5.02%  github.com/prometheus/prometheus/tsdb/index.(*MemPostings).Delete
   20.50MB  4.34% 70.80%    20.50MB  4.34%  github.com/prometheus/prometheus/tsdb/chunkenc.NewXORChunk
   20.20MB  4.27% 75.07%    20.20MB  4.27%  github.com/prometheus/prometheus/scrape.(*scrapeCache).trackStaleness
   11.63MB  2.46% 77.54%    11.63MB  2.46%  github.com/prometheus/prometheus/scrape.(*scrapeCache).addRef
```
IMPORTANT NOTE: RSS > (Go Heap In-use memory)
The RSS (Resident Set Size) of the Prometheus container, as reported by the OS, does not match the memory usage reported by the Go memory profiling tool (inuse_space).
There are various reasons for this:
- RSS includes more than just Go heap memory usage (reported by profiling tool). It includes the memory used by goroutine stacks, the program executable, shared libraries as well as memory allocated by C functions.
- The GC (Go Garbage Collector) may decide to not return free memory to the OS immediately, but this should be a lesser issue after runtime changes in Go 1.16.
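This difference can also be seen directly in the metrics already scraped from the Prometheus pod. A sketch comparing the container working set (close to RSS) with the Go heap in use; the label values are assumptions based on this cluster's pod and job names:

```promql
# Container memory (cgroup working set, reported by cAdvisor/kubelet)
container_memory_working_set_bytes{namespace="k3s-monitoring", pod="prometheus-kube-prometheus-stack-prometheus-0", container="prometheus"}

# Go heap actually in use, reported by Prometheus' own Go runtime metrics
go_memstats_heap_inuse_bytes{job="kube-prometheus-stack-prometheus"}
```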
Reduce number of scraped time series
The most effective way to reduce Prometheus' memory footprint is to reduce the number of scraped metrics, since Prometheus keeps 2 hours of metric time series in memory. This reduction can be achieved by removing duplicate and/or unused metrics.
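Unused metrics can be dropped at scrape time with metric relabeling, so they never reach the TSDB. A sketch using a kube-prometheus-stack ServiceMonitor; the object and the metric name pattern are hypothetical examples:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # hypothetical ServiceMonitor
  namespace: k3s-monitoring
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics
      interval: 30s
      metricRelabelings:
        # Drop metrics not used by any dashboard or alert (example pattern)
        - sourceLabels: [__name__]
          regex: 'example_app_request_duration_seconds_bucket'
          action: drop
```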
References
- Prometheus - Storage
- Prometheus TSDB - Blog explanation
- Prometheus - Investigation on high memory consumption
- Debugging Prometheus Memory Usage
- Anatomy of a Program in Memory
- Memory Measurements Complexities and Considerations - Buffers and (File Page) Cache
- Memory Measurements Complexities and Considerations - Kubernetes and Containers
- Kubernetes Node swap support
- Go profiler notes
About K3S duplicate metrics
Memory footprint reduction is achieved by removing all duplicate metrics from K3S monitoring. See issue #67.
The current K3S duplicates come from monitoring the kube-proxy, kubelet and apiserver components (kube-controller-manager and kube-scheduler monitoring was already removed, see issue #22). A sketch of the idea follows.
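One way to remove the duplicated scrapes is to disable the per-component scrape jobs in the kube-prometheus-stack Helm values and keep a single scrape, since in K3S all components run embedded in the same process and each metrics endpoint exposes the merged metrics of all of them. The excerpt below is only a sketch of that idea; the exact change applied to this cluster is tracked in issue #67:

```yaml
# kube-prometheus-stack Helm values (excerpt) - sketch, not the exact change from issue #67
kubeApiServer:
  enabled: false
kubeProxy:
  enabled: false
kubelet:
  enabled: true    # keep a single scrape; in K3S this endpoint already exposes the merged metrics
kubeControllerManager:
  enabled: false   # already disabled (issue #22)
kubeScheduler:
  enabled: false   # already disabled (issue #22)
```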
Before removing K3S duplicates (Grafana panels: # Active Series, Memory Usage):

- Number of active time series: 157k
- Memory usage: 1 GB
After removing duplicates (same Grafana panels):

- Number of active time series: 73k
- Memory usage: 550 MB
The number of active time series has been reduced from 157k to 73k, and memory consumption from 1 GB to 550 MB, roughly a 50% reduction in both.