Memory footprint optimization
Right now the PODs consuming the most resources are ES, Kibana and Prometheus:
Top PODs by memory

```shell
kubectl top pods -A --sort-by='memory'
NAMESPACE            NAME                                                  CPU(cores)   MEMORY(bytes)
k3s-logging          efk-es-default-0                                      165m         1432Mi
k3s-monitoring       prometheus-kube-prometheus-stack-prometheus-0         335m         1168Mi
k3s-logging          efk-kb-66f7cdc8bd-hc4mb                               222m         544Mi
k3s-monitoring       kube-prometheus-stack-grafana-645d765844-m4q5r        55m          207Mi
k3s-logging          fluentd-69bff468c6-f9dgc                              37m          133Mi
linkerd              linkerd-destination-787d745598-jh6pj                  9m           69Mi
longhorn-system      longhorn-manager-wmpx4                                32m          64Mi
longhorn-system      longhorn-manager-wmpj8                                20m          62Mi
longhorn-system      longhorn-manager-r22m9                                27m          57Mi
velero-system        velero-574c749cb6-tlwwl                               4m           52Mi
certmanager-system   certmanager-cert-manager-cainjector-78cd7b475-qc7vm   3m           50Mi
longhorn-system      instance-manager-r-5be28d03                           34m          50Mi
longhorn-system      instance-manager-r-d15d9bab                           33m          49Mi
```
Top PODs by cpu

```shell
kubectl top pods -A --sort-by='cpu'
NAMESPACE            NAME                                                  CPU(cores)   MEMORY(bytes)
k3s-monitoring       prometheus-kube-prometheus-stack-prometheus-0         236m         1199Mi
k3s-logging          efk-kb-66f7cdc8bd-hc4mb                               226m         544Mi
k3s-logging          efk-es-default-0                                      144m         1434Mi
longhorn-system      instance-manager-e-6cc59123                           81m          30Mi
longhorn-system      instance-manager-e-aa8e1208                           67m          28Mi
longhorn-system      instance-manager-r-04a02b6c                           36m          38Mi
longhorn-system      instance-manager-r-d15d9bab                           34m          48Mi
longhorn-system      instance-manager-r-5be28d03                           33m          50Mi
longhorn-system      longhorn-manager-wmpx4                                29m          65Mi
k3s-logging          fluentd-69bff468c6-f9dgc                              28m          133Mi
kube-system          metrics-server-668d979685-rdgcn                       25m          23Mi
k3s-monitoring       kube-prometheus-stack-grafana-645d765844-m4q5r        24m          203Mi
longhorn-system      longhorn-manager-wmpj8                                24m          64Mi
```
Some changes to try
- [x] For ES/Kibana, adjust the amount of JVM heap assigned, following the procedure in the documentation (https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-managing-compute-resources.html).
- [ ] For Prometheus, increase the scraping interval of Linkerd metrics from the current 10 seconds to the default 30 seconds.
- [ ] Review metrics: remove duplicates
About linkerd grafana dashboards
Grafana dashboards use Prometheus' `irate` function with a 30 s range (`irate(<metric>[30s])`). The `irate` function calculates the per-second instant rate of increase of the time series, taking only the last two samples in the range. This is no longer possible if the scrape interval is increased to 30 s. The dashboards need to be modified; Grafana's `$__rate_interval` variable can probably be used here (see the sketch below).
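A sketch of that change, using `rate` with `$__rate_interval` as Grafana recommends; the metric name is only an example, not necessarily one used by the Linkerd dashboards:

```promql
# Before: breaks once the scrape interval is 30s (fewer than two samples in the 30s range)
irate(request_total[30s])

# After: $__rate_interval is always at least four times the data source's scrape interval,
# so rate() keeps getting enough samples if the scrape interval changes
rate(request_total[$__rate_interval])
```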
Applying resource limits to ElasticSearch
According to the ECK documentation, "The heap size of the JVM is automatically calculated based on the node roles and the available memory. The available memory is defined by the value of `resources.limits.memory` set on the elasticsearch container in the Pod template, or the available memory on the Kubernetes node if no limit is set."
By default, the `resources.limits.memory` for ElasticSearch is 2Gi.
Reducing the limit to 1Gi:
```yaml
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: efk
  namespace: k3s-logging
spec:
  version: 8.1.3
  http: # Making elasticsearch service available from outside the cluster
    tls:
      selfSignedCertificate:
        disabled: true
  nodeSets:
    - name: default
      count: 1
      config:
        node.store.allow_mmap: false
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 5Gi
            storageClassName: longhorn
      podTemplate:
        spec:
          # Limiting Resources consumption
          containers:
            - name: elasticsearch
              resources:
                requests:
                  memory: 1Gi
                limits:
                  memory: 1Gi
```
The result is that Elasticsearch memory consumption is reduced by about 500 MB.
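The checklist also mentions Kibana. A similar limit could be set on the ECK-managed Kibana resource through its pod template; the following is only a sketch, and the 500Mi value is an assumption that would need to be validated against actual usage:

```yaml
---
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: efk
  namespace: k3s-logging
spec:
  version: 8.1.3
  count: 1
  elasticsearchRef:
    name: efk
  podTemplate:
    spec:
      containers:
        - name: kibana
          resources:
            requests:
              memory: 500Mi   # assumed value, tune after observing actual usage
            limits:
              memory: 500Mi
```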
About increasing scraping interval for linkerd metrics
Using the default 30 s scrape interval instead of 10 s has no noticeable effect on memory or CPU consumption.
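For reference, with kube-prometheus-stack the scrape interval is set per ServiceMonitor endpoint. A minimal sketch; the object name, labels and port are assumptions and may not match the actual Linkerd ServiceMonitors in this cluster:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: linkerd-proxy          # assumed name, check the real ServiceMonitor objects
  namespace: k3s-monitoring
spec:
  selector:
    matchLabels:
      linkerd.io/control-plane-ns: linkerd   # assumed label selector
  namespaceSelector:
    any: true
  endpoints:
    - port: linkerd-admin      # assumed port name exposing the proxy /metrics endpoint
      path: /metrics
      interval: 30s            # default interval instead of the previous 10s
```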
Analyzing Prometheus memory usage
Prometheus Storage (tsdb)
When Prometheus scrapes a target, it retrieves thousands of metrics, which are compacted into chunks and stored in blocks before being written to disk. Only the head block is writable; all other blocks are immutable. By default, a block contains 2 hours of data.
The head block is the in-memory part of the database, while the remaining blocks are persistent, immutable blocks on disk. To prevent data loss, all incoming data is also written to a Write-Ahead Log (WAL) for durable writes, so the in-memory database can be repopulated on restart in case of failure. An incoming sample first goes into the head block and stays in memory for a while, then it is flushed to disk and memory-mapped. When these memory-mapped chunks or the in-memory chunks get old enough, they are flushed to disk as persistent blocks. Multiple blocks are then merged as they get older (compaction) and finally deleted once they go beyond the retention period.
While the head block is kept in memory, blocks containing older metrics are accessed through mmap(). This system call works like swap: it maps a memory region to a file, so the content of the database can be treated as if it were in memory without occupying physical RAM.
All the blocks present in the TSDB are memory-mapped on startup.
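This layout can be seen in the Prometheus data directory inside the pod: immutable block directories (named with ULIDs), the memory-mapped head chunks and the WAL. A sketch; the block directory names below are made-up examples:

```shell
kubectl exec -n k3s-monitoring prometheus-kube-prometheus-stack-prometheus-0 \
  -c prometheus -- ls /prometheus
01G8ABCDEF0123456789ABCDEF   # immutable 2h (or compacted) block: chunks/, index, meta.json, tombstones
01G8ABCDEF0123456789ABCDEG
chunks_head                  # memory-mapped chunks of the head block
queries.active
wal                          # write-ahead log, replayed on restart
```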
Compactions
The head block is flushed to disk periodically, while at the same time, compactions to merge a few blocks together are performed to avoid needing to scan too many blocks for queries.
The initial two-hour blocks are eventually compacted into longer blocks in the background. Compaction will create larger blocks containing data spanning up to 10% of the retention time, or 31 days, whichever is smaller.
The WAL files are only deleted once the head chunk has been flushed to disk.
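In this cluster Prometheus is deployed with kube-prometheus-stack, so retention, which bounds how large compacted blocks can get, is configured through the Helm values. A sketch with assumed values, not the ones currently deployed:

```yaml
# kube-prometheus-stack Helm values (excerpt)
prometheus:
  prometheusSpec:
    retention: 7d          # assumed retention; compacted blocks span at most 10% of this (or 31d)
    retentionSize: 5GB     # optional size-based limit, also an assumed value
```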
Memory consumed by Prometheus POD
Prometheus exposes Go profiling endpoints (pprof), so we can use the Go profiling tool to see how Prometheus's different goroutines consume memory (Go heap memory).
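The heap profile endpoint can be reached from a workstation by port-forwarding the Prometheus service first (a sketch, assuming the kube-prometheus-stack service name) and then pointing `go tool pprof` at localhost:

```shell
# Forward the Prometheus HTTP port locally (assumed service name created by kube-prometheus-stack)
kubectl port-forward -n k3s-monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Then, in another terminal:
#   go tool pprof -symbolize=remote -inuse_space http://localhost:9090/debug/pprof/heap
```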
```shell
go tool pprof -symbolize=remote -inuse_space https://monitoring.prod.cloud.coveo.com/debug/pprof/heap
File: prometheus
Type: inuse_space
Time: Aug 14, 2022 at 10:10am (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 366.56MB, 77.54% of 472.77MB total
Dropped 238 nodes (cum <= 2.36MB)
Showing top 10 nodes out of 96
      flat  flat%   sum%        cum   cum%
   99.95MB 21.14% 21.14%    99.95MB 21.14%  github.com/prometheus/prometheus/scrape.newScrapePool.func1
   70.55MB 14.92% 36.06%    70.55MB 14.92%  github.com/prometheus/prometheus/model/labels.(*Builder).Labels
   32.01MB  6.77% 42.83%    61.01MB 12.91%  github.com/prometheus/prometheus/tsdb/record.(*Decoder).Series
   29.51MB  6.24% 49.07%    29.51MB  6.24%  github.com/prometheus/prometheus/model/textparse.(*PromParser).Metric
   29.51MB  6.24% 55.32%    39.01MB  8.25%  github.com/prometheus/prometheus/tsdb.newMemSeries
      29MB  6.13% 61.45%       29MB  6.13%  github.com/prometheus/prometheus/tsdb/encoding.(*Decbuf).UvarintStr (inline)
   23.71MB  5.02% 66.47%    23.71MB  5.02%  github.com/prometheus/prometheus/tsdb/index.(*MemPostings).Delete
   20.50MB  4.34% 70.80%    20.50MB  4.34%  github.com/prometheus/prometheus/tsdb/chunkenc.NewXORChunk
   20.20MB  4.27% 75.07%    20.20MB  4.27%  github.com/prometheus/prometheus/scrape.(*scrapeCache).trackStaleness
   11.63MB  2.46% 77.54%    11.63MB  2.46%  github.com/prometheus/prometheus/scrape.(*scrapeCache).addRef
```
IMPORTANT NOTE: RSS > (Go Heap In-use memory)
The RSS (Resident Set Size) of the Prometheus container, as reported by the OS, does not match the memory usage reported by the Go memory profiling tool (inuse_space).
There are various reasons for this:
- RSS includes more than just Go heap memory usage (reported by profiling tool). It includes the memory used by goroutine stacks, the program executable, shared libraries as well as memory allocated by C functions.
- The GC (Go Garbage Collector) may decide to not return free memory to the OS immediately, but this should be a lesser issue after runtime changes in Go 1.16.
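This difference can also be seen directly in the metrics already scraped from the Prometheus pod. A sketch comparing the container working set (close to RSS) with the Go heap in use; the label values are assumptions based on this cluster's pod and job names:

```promql
# Container memory (cgroup working set, reported by cAdvisor/kubelet)
container_memory_working_set_bytes{namespace="k3s-monitoring", pod="prometheus-kube-prometheus-stack-prometheus-0", container="prometheus"}

# Go heap actually in use, reported by Prometheus' own Go runtime metrics
go_memstats_heap_inuse_bytes{job="kube-prometheus-stack-prometheus"}
```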
Reduce number of scraped time series
The most effective way to reduce Prometheus' memory footprint is to reduce the number of scraped metrics, since Prometheus keeps 2 hours of metric time series in memory. This reduction can be achieved by removing duplicate and/or unused metrics.
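Unused metrics can be dropped at scrape time with metric relabeling, so they never reach the TSDB. A sketch using a kube-prometheus-stack ServiceMonitor; the object and the metric name pattern are hypothetical examples:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # hypothetical ServiceMonitor
  namespace: k3s-monitoring
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics
      interval: 30s
      metricRelabelings:
        # Drop metrics not used by any dashboard or alert (example pattern)
        - sourceLabels: [__name__]
          regex: 'example_app_request_duration_seconds_bucket'
          action: drop
```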
References
- Prometheus - Storage
- Prometheus TSDB - Blog explanation
- Prometheus - Investigation on high memory consumption
- Debugging Prometheus Memory Usage
- Anatomy of a Program in Memory
- Memory Measurements Complexities and Considerations - Buffers and (File Page) Cache
- Memory Measurements Complexities and Considerations - Kubernetes and Containers
- Kubernetes Node swap support
- Go profiler notes
About K3S duplicate metrics
Memory footprint reduction is achieved by removing all duplicate metrics from K3S monitoring. See issue #67.
The current K3S duplicates come from monitoring the kube-proxy, kubelet and apiserver components (kube-controller-manager and kube-scheduler monitoring was already removed, see issue #22). A sketch of the idea follows.
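One way to remove the duplicated scrapes is to disable the per-component scrape jobs in the kube-prometheus-stack Helm values and keep a single scrape, since in K3S all components run embedded in the same process and each metrics endpoint exposes the merged metrics of all of them. The excerpt below is only a sketch of that idea; the exact change applied to this cluster is tracked in issue #67:

```yaml
# kube-prometheus-stack Helm values (excerpt) - sketch, not the exact change from issue #67
kubeApiServer:
  enabled: false
kubeProxy:
  enabled: false
kubelet:
  enabled: true    # keep a single scrape; in K3S this endpoint already exposes the merged metrics
kubeControllerManager:
  enabled: false   # already disabled (issue #22)
kubeScheduler:
  enabled: false   # already disabled (issue #22)
```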
Before removing K3S duplicates (Grafana panels: # Active Series, Memory Usage):

- Number of active time series: 157k
- Memory usage: 1 GB
After removing duplicates (same Grafana panels):

- Number of active time series: 73k
- Memory usage: 550 MB
The number of active time series has been reduced from 157k to 73k, and memory consumption from 1 GB to 550 MB, roughly a 50% reduction in both.