prometheus_remote_write memory leak
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
Vector's prometheus_remote_write sink appears to have a memory leak.
Memory usage (screenshot)
Total memory usage by component (screenshot)
- prometheus_remote_write memory usage is increasing linearly.
I am running two separate Vector deployments: one that parses logs and creates metrics, and one that sends logs to Loki. Of these, only the Vector pods that use prometheus_remote_write are hitting OOM as shown above.
Please advise on how I can track down this issue.
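In case it helps with tracking, here is a minimal sketch (not part of my original setup) of how the existing internal_metrics source could be exposed on a local prometheus_exporter sink, so that per-component metrics such as buffer_byte_size, buffer_events, and utilization can be scraped directly and compared against container memory. The listen address below is just an example:

sinks:
  internal_debug:
    type: prometheus_exporter
    inputs: ["vector_metrics"]
    address: 0.0.0.0:9598   # example listen address for scraping internal metrics locally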
Configuration
data_dir: /vector-data-dir
expire_metrics_secs: 300
api:
  enabled: true
  address: 0.0.0.0:8686
sources:
  vector_metrics:
    type: internal_metrics
  http_input:
    type: http_server
    address: 0.0.0.0:9090
    path: /es
    auth:
      username: user
      password: pw
    decoding:
      codec: bytes
    keepalive:
      max_connection_age_secs: 60
transforms:
  unnest_remap:
    type: "remap"
    inputs: ["http_input"]
    source: |
      . = parse_json!(string!(.message))
  json_remap:
    type: remap
    inputs: ["unnest_remap"]
    source: |
      if .maxAgeSec != "-" {
        .maxAgeSec = to_int!(.maxAgeSec)
      } else {
        .maxAgeSec = 0
      }
      .platform = "akamai"
      if .country == .serverCountry {
        .inCountry = true
      } else {
        .inCountry = false
      }
      if exists(.reqTimeSec) {
        .timestamp = to_float!(.reqTimeSec)
      }
      # parsing ipv4, ipv6
      if exists(.cliIP) {
        if is_ipv4!(.cliIP) {
          .ipVersion = "ipv4"
        } else if is_ipv6!(.cliIP) {
          .ipVersion = "ipv6"
        } else {
          .ipVersion = "unknown"
        }
      }
      # init cache status
      .cacheHit = "false"
      .cacheHitLayer = "false"
      # parsing edge cache status
      if exists(.cacheStatus) {
        if .cacheStatus != "0" {
          .cacheHit = "true"
          .cacheHitLayer = "edge"
        }
      }
      if exists(.isMidgress) {
        if .isMidgress != "0" {
          .cacheHit = "true"
          .cacheHitLayer = "midgress"
        }
        if .cacheStatus == "0" && .isMidgress != "0" {
          .cacheHit = "true"
          .cacheHitLayer = "midgress_only"
        }
      }
  json_metric:
    type: log_to_metric
    inputs: ["json_remap"]
    metrics:
      - type: counter
        field: statusCode
        namespace: datastream
        name: http_response_total
        tags:
          job: log_to_metric
          hostname: '{{ "{{reqHost}}" }}'
          country: '{{ "{{country}}" }}'
          method: '{{ "{{reqMethod}}" }}'
          status_code: '{{ "{{statusCode}}" }}'
      - type: counter
        field: cacheHit
        namespace: datastream
        name: http_cache_status_total
        tags:
          job: log_to_metric
          hostname: '{{ "{{reqHost}}" }}'
          country: '{{ "{{country}}" }}'
          cache_hit: '{{ "{{cacheHit}}" }}'
          cache_layer: '{{ "{{cacheHitLayer}}" }}'
      - type: counter
        field: ipVersion
        namespace: datastream
        name: http_ip_version_total
        tags:
          job: log_to_metric
          hostname: '{{ "{{reqHost}}" }}'
          country: '{{ "{{country}}" }}'
          ip_version: '{{ "{{ipVersion}}" }}'
      - type: gauge
        field: maxAgeSec
        name: http_max_age_second
        tags:
          job: log_to_metric
          hostname: '{{ "{{reqHost}}" }}'
      - type: summary
        field: throughput
        namespace: datastream
        name: http_throughput_kbps
        increment_by_value: true
        tags:
          job: log_to_metric
          hostname: '{{ "{{reqHost}}" }}'
          country: '{{ "{{country}}" }}'
          cache_hit: '{{ "{{cacheHit}}" }}'
          cache_layer: '{{ "{{cacheHitLayer}}" }}'
      # ..... and so on
  metric_remap:
    type: remap
    inputs: ["json_metric"]
    source: |
      .tags.forwarder = get_hostname!()
  metric_aggregate:
    type: aggregate
    inputs: ["metric_remap", "vector_metrics"]
    interval_ms: 60000 # 60s
sinks:
  metric_write:
    type: prometheus_remote_write
    inputs: ["metric_aggregate"]
    endpoint: endpoint1
    compression: snappy
    auth:
      strategy: basic
      user: user
      password: pw
    batch:
      max_events: 32
      timeout_secs: 1
    buffer:
      type: disk
      max_size: 2147483648 # 2GiB
      when_full: block
  kafka_sink:
    type: kafka
    inputs: ["json_remap"]
    bootstrap_servers: server1
    topic: datastream
    batch:
      max_events: 1024
      timeout_secs: 5
    buffer:
      type: disk
      max_size: 5368709120 # 5GiB
      when_full: drop_newest # default block
    compression: snappy
    encoding:
      codec: json
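A note on what I may try next (this transform is not in the config above): the log_to_metric tags (hostname, country, status_code, cache_hit, and so on) can fan out into a large number of distinct series, and every active series is state that has to be kept somewhere. As a sketch, a tag_cardinality_limit transform could be placed between metric_remap and metric_aggregate to put an upper bound on that while investigating; the limit value below is arbitrary:

transforms:
  metric_cardinality_guard:
    type: tag_cardinality_limit
    inputs: ["metric_remap"]
    mode: exact
    value_limit: 500                # arbitrary cap while testing
    limit_exceeded_action: drop_tag
  # metric_aggregate would then take ["metric_cardinality_guard", "vector_metrics"] as inputs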
Version
0.38.0
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response
Thanks for the detailed report! That does indeed look like a memory leak.
@jszwedko Hi! Are there any plans to address this issue?
@jszwedko We are also facing similar issues. Are there any plans to fix this?
Unfortunately I haven't had a chance to dig into this yet 😞 I welcome anyone else giving it a shot though.
That's okay. We are writing to Amazon Managed Prometheus, and when memory utilization is high and Vector's prometheus_remote_write sink reports a lot of errors, we found that Amazon Managed Prometheus is rejecting samples as "out of order". I wanted to understand whether the "out of order" errors could be caused by the prometheus_remote_write sink itself. Any thoughts?
Hmm, that seems unrelated to this issue if I'm understanding correctly. This issue is about there being a memory leak in the prometheus_remote_write sink. Vector should send samples as it receives them.
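If you want to rule out request reordering on the Vector side anyway, one thing to try (a sketch only; please double-check the sink's request options for your Vector version) is limiting the sink to a single in-flight request and seeing whether the out-of-order errors persist:

sinks:
  metric_write:
    type: prometheus_remote_write
    # ...existing endpoint/auth/batch/buffer options...
    request:
      concurrency: 1   # serialize requests while testing the out-of-order theory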