
prometheus_remote_write memory leak

Open rightly opened this issue 9 months ago • 6 comments

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Vector's prometheus_remote_write sink appears to cause a memory leak.

  • Memory usage: (screenshot)

  • Total memory usage by component: (screenshot)

    • prometheus_remote_write is increasing linearly. (screenshot)

I am running one Vector deployment that parses logs and creates metrics, and a separate Vector deployment that sends logs to Loki. Of these, only the Vector pods that use prometheus_remote_write are hitting OOM as shown above.

Please help me figure out how to track down this issue.
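
One isolation step I could try (a minimal sketch, assuming a test environment; the sink name metric_blackhole is made up) is to route the aggregated metrics to a blackhole sink in parallel, to see whether the memory growth follows the prometheus_remote_write sink or the rest of the pipeline:

  sinks:
    metric_blackhole:
      type: blackhole                # discards every event it receives
      inputs: ["metric_aggregate"]   # same input as metric_write, for comparison
      print_interval_secs: 60        # periodically log the number of discarded events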

Configuration

  data_dir: /vector-data-dir
  expire_metrics_secs: 300
  api:
    enabled: true
    address: 0.0.0.0:8686
  sources:
    vector_metrics:
      type: internal_metrics
    http_input:
      type: http_server
      address: 0.0.0.0:9090
      path: /es
      auth:
        username: user
        password: pw
      decoding:
        codec: bytes
      keepalive:
        max_connection_age_secs: 60

  transforms:
    unnest_remap:
      type: "remap"
      inputs: ["http_input"]
      source: |
        . = parse_json!(string!(.message))
    json_remap:
      type: remap
      inputs: ["unnest_remap"]
      source: |
        if .maxAgeSec != "-" {
          .maxAgeSec = to_int!(.maxAgeSec)
        } else {
          .maxAgeSec = 0
        }

        .platform = "akamai"
        if .country == .serverCountry {
          .inCountry = true
        } else {
          .inCountry = false
        }

        if exists(.reqTimeSec) {
          .timestamp = to_float!(.reqTimeSec)
        }

        # parsing ipv4, ipv6
        if exists(.cliIP) {
          if is_ipv4!(.cliIP) {
            .ipVersion = "ipv4"
          } else if is_ipv6!(.cliIP) {
            .ipVersion = "ipv6"
          } else {
            .ipVersion = "unknown"
          }
        }

        # init cache status
        .cacheHit = "false"
        .cacheHitLayer = "false"

        # parsing edge cache status
        if exists(.cacheStatus) {
          if .cacheStatus != "0" {
            .cacheHit = "true"
            .cacheHitLayer = "edge"
          }
        }

        if exists(.isMidgress) {
          if .isMidgress != "0" {
            .cacheHit = "true"
            .cacheHitLayer = "midgress"
          }

          if .cacheStatus == "0" && .isMidgress != "0" {
            .cacheHit = "true"
            .cacheHitLayer = "midgress_only"
          }
        }

    json_metric:
      type: log_to_metric
      inputs: ["json_remap"]
      metrics:
        - type: counter
          field: statusCode
          namespace: datastream
          name: http_response_total
          tags:
            job: log_to_metric
            hostname: '{{ "{{reqHost}}" }}'
            country: '{{ "{{country}}" }}'
            method: '{{ "{{reqMethod}}" }}'
            status_code: '{{ "{{statusCode}}" }}'

        - type: counter
          field: cacheHit
          namespace: datastream
          name: http_cache_status_total
          tags:
            job: log_to_metric
            hostname: '{{ "{{reqHost}}" }}'
            country: '{{ "{{country}}" }}'
            cache_hit: '{{ "{{cacheHit}}" }}'
            cache_layer: '{{ "{{cacheHitLayer}}" }}'

        - type: counter
          field: ipVersion
          namespace: datastream
          name: http_ip_version_total
          tags:
            job: log_to_metric
            hostname: '{{ "{{reqHost}}" }}'
            country: '{{ "{{country}}" }}'
            ip_version: '{{ "{{ipVersion}}" }}'

        - type: gauge
          field: maxAgeSec
          name: http_max_age_second
          tags:
            job: log_to_metric
            hostname: '{{ "{{reqHost}}" }}'

        - type: summary
          field: throughput
          namespace: datastream
          name: http_throughput_kbps
          increment_by_value: true
          tags:
            job: log_to_metric
            hostname: '{{ "{{reqHost}}" }}'
            country: '{{ "{{country}}" }}'
            cache_hit: '{{ "{{cacheHit}}" }}'
            cache_layer: '{{ "{{cacheHitLayer}}" }}'
          # ..... and so on
    metric_remap:
      type: remap
      inputs: ["json_metric"]
      source: |
        .tags.forwarder = get_hostname!()
    metric_aggregate:
      type: aggregate
      inputs: ["metric_remap", "vector_metrics"]
      interval_ms: 60000 # 60s

  sinks:
    metric_write:
      type: prometheus_remote_write
      inputs: ["metric_aggregate"]
      endpoint: endpoint1
      compression: snappy
      auth:
        strategy: basic
        user: user
        password: pw
      batch:
        max_events: 32
        timeout_secs: 1
      buffer:
        type: disk
        max_size: 2147483648 # 2GiB
        when_full: block
    kafka_sink:
      type: kafka
      inputs: ["json_remap"]
      bootstrap_servers: server1
      topic: datastream
      batch:
        max_events: 1024
        timeout_secs: 5
      buffer:
        type: disk
        max_size: 5368709120 # 5GiB
        when_full: drop_newest # default block
      compression: snappy
      encoding:
        codec: json

Version

0.38.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

rightly avatar May 10 '24 01:05 rightly

Thanks for the detailed report! That does indeed look like a memory leak.

jszwedko avatar May 10 '24 15:05 jszwedko

@jszwedko Hi! Are there any plans to fix this issue?

rightly avatar Jun 24 '24 02:06 rightly

@jszwedko We are also facing similar issues. Are there any plans to fix this?

shivmohith avatar Aug 01 '24 09:08 shivmohith

Unfortunately I haven't had a chance to dig into this yet 😞 I welcome anyone else giving it a shot though.

jszwedko avatar Aug 01 '24 14:08 jszwedko

That's okay. We are writing to Amazon Managed Prometheus. When memory utilization is high and the Vector prometheus_remote_write sink logs a lot of errors, we found that the errors reported by Amazon Managed Prometheus were "out of order" samples. Could "out of order" samples be caused by the prometheus_remote_write sink? Any thoughts?

shivmohith avatar Aug 01 '24 15:08 shivmohith

Hmm, that seems unrelated to this issue if I'm understanding correctly. This issue is about there being a memory leak in the prometheus_remote_write sink. Vector should send samples as it receives them.

jszwedko avatar Aug 01 '24 15:08 jszwedko
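
For anyone who also sees "out of order" errors when several Vector replicas write to the same remote-write endpoint, one common cause is multiple pods emitting identical series. A minimal sketch that tags each replica's samples distinctly, in the same spirit as the metric_remap transform in the config above (the transform and input names here are illustrative):

  transforms:
    add_instance_tag:
      type: remap
      inputs: ["json_metric"]   # illustrative input; point this at your metric-producing component
      source: |
        # tag each metric with this pod's hostname so replicas emit distinct series
        .tags.instance = get_hostname!()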