thread 'vector-worker' panicked: Formatting argument out of range

charlie0129 opened this issue 4 months ago • 5 comments

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Vector randomly panics with the following logs (see debug output).

Configuration

data_dir: /vector-data-dir
api:
  enabled: false
  address: 0.0.0.0:8686
  playground: true
sources:
  k8s:
    type: kubernetes_logs
    node_annotation_fields:
      node_labels: ""
    max_line_bytes: 2097152 # 2MiB
  internal_metrics:
    type: internal_metrics
transforms:
  parse_k8s_logs:
    type: remap
    inputs: [k8s]
    source: |
      .log = parse_json(.message) ??
          parse_key_value(.message, key_value_delimiter: "=", field_delimiter: " ", whitespace: "strict", accept_standalone_key: false) ??
          null
      .cluster = "xxx"
      .vendor  = "xxx"
      .region  = "xxx"
      .mtype   = "cpu"
  filter_k8s_logs: # kubernetes_logs collects all logs, we filter out old logs to save resources
    type: filter
    inputs: [parse_k8s_logs]
    condition: 'to_unix_timestamp(timestamp!(.timestamp)) > to_unix_timestamp(now()) - 2 * 60' # Only fetch logs from the last 2 minutes
sinks:
  exporter:
    type: prometheus_exporter
    address: 0.0.0.0:9090
    inputs: [internal_metrics]
  vlogs:
    type: elasticsearch
    inputs: [filter_k8s_logs]
    auth:
      strategy: basic
      user: xxx
      password: xxx
    endpoints: ["https://xxx/insert/elasticsearch"]
    mode: bulk
    api_version: v8
    compression: zstd
    batch: # batch size is increased to improve compression ratio
      max_bytes: 67108864 # 64MiB
      max_events: 99999999 # do not limit events, only bytes
      timeout_secs: 120 # 2 minutes
    buffer:
      type: disk
      max_size: 536870912 # 512MiB
      # max_events: 1000000 # do not limit events, only size
    distribution:
      retry_initial_backoff_secs: 5
    healthcheck:
      enabled: false
    request:
      headers:
        VL-Time-Field: timestamp
        VL-Stream-Fields: stream,kubernetes.pod_name,kubernetes.container_name,kubernetes.pod_namespace
        VL-Msg-Field: message,_message,msg,_msg,log.msg,log.message,log
        AccountID: "0"
        ProjectID: "0"

Version

0.49.0

Debug Output

vector thread 'vector-worker' panicked at /cargo/registry/src/index.crates.io-1949cf8c6b5b557f/nom-language-0.1.0/src/error.rs:210:37:
vector Formatting argument out of range
vector stack backtrace:
vector    0:     0x7fd565a16f76 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h755bd739a9a8330b
vector    1:     0x7fd564ca0e33 - core::fmt::write::ha6dc9f3e0fc90354
vector    2:     0x7fd565a1693f - std::io::Write::write_fmt::ha6203d3ee181dafd
vector    3:     0x7fd565a16d53 - std::sys::backtrace::BacktraceLock::print::hfba0c32ab558acd9
vector    4:     0x7fd565a1651a - std::panicking::rust_panic_with_hook::hcbca57a4f53780db
vector    5:     0x7fd565a5b9e5 - std::panicking::begin_panic_handler::{{closure}}::h6aad19e1296c91fb
vector    6:     0x7fd565a5b979 - std::sys::backtrace::__rust_end_short_backtrace::he6330e86917e67b5
vector    7:     0x7fd565a6124c - __rustc[4794b31dd7191200]::rust_begin_unwind
vector    8:     0x7fd56423beff - core::panicking::panic_fmt::h742c54b11a074f59
vector    9:     0x7fd568eb6ac6 - nom_language::error::convert_error::he3ab67a1b7e4aeea
vector   10:     0x7fd568ef16e2 - <vrl::compiler::expression::function::FunctionExpressionAdapter<T> as vrl::compiler::expression::Expression>::resolve::h5e8ce1a9e3c782c5
vector   11:     0x7fd568d4b157 - <vrl::compiler::expression::function_call::FunctionCall as vrl::compiler::expression::Expression>::resolve::hf240456b180ef0d3
vector   12:     0x7fd568d47367 - <vrl::compiler::expression::Expr as vrl::compiler::expression::Expression>::resolve::haf5bac83802dee05
vector   13:     0x7fd568d49e1c - <vrl::compiler::expression::op::Op as vrl::compiler::expression::Expression>::resolve::h9fa6378653a329fd
vector   14:     0x7fd568d473c2 - <vrl::compiler::expression::Expr as vrl::compiler::expression::Expression>::resolve::haf5bac83802dee05
vector   15:     0x7fd568d499f2 - <vrl::compiler::expression::op::Op as vrl::compiler::expression::Expression>::resolve::h9fa6378653a329fd
vector   16:     0x7fd568d473c2 - <vrl::compiler::expression::Expr as vrl::compiler::expression::Expression>::resolve::haf5bac83802dee05
vector   17:     0x7fd568d4755c - <vrl::compiler::expression::Expr as vrl::compiler::expression::Expression>::resolve::haf5bac83802dee05
vector   18:     0x7fd568d64d23 - vrl::compiler::runtime::Runtime::resolve::h62ab96db4818e672
vector   19:     0x7fd567f41206 - <vector::transforms::remap::Remap<Runner> as vector_core::transform::SyncTransform>::transform::h5ad9867476952795
vector   20:     0x7fd567f43fb7 - vector_core::transform::SyncTransform::transform_all::h89f994b585063f2d
vector   21:     0x7fd56883017f - <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll::h43d72c15bc7d20ac
vector   22:     0x7fd5688305a0 - tokio::runtime::task::raw::poll::h0f8f299a7709ec73
vector   23:     0x7fd565aca8ef - tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h1bbd3ed6414b4f30
vector   24:     0x7fd565ac8943 - tokio::runtime::scheduler::multi_thread::worker::run::h482fc547e605cbba
vector   25:     0x7fd565ad2189 - tokio::runtime::task::raw::poll::he4e728f05bd411ae
vector   26:     0x7fd565aa408b - std::sys::backtrace::__rust_begin_short_backtrace::hf4fc553021301660
vector   27:     0x7fd565aac294 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hc9d9af022b7da2da
vector   28:     0x7fd565a6173b - std::sys::pal::unix::thread::Thread::new::thread_start::hd5ffee2aac5a17bf
vector
vector thread 'vector-worker' panicked at src/topology/builder.rs:1008:30:
vector internal error: entered unreachable code: join error or bad poll
vector stack backtrace:
vector    0:     0x7fd565a16f76 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h755bd739a9a8330b
vector    1:     0x7fd564ca0e33 - core::fmt::write::ha6dc9f3e0fc90354
vector    2:     0x7fd565a1693f - std::io::Write::write_fmt::ha6203d3ee181dafd
vector    3:     0x7fd565a16d53 - std::sys::backtrace::BacktraceLock::print::hfba0c32ab558acd9
vector    4:     0x7fd565a1651a - std::panicking::rust_panic_with_hook::hcbca57a4f53780db
vector    5:     0x7fd565a5b9e5 - std::panicking::begin_panic_handler::{{closure}}::h6aad19e1296c91fb
vector    6:     0x7fd565a5b979 - std::sys::backtrace::__rust_end_short_backtrace::he6330e86917e67b5
vector    7:     0x7fd565a6124c - __rustc[4794b31dd7191200]::rust_begin_unwind
vector    8:     0x7fd56423beff - core::panicking::panic_fmt::h742c54b11a074f59
vector    9:     0x7fd56882d884 - vector::topology::builder::Runner::run_concurrently::{{closure}}::he4474e10ef3f6edc
vector   10:     0x7fd568825bc5 - vector::topology::builder::build_sync_transform::{{closure}}::h47ba1de404356d99
vector   11:     0x7fd568799a62 - <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll::h9eed0458fea094eb
vector   12:     0x7fd56879a7ee - tokio::runtime::task::raw::poll::ha150d43850a62f73
vector   13:     0x7fd565aca8b1 - tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h1bbd3ed6414b4f30
vector   14:     0x7fd565ac8943 - tokio::runtime::scheduler::multi_thread::worker::run::h482fc547e605cbba
vector   15:     0x7fd565ad2189 - tokio::runtime::task::raw::poll::he4e728f05bd411ae
vector   16:     0x7fd565aa408b - std::sys::backtrace::__rust_begin_short_backtrace::hf4fc553021301660
vector   17:     0x7fd565aac294 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hc9d9af022b7da2da
vector   18:     0x7fd565a6173b - std::sys::pal::unix::thread::Thread::new::thread_start::hd5ffee2aac5a17bf
vector 2025-08-18T11:02:04.465897Z ERROR transform{component_kind="transform" component_id=parse_k8s_logs component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.
vector 2025-08-18T11:02:04.465949Z  INFO vector: Vector has stopped.
vector 2025-08-18T11:02:04.468152Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="exporter, filter_k8s_logs, k8s, internal_metrics, vlogs, parse_k8s_logs" time_remaining="59 seconds left"

Example Data

No response

Additional Context

This is Vector running in Kubernetes (using official helm chart)

Values

role: "Agent"
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 20%
tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
args:
  - -w
  - --config-dir
  - /etc/vector/
containerPorts:
  - name: prom-exporter
    containerPort: 9090
    protocol: TCP
service:
  enabled: false
useMyConfig: true
resources:
  requests:
    memory: "512Mi"
    cpu: "100m"
  limits:
    memory: "4Gi"
    cpu: "8000m"
initContainers:
  - name: set-inotify
    image: alpine:3.22
    imagePullPolicy: IfNotPresent
    command:
      - sh
      - -c
      - sysctl -w fs.inotify.max_user_instances=4096
    securityContext:
      privileged: true
env:
  - name: RUST_BACKTRACE
    value: full
podMonitor:
  enabled: true
  additionalLabels:
    release: kube-prometheus-stack
podPriorityClassName: system-cluster-critical # slightly lower than system-node-critical

References

No response

charlie0129 avatar Aug 18 '25 11:08 charlie0129

It seems that parse_json or parse_key_value is faulty.

I resolved the issue by replacing

      .log = parse_json(.message) ??
          parse_key_value(.message, key_value_delimiter: "=", field_delimiter: " ", whitespace: "strict", accept_standalone_key: false) ??
          null

with

          msg = string(.message) ?? ""
          if is_json(msg, variant: "object") { # If JSON object
            .log = parse_json(msg) ?? null
          } else if contains(msg, "level=") &&  # If logfmt
            (contains(msg, "msg=") || contains(msg, "message=") || contains(msg, "log=")) &&
            (contains(msg, "time=") || contains(msg, "timestamp=") || contains(msg, "ts=")) {
            .log = parse_key_value(msg, key_value_delimiter: "=", field_delimiter: " ", whitespace: "strict", accept_standalone_key: false) ?? null
          }

It does some primitive checks up front to tell whether the message is JSON or logfmt before handing it to the parser, which prevents certain logs from breaking the parser.

charlie0129 avatar Aug 19 '25 10:08 charlie0129

Thank you for creating this issue, providing all the details and sharing a workaround! We should fix this issue properly by returning an error instead of panicking.
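
For illustration only (a sketch, not the actual patch; render_parse_error is a hypothetical helper name), one possible shape of such a guard around nom_language::error::convert_error:

use nom_language::error::VerboseError;

// Hypothetical defensive wrapper around nom-language's pretty-printer.
// A real fix would avoid the panic inside convert_error itself, e.g. by
// capping the caret column before formatting.
fn render_parse_error(input: &str, err: VerboseError<&str>) -> String {
    std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
        // Panics today when the error offset pushes the caret column past
        // what the formatter's runtime width argument supports.
        nom_language::error::convert_error(input, err)
    }))
    .unwrap_or_else(|_| "parse error (input too long to pretty-print)".to_string())
}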

pront avatar Aug 19 '25 13:08 pront

We've started getting this panic after upgrading to 0.47.0 of the helm chart.

I've attempted to use the workaround mentioned in this comment but I can't seem to get it to work.

thread 'vector-worker' panicked at /cargo/registry/src/index.crates.io-1949cf8c6b5b557f/nom-language-0.1.0/src/error.rs:210:37:
Formatting argument out of range
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread 'vector-worker' panicked at src/topology/builder.rs:1098:30:
internal error: entered unreachable code: join error or bad poll
2025-11-19T10:19:45.590408Z ERROR transform{component_kind="transform" component_id=parsedefault component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.
2025-11-19T10:19:45.590460Z  INFO vector: Vector has stopped.
2025-11-19T10:19:45.592129Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="parsejson, prometheus, internal, parsenginx, parseklog, nginxaccessfilter, parsekv, nginxlogfilter, k8ssplit, k8s, parsedefault, es-common-local, parsenginxlogs, es-access-local, sanitize" time_remaining="59 seconds left"
Configuration details
  data_dir: /vector-data-dir
  api:
    enabled: true
    address: 0.0.0.0:8686
    playground: true
  sources:
    k8s:
      address: 0.0.0.0:6000
      type: vector
      version: "2"
    internal:
      type: 'internal_metrics'
  transforms:
    k8ssplit:
      type: route
      inputs:
        - k8s
      route:
        klogs: .kubernetes.pod_labels."team/log" == "klog"
        json: .kubernetes.pod_labels."team/log" == "json"
        kvlogs: .kubernetes.pod_labels."team/log" == "keyvalue"
        nginx: .kubernetes.pod_labels."app.kubernetes.io/name" == "ingress-nginx"
        kubeproxy: .kubernetes.pod_labels."k8s-app" == "kube-proxy"
    parsejson:
      type: remap
      inputs: ['k8ssplit.json']
      source: |
        structured =
          parse_json(.message) ??
          { "vector_parsed": false}
        . = merge(., structured) ?? .
    parseklog:
      type: remap
      inputs: ['k8ssplit.klogs', 'k8ssplit.kubeproxy']
      source: |
        structured =
          parse_klog(.message) ??
          { "vector_parsed": false}
        . = merge(., structured)

    parsekv:
      type: remap
      inputs: ['k8ssplit.kvlogs']
      source: |
        structured =
          parse_key_value(.message, "=", " ", "strict", false) ??
          { "vector_parsed": false}
        . = merge(., structured)

    parsedefault:
      type: remap
      inputs: ['k8ssplit._unmatched']
      source: |
        structured =
          parse_json(.message) ??
          parse_klog(.message) ??
          parse_syslog(.message) ??
          parse_common_log(.message) ??
          parse_key_value(.message, "=", " ", "strict", false) ??
          { "vector_parsed": false}
        . = merge(., structured) ?? .

    ## sanitize logs by removing unwanted label / timestamp etc
    sanitize:
      type: remap
      inputs: ['parsekv', 'parseklog', 'parsejson', 'parsedefault'] ## add json later
      source: |
        ok = is_timestamp(.timestamp)
        if !ok {
            .timestamp=now()
        }
        ok = exists(.kubernetes.pod_labels."app.kubernetes.io/name")
        if ok {
            .application= .kubernetes.pod_labels."app.kubernetes.io/name"
        } else {
            .application= .kubernetes.container_name
        }

        if exists(.msg) {
          .message=.msg
          del(.msg)
        }
        if exists(.kubernetes.pod_labels."app") {
          ."app"= .kubernetes.pod_labels."app"
        }
        if exists(.kubernetes.pod_labels."team") {
          ."team"= .kubernetes.pod_labels."team"
        }
        if exists(.kubernetes.pod_labels."domain") {
          ."domain"= .kubernetes.pod_labels."domain"
        }
        if exists(.kubernetes.pod_labels."service-type") {
          ."service-type"= .kubernetes.pod_labels."service-type"
        }
        del(.kubernetes.pod_labels)
        del(.kubernetes.node_labels)
        del(.kubernetes.namespace_labels)
        del(.kubernetes.pod_annotations)
        del(.time)

    ## first parse to address access logs
    parsenginx:
      type: remap
      inputs: ['k8ssplit.nginx']
      source: |
        .access_log = true
        ok = exists(.kubernetes.pod_labels."app.kubernetes.io/name")
        if ok {
            .application= .kubernetes.pod_labels."app.kubernetes.io/name"
        } else {
            .application= .kubernetes.container_name
        }
        structured =
          parse_grok(.message, "%{IPORHOST:remote_addr} - %{USERNAME:remote_user} \\\\[%{HTTPDATE:time_local}] \\"%{DATA:method} %{DATA:request}\\" %{INT:status} %{NUMBER:bytes_sent} \\"%{DATA:http_referer}\\" \\"%{DATA:http_user_agent}\\" %{INT:request_length} %{NUMBER:request_time} \\\\[%{DATA:proxy_upstream_name}] \\\\[%{DATA:proxy_alternative_upstream_name}] %{DATA:upstream_addr} %{INT:upstream_response_length} %{NUMBER:upstream_response_time} %{INT:upstream_status} %{DATA:req_id}") ??
          parse_grok(.message, "%{IPORHOST:remote_addr} - %{USERNAME:remote_user} \\\\[%{HTTPDATE:time_local}] \\"%{DATA:method} %{DATA:request}\\" %{INT:status} %{NUMBER:bytes_sent} \\"%{DATA:http_referer}\\" \\"%{DATA:http_user_agent}\\" %{INT:request_length} %{NUMBER:request_time} \\\\[%{DATA:proxy_upstream_name}] \\\\[%{DATA:proxy_alternative_upstream_name}] %{GREEDYDATA:other}") ??
          { "access_log": false}
        . = merge(., structured)
        if .access_log == true {
          del(.message)
        }
        del(.kubernetes.node_labels)
        del(.kubernetes.namespace_labels)
        del(.kubernetes.pod_annotations)
        del(.time)

    ## filter access logs
    nginxaccessfilter:
      type: filter
      inputs:
        - parsenginx
      condition:
        type: 'vrl'
        source: '.access_log == true'

    nginxlogfilter:
      type: filter
      inputs:
        - parsenginx
      condition:
        type: 'vrl'
        source: '.access_log == false'

    ## parse klog if its not access logs
    parsenginxlogs:
      type: remap
      inputs: ['nginxlogfilter']
      source: |
        structured =
          parse_klog(.message) ??
          { "vector_parsed": false}
        . = merge(., structured)
        del(.access_log)

  sinks:
    es-common-local:
      type: elasticsearch
      inputs: ['sanitize', 'parsenginxlogs']
      compression: gzip
      endpoints: ['${elasticSearchScheme}://${elasticSearchHost}:${elasticSearchPort}']
      mode: bulk
      buffer:
        type: disk
        max_size: ${esCommonBufferSizeGb * 1024 * 1024 * 1024}
      batch:
        max_bytes: 15728640  # 15MB
        timeout: 1
      bulk:
        index: 'eks-${environment}-vector-{{ "{{" }} kubernetes.pod_namespace {{ "}}" }}-%F'
      healthcheck:
        enabled: true
      auth:
        strategy: basic
        user: '${elasticSearchUser}'
        password: '${elasticSearchPassword}'
    es-access-local:
      type: elasticsearch
      inputs: ['nginxaccessfilter']
      compression: gzip
      endpoints: ['${elasticSearchScheme}://${elasticSearchHost}:${elasticSearchPort}']
      mode: bulk
      buffer:
        type: disk
        max_size: ${esAccessBufferSizeGb * 1024 * 1024 * 1024}
      bulk:
        index: 'accesslogs-${environment}-%F'
      healthcheck:
        enabled: true
      auth:
        strategy: basic
        user: '${elasticSearchUser}'
        password: '${elasticSearchPassword}'
    prometheus:
      type: 'prometheus_exporter'
      inputs: ['internal']
      address: '0.0.0.0:9090'

Kalmonipa avatar Nov 20 '25 12:11 Kalmonipa

Pure nom reproducer of the issue:

use nom_language::error::{VerboseError, VerboseErrorKind};

fn main() {
    // Create input with a large offset where the error occurs
    let input = "x".repeat(200_000) + "parsing fails here";
    let input_str = input.as_str();

    // Create substring starting far from the beginning
    // This simulates what happens when UTF-8 boundary calculations go wrong
    let error_location = &input_str[199_000..];

    // Construct the error as nom would
    let error = VerboseError {
        errors: vec![(
            error_location,
            VerboseErrorKind::Nom(nom::error::ErrorKind::Tag),
        )],
    };

    // This will panic!
    // nom-language calculates: column_number = line.offset(substring) + 1 = 199001
    // Then tries: `write!(&mut result, "{caret:>column$}", caret='^', column=199001)` in nom-language/src/error.rs
    // The width argument (199001) exceeds what the formatter accepts, so the
    // write! panics with "Formatting argument out of range"
    nom_language::error::convert_error(input_str, error);
}

I wasn't able to generate a Vector reproducer, but if someone has a case where a VRL function produces the nom error output and the offset of the error is big enough (at least 70k+ in local testing), this should be reproducible. It may also be that failed VRL functions generate the error string even when ?? is used.
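
As a side note, the panic itself can be reproduced with plain std formatting; a minimal sketch, assuming a recent Rust toolchain where runtime format widths are capped at u16::MAX (65535), which lines up with the ~70k threshold above:

fn main() {
    // Any runtime width above 65535 makes the formatter panic with
    // "Formatting argument out of range" instead of padding with spaces.
    let column: usize = 70_000;
    println!("{caret:>column$}", caret = '^', column = column);
}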

thomasqueirozb avatar Nov 20 '25 18:11 thomasqueirozb

Update: got a Vector reproducer too

create.py

# Create a CEF line with 80K+ characters that will fail parsing
cef_start = "CEF:0|Security|threatmanager|1.0|100|"
long_name = "A" * 80000  # 80K name field
invalid_part = "BROKEN"  # Missing pipes - triggers parser error
print(cef_start + long_name + invalid_part + "\n")

test_crash.yaml

data_dir: /tmp/vector-crash-test

sources:
  long_file:
    type: file
    include:
      - test_long_line.txt
    read_from: beginning
    max_line_bytes: 200000

transforms:
  parse_logs:
    type: remap
    inputs: [long_file]
    source: |
      # Force error with ! operator - no fallback
      .log = parse_cef!(.message)

sinks:
  console:
    type: console
    inputs: [parse_logs]
    encoding:
      codec: json

rm -rf /tmp/vector-crash-test && mkdir -p /tmp/vector-crash-test && python create.py > test_long_line.txt && timeout 10 cargo run -- --config test_crash.yaml

output:

...
thread 'vector-worker' (17893211) panicked at /Users/thomas.schneider/.local/share/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/nom-language-0.1.0/src/error.rs:210:37:
Formatting argument out of range
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread 'vector-worker' (17893211) panicked at src/topology/builder.rs:1097:30:
internal error: entered unreachable code: join error or bad poll
2025-11-20T19:17:18.232622Z ERROR transform{component_kind="transform" component_id=parse_logs component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.

thomasqueirozb avatar Nov 20 '25 19:11 thomasqueirozb