vector icon indicating copy to clipboard operation
vector copied to clipboard

Unable to recover checkpoint data from interrupted process

Open ShahroZafar opened this issue 1 year ago • 0 comments

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We have vector running as daemonset in AWS.

After the OOMkill (OOMkilled for 1000 events/sec only with limit of 1024Mi which is a question in itself), we are seeing this error

2024-06-28T12:12:42.050066Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}:file_server: file_source::checkpointer: Unable to recover checkpoint data from interrupted process. error=EOF while parsing a value at line 1 column 0

What can actually lead to this error and if there is a way to avoid it?

Configuration

customConfig:
  data_dir: /vector-data-dir
  expire_metrics_secs: 300
  acknowledgements:
    enabled: true
  api:
    enabled: true
    address: 0.0.0.0:8686
    playground: false
  sources:
    internal_metrics:
      type: internal_metrics
    kubernetes_logs:
      type: kubernetes_logs
      glob_minimum_cooldown_ms: 2000
      ingestion_timestamp_field: "ingest_timestamp"
  transforms:
    dedot_keys:
      type: remap
      inputs:
        - kubernetes_logs
      source: |
        . = map_keys(., recursive: true) -> |key| { replace(key, ".", "_") }
  sinks:
    kafka:
      type: kafka
      inputs:
        - dedot_keys
      bootstrap_servers: kafka:9092
      topic: vector
      encoding:
        codec: json
      compression: zstd
      batch:
        timeout_secs: 1
        max_bytes: 1000000
        max_events: 5000
      librdkafka_options:
        client.id: "vector"
        request.required.acks: "1"
      message_timeout_ms: 0
      buffer:
        type: memory
        when_full: block
        max_events: 500
    prometheus_exporter:
      type: prometheus_exporter
      flush_period_secs: 60
      inputs:
        - internal_metrics
      address: 0.0.0.0:9090
      buffer:
        type: memory
        when_full: block
        max_events: 500

Version

0.39.0

Debug Output

No response

Example Data

2024-06-28T12:12:42.034854Z INFO vector::app: Log level is enabled. level="info" 2024-06-28T12:12:42.036718Z INFO vector::config::watcher: Creating configuration file watcher. 2024-06-28T12:12:42.037583Z INFO vector::config::watcher: Watching configuration files. 2024-06-28T12:12:42.037625Z INFO vector::app: Loading configs. paths=["/etc/vector"] 2024-06-28T12:12:42.039203Z WARN vector::config: Source has acknowledgements enabled by a sink, but acknowledgements are not supported by this source. Silent data loss could occur. source="kubernetes_logs" sink="kafka" 2024-06-28T12:12:42.039221Z WARN vector::config: Source has acknowledgements enabled by a sink, but acknowledgements are not supported by this source. Silent data loss could occur. source="internal_metrics" sink="prometheus_exporter" 2024-06-28T12:12:42.040969Z INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Obtained Kubernetes Node name to collect logs for (self). self_node_name="ip-10-120-121-80.ec2.internal" 2024-06-28T12:12:42.047783Z INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Including matching files. ret=["/*"] 2024-06-28T12:12:42.047800Z INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Excluding matching files. ret=["/.gz", "**/.tmp"] 2024-06-28T12:12:42.049402Z INFO vector::topology::running: Running healthchecks. 2024-06-28T12:12:42.049477Z INFO vector: Vector has started. debug="false" version="0.39.0" arch="x86_64" revision="73da9bb 2024-06-17 16:00:23.791735272" 2024-06-28T12:12:42.049485Z INFO vector::topology::builder: Healthcheck passed. 2024-06-28T12:12:42.049757Z INFO vector::sinks::prometheus::exporter: Building HTTP server. address=0.0.0.0:9090 2024-06-28T12:12:42.050066Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}:file_server: file_source::checkpointer: Unable to recover checkpoint data from interrupted process. error=EOF while parsing a value at line 1 column 0 2024-06-28T12:12:42.050221Z WARN librdkafka: librdkafka: CONFWARN [thrd:app]: Configuration property request.required.acks is a producer property and will be ignored by this consumer instance 2024-06-28T12:12:42.050532Z INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}:file_server: file_source::checkpointer: Loaded checkpoint data.

Additional Context

image image image

References

No response

ShahroZafar avatar Jun 28 '24 12:06 ShahroZafar