Unable to recover checkpoint data from interrupted process
Problem
We run Vector as a DaemonSet in AWS.
After an OOM kill (the pod was OOM-killed at only 1000 events/sec with a memory limit of 1024Mi, which is a question in itself), we see this error:
2024-06-28T12:12:42.050066Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}:file_server: file_source::checkpointer: Unable to recover checkpoint data from interrupted process. error=EOF while parsing a value at line 1 column 0
What can cause this error, and is there a way to avoid it?
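For context, the message text "EOF while parsing a value at line 1 column 0" is consistent with a JSON parser being handed an empty file, and the "Loaded checkpoint data." line later in the same log suggests that a fallback to another copy of the data succeeded. The Python sketch below only illustrates that general pattern (try a leftover temporary file first, fall back to the main file). It is not Vector's implementation: the file names `checkpoints.new.json` and `checkpoints.json`, and the helpers `load_checkpoints` and `demo`, are hypothetical and chosen for illustration.

```python
import json
import os
import tempfile


def load_checkpoints(data_dir):
    """Prefer a leftover temp file, fall back to the main file.

    Both file names are hypothetical; they are not taken from Vector.
    """
    tmp_path = os.path.join(data_dir, "checkpoints.new.json")  # hypothetical name
    main_path = os.path.join(data_dir, "checkpoints.json")     # hypothetical name

    if os.path.exists(tmp_path):
        try:
            with open(tmp_path) as f:
                return json.load(f), "recovered from temp file"
        except json.JSONDecodeError as err:
            # An empty or truncated temp file ends up here.
            print(f"unable to recover checkpoint data: {err}")

    with open(main_path) as f:
        return json.load(f), "loaded main file"


def demo():
    with tempfile.TemporaryDirectory() as data_dir:
        # Main file left behind by the last successful checkpoint write.
        with open(os.path.join(data_dir, "checkpoints.json"), "w") as f:
            json.dump({"example.log": 1234}, f)
        # Zero-byte temp file, as if the process died before writing any data.
        open(os.path.join(data_dir, "checkpoints.new.json"), "w").close()
        return load_checkpoints(data_dir)


if __name__ == "__main__":
    print(demo())
```

In this sketch the empty temporary file fails to parse, the failure is reported, and the older main file is used instead, which mirrors the ERROR-then-INFO sequence in the log.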
Configuration
customConfig:
  data_dir: /vector-data-dir
  expire_metrics_secs: 300
  acknowledgements:
    enabled: true
  api:
    enabled: true
    address: 0.0.0.0:8686
    playground: false
  sources:
    internal_metrics:
      type: internal_metrics
    kubernetes_logs:
      type: kubernetes_logs
      glob_minimum_cooldown_ms: 2000
      ingestion_timestamp_field: "ingest_timestamp"
  transforms:
    dedot_keys:
      type: remap
      inputs:
        - kubernetes_logs
      source: |
        . = map_keys(., recursive: true) -> |key| { replace(key, ".", "_") }
  sinks:
    kafka:
      type: kafka
      inputs:
        - dedot_keys
      bootstrap_servers: kafka:9092
      topic: vector
      encoding:
        codec: json
      compression: zstd
      batch:
        timeout_secs: 1
        max_bytes: 1000000
        max_events: 5000
      librdkafka_options:
        client.id: "vector"
        request.required.acks: "1"
      message_timeout_ms: 0
      buffer:
        type: memory
        when_full: block
        max_events: 500
    prometheus_exporter:
      type: prometheus_exporter
      flush_period_secs: 60
      inputs:
        - internal_metrics
      address: 0.0.0.0:9090
      buffer:
        type: memory
        when_full: block
        max_events: 500
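As a reading aid for the `dedot_keys` transform in the configuration above, here is a rough Python analogue of what the VRL expression appears to do: replace every "." in a key with "_", at every nesting level. This is a sketch, not VRL. The function name `dedot_keys` is borrowed from the transform ID, the sample event is made up, and descending into lists is an assumption about how `recursive: true` behaves.

```python
def dedot_keys(value):
    """Return a copy of `value` with "." replaced by "_" in every dict key."""
    if isinstance(value, dict):
        return {key.replace(".", "_"): dedot_keys(child) for key, child in value.items()}
    if isinstance(value, list):
        # Assumption: nested lists are traversed as well.
        return [dedot_keys(item) for item in value]
    return value


if __name__ == "__main__":
    # Made-up event, for illustration only.
    event = {"kubernetes": {"pod_labels": {"app.kubernetes.io/name": "vector"}}}
    print(dedot_keys(event))
```

Only keys are rewritten; values that contain dots are left untouched.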
Version
0.39.0
Debug Output
No response
Example Data
2024-06-28T12:12:42.034854Z INFO vector::app: Log level is enabled. level="info"
2024-06-28T12:12:42.036718Z INFO vector::config::watcher: Creating configuration file watcher.
2024-06-28T12:12:42.037583Z INFO vector::config::watcher: Watching configuration files.
2024-06-28T12:12:42.037625Z INFO vector::app: Loading configs. paths=["/etc/vector"]
2024-06-28T12:12:42.039203Z WARN vector::config: Source has acknowledgements enabled by a sink, but acknowledgements are not supported by this source. Silent data loss could occur. source="kubernetes_logs" sink="kafka"
2024-06-28T12:12:42.039221Z WARN vector::config: Source has acknowledgements enabled by a sink, but acknowledgements are not supported by this source. Silent data loss could occur. source="internal_metrics" sink="prometheus_exporter"
2024-06-28T12:12:42.040969Z INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Obtained Kubernetes Node name to collect logs for (self). self_node_name="ip-10-120-121-80.ec2.internal"
2024-06-28T12:12:42.047783Z INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Including matching files. ret=["/*"]
2024-06-28T12:12:42.047800Z INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Excluding matching files. ret=["/.gz", "**/.tmp"]
2024-06-28T12:12:42.049402Z INFO vector::topology::running: Running healthchecks.
2024-06-28T12:12:42.049477Z INFO vector: Vector has started. debug="false" version="0.39.0" arch="x86_64" revision="73da9bb 2024-06-17 16:00:23.791735272"
2024-06-28T12:12:42.049485Z INFO vector::topology::builder: Healthcheck passed.
2024-06-28T12:12:42.049757Z INFO vector::sinks::prometheus::exporter: Building HTTP server. address=0.0.0.0:9090
2024-06-28T12:12:42.050066Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}:file_server: file_source::checkpointer: Unable to recover checkpoint data from interrupted process. error=EOF while parsing a value at line 1 column 0
2024-06-28T12:12:42.050221Z WARN librdkafka: librdkafka: CONFWARN [thrd:app]: Configuration property request.required.acks is a producer property and will be ignored by this consumer instance
2024-06-28T12:12:42.050532Z INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}:file_server: file_source::checkpointer: Loaded checkpoint data.
Additional Context
References
No response