gcp_cloud_storage sink: thread 'vector-worker' panicked at 'Reader encountered unrecoverable error: Io' - Vector Aggregator
Problem
The Vector aggregator intermittently panics and is restarted for the gcp_cloud_storage sink.
Vector Version
vector 0.26.0
Log
2023-11-03T08:16:57.126512Z INFO vector::topology::builder: Healthcheck passed.
2023-11-03T08:16:57.144758Z INFO vector::topology::builder: Healthcheck passed.
2023-11-03T08:16:57.160504Z INFO vector::topology::builder: Healthcheck passed.
2023-11-03T13:47:06.246087Z ERROR sink{component_kind="sink" component_id=cloud_storage component_type=gcp_cloud_storage component_name=cloud_storage}:sink{buffer_type="disk"}: vector_buffers::internal_events: Events dropped. count=18446744073709487194 intentional=false reason=corrupted_events stage=0
thread 'vector-worker' panicked at 'Reader encountered unrecoverable error: Io { source: Os { code: 2, kind: NotFound, message: "No such file or directory" } }', /project/lib/vector-buffers/src/topology/channel/receiver.rs:59:33
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2023-11-03T13:47:08.979414Z ERROR sink{component_kind="sink" component_id=cloud_storage component_type=gcp_cloud_storage component_name=cloud_storage}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.
2023-11-03T13:47:08.979575Z INFO vector: Vector has stopped.
Additional Context
Flow:
Kubernetes Pods --> Vector Agent --> Vector Aggregator --> Google Cloud Storage Bucket
The Vector aggregator works fine for a while but fails intermittently after a few hours and is restarted with the above error. A good amount of CPU and memory is available on the nodes running the Vector aggregator pods.
Configuration
customConfig:
  data_dir: /vector-data-dir
  api:
    enabled: true
    address: 0.0.0.0:8686
    playground: false
  sources:
    agent:
      address: 0.0.0.0:6000
      type: vector
    vector_aggregator_logs:
      type: "internal_logs"
  transforms:
    agent_remap_logs:
      type: remap
      inputs:
        - agent
      source: |-
        structured, err = parse_json(.message)
        if err == null {
          ., err = merge(., structured)
          if err == null {
            del(.message)
          }
        }
    application_logs:
      type: filter
      inputs:
        - agent_remap_logs
      condition:
        type: "vrl"
        source: ".source_type == \"kubernetes_logs\""
    vector_agent_logs:
      type: filter
      inputs:
        - application_logs
      condition:
        type: "vrl"
        source: ".source_type == \"internal_logs\""
  sinks:
    cloud_storage:
      type: gcp_cloud_storage
      inputs: [ application_logs ]
      compression: gzip
      buffer:
        max_size: 4294967296
        type: disk
        when_full: block
      batch:
        max_bytes: 209715200
        timeout_secs: 60
      request:
        concurrency: adaptive
        rate_limit_duration_secs: 60
      encoding:
        codec: json
        framing:
          method: "newline_delimited"
      metadata:
        Content-Encoding: ''
      bucket: abhinav-new-logs-bucket
      key_prefix: 'k8s/year=%Y/month=%m/day=%d/hour=%H/%M_'
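For context, the agent side of the flow above is not included here. A minimal sketch of what the agent-side vector sink could look like (assumed, not taken from this setup; the Service name is illustrative):

sources:
  k8s_logs:
    type: kubernetes_logs
sinks:
  aggregator:
    # Forwards to the aggregator's `agent` source on port 6000.
    type: vector
    inputs: [ k8s_logs ]
    address: vector-aggregator.vector.svc.cluster.local:6000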
The error is likely occurring because the buffer file that Vector is using ceases to exist. How is /vector-data-dir mounted to the pod?
@StephenWakely /vector-data-dir is mounted in the form of a persistent disk (Persistent Volume). Vector aggregator is deployed as a stateful set backed by persistent disks.
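Roughly, the setup looks like the following sketch (names, image tag, and storage size are illustrative, not the exact manifests):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vector-aggregator            # illustrative name
spec:
  serviceName: vector-aggregator
  replicas: 3
  selector:
    matchLabels:
      app: vector-aggregator
  template:
    metadata:
      labels:
        app: vector-aggregator
    spec:
      containers:
        - name: vector
          image: timberio/vector:0.26.0-debian
          volumeMounts:
            - name: data
              mountPath: /vector-data-dir   # Vector's data_dir
  # One PersistentVolumeClaim (GCP persistent disk) per pod.
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 10Gi                   # illustrative size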
Can you think of any reason why Vector may lose access to that disk? When Vector is restarted do you have any data loss?
@StephenWakely We have a huge volume of data going through, and it is tough for us to check whether there was data loss during that particular restart. However, I assume the log line below hints at data loss:
vector_buffers::internal_events: Events dropped. count=18446744073709487194 intentional=false reason=corrupted_events stage=0
To my understanding, Vector will not lose access to the disk. The disk is a GCP-managed Persistent Volume, so there is minimal chance of it becoming unavailable.
The count is a bit suspicious. That's a lot of events to drop!
Are you able to confirm if this error occurs on the latest version of Vector (0.33.1)? 0.26 is quite old.
@StephenWakely We have multiple other clusters running Vector 0.32.0, and a similar error is seen on most of them.
Here's a log snippet from one of the clusters:
2023-11-04T15:35:24.377344Z INFO vector::app: Log level is enabled. level="INFO"
2023-11-04T15:35:24.377372Z WARN vector::app: DEPRECATED The openssl legacy provider provides algorithms and key sizes no longer recommended for use.
2023-11-04T15:35:24.377676Z INFO vector::app: Loaded openssl legacy provider.
2023-11-04T15:35:24.378435Z INFO vector::config::watcher: Creating configuration file watcher.
2023-11-04T15:35:24.378790Z INFO vector::config::watcher: Watching configuration files.
2023-11-04T15:35:24.378873Z INFO vector::app: Loading configs. paths=["/etc/vector"]
2023-11-04T15:35:24.905010Z INFO vector::topology::running: Running healthchecks.
2023-11-04T15:35:24.905209Z INFO vector: Vector has started. debug="false" version="0.32.0" arch="x86_64" revision="1b403e1 2023-08-15 14:56:36.089460954"
2023-11-04T15:35:24.905270Z INFO source{component_kind="source" component_id=agent component_type=vector component_name=agent}: vector::sources::util::grpc: Building gRPC server. address=0.0.0.0:6000
2023-11-04T15:35:24.907948Z INFO vector::internal_events::api: API server running. address=0.0.0.0:8686 playground=off
2023-11-04T15:35:26.282392Z INFO vector::topology::builder: Healthcheck passed.
2023-11-04T15:35:26.311784Z INFO vector::topology::builder: Healthcheck passed.
2023-11-04T15:35:26.409646Z INFO vector::topology::builder: Healthcheck passed.
2023-11-07T06:45:57.670473Z WARN sink{component_kind="sink" component_id=cloud_storage component_type=gcp_cloud_storage component_name=cloud_storage}:request{request_id=122254}: vector::sinks::util::retries: Retrying after response. reason=500 Internal Server Error internal_log_rate_limit=true
thread 'vector-worker' panicked at 'Reader encountered unrecoverable error: Io { source: Os { code: 2, kind: NotFound, message: "No such file or directory" } }', /project/lib/vector-buffers/src/topology/channel/receiver.rs:59:33
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2023-11-07T09:57:20.565298Z ERROR sink{component_kind="sink" component_id=cloud_storage component_type=gcp_cloud_storage component_name=cloud_storage}:sink{buffer_type="disk"}: vector_buffers::internal_events: Events dropped. count=18446744073709508238 intentional=false reason=corrupted_events stage=0
2023-11-07T09:57:21.474346Z ERROR sink{component_kind="sink" component_id=cloud_storage component_type=gcp_cloud_storage component_name=cloud_storage}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.
2023-11-07T09:57:21.474457Z INFO vector: Vector has stopped.
I also encountered this problem when using the s3 sink. Any suggestions?
The issue is not specific to any particular sink. It seems to occur when Vector loses access to the file it is storing the buffer in. Can you provide more details about the volume the data_dir is pointing to?
@StephenWakely The data_dir is pointing to a persistent volume in EKS.
Here is the log:
2023-11-16T19:34:55.434011Z INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=info,rdkafka=info,buffers=info,lapin=info,kube=info"
2023-11-16T19:34:55.434634Z INFO vector::config::watcher: Creating configuration file watcher.
2023-11-16T19:34:55.434823Z INFO vector::config::watcher: Watching configuration files.
2023-11-16T19:34:55.434867Z INFO vector::app: Loading configs. paths=["/etc/vector"]
2023-11-16T19:34:58.313122Z INFO vector::topology::running: Running healthchecks.
2023-11-16T19:34:58.313284Z INFO vector: Vector has started. debug="false" version="0.33.0" arch="aarch64" revision="89605fb 2023-09-27 14:18:24.180809939"
2023-11-16T19:34:58.314661Z INFO vector::internal_events::api: API server running. address=127.0.0.1:8686 playground=off
2023-11-16T19:34:58.318700Z INFO vector::topology::builder: Healthcheck disabled.
2023-11-16T19:34:58.486112Z INFO vector::topology::builder: Healthcheck passed.
2023-11-16T19:34:58.486309Z INFO vector::topology::builder: Healthcheck passed.
2023-11-16T19:34:58.493455Z INFO vector::topology::builder: Healthcheck passed.
2023-11-17T06:48:24.361852Z WARN sink{component_kind="sink" component_id=block_s3 component_type=aws_s3 component_name=block_s3}:request{request_id=2568}: vector::sinks::util::retries: Retrying after error. error=dispatch failure internal_log_rate_limit=true
thread 'vector-worker' panicked at 'Reader encountered unrecoverable error: Io { source: Os { code: 2, kind: NotFound, message: "No such file or directory" } }', /project/lib/vector-buffers/src/topology/channel/receiver.rs:59:33
stack backtrace:
0: 0xaaaac6f694a0 - std::backtrace_rs::backtrace::libunwind::trace::hc366d13e3075cfeb
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
1: 0xaaaac6f694a0 - std::backtrace_rs::backtrace::trace_unsynchronized::hb9a35e1f782586fe
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0xaaaac6f694a0 - std::sys_common::backtrace::_print_fmt::hae0d6c377f414e8d
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:65:5
3: 0xaaaac6f694a0 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h0f8916ea7cac138e
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:44:22
4: 0xaaaac6738860 - core::fmt::rt::Argument::fmt::hf491ec0da00c9b40
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/rt.rs:138:9
5: 0xaaaac6738860 - core::fmt::write::h075f235dea7f2ae2
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/mod.rs:1094:21
6: 0xaaaac6f3107c - std::io::Write::write_fmt::hed461f96db027bb0
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/io/mod.rs:1714:15
7: 0xaaaac6f6c104 - std::sys_common::backtrace::_print::h9f6cc87d40ba87f7
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:47:5
8: 0xaaaac6f6c104 - std::sys_common::backtrace::print::hc36324e2f472d029
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:34:9
9: 0xaaaac6f6bd04 - std::panicking::default_hook::{{closure}}::h8d8f8741b904046c
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:269:22
10: 0xaaaac6f6cc2c - std::panicking::default_hook::h6fd0090c1af86ae9
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:288:9
11: 0xaaaac6f6cc2c - std::panicking::rust_panic_with_hook::h25082e1f49b04873
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:705:13
12: 0xaaaac6f6c760 - std::panicking::begin_panic_handler::{{closure}}::hd64c14bd1869359d
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:597:13
13: 0xaaaac6f6c6d4 - std::sys_common::backtrace::__rust_end_short_backtrace::h8613664e493d4a0e
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:151:18
14: 0xaaaac6f6c6c8 - rust_begin_unwind
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:593:5
15: 0xaaaac617ff10 - core::panicking::panic_fmt::h354dad60a2e185d5
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/panicking.rs:67:14
16: 0xaaaac8affef4 - vector_buffers::topology::channel::receiver::ReceiverAdapter<T>::next::{{closure}}::hd01ea73e57558309
17: 0xaaaac8aff180 - vector_buffers::topology::channel::receiver::BufferReceiver<T>::next::{{closure}}::hb2fd72c713890c51
18: 0xaaaac8e33ae0 - vector_buffers::topology::channel::receiver::make_recv_future::{{closure}}::h8bde19c02063e358
19: 0xaaaac8e51bf4 - <vector_buffers::topology::channel::receiver::BufferReceiverStream<T> as futures_core::stream::Stream>::poll_next::hc22bfbb6df1b82e8
20: 0xaaaac8e512b8 - <vector::utilization::Utilization<S> as futures_core::stream::Stream>::poll_next::h797c2c8c6db0517e
21: 0xaaaac8e50d18 - <stream_cancel::combinator::TakeUntilIf<S,F> as futures_core::stream::Stream>::poll_next::h4d57577fca8bce3f
22: 0xaaaac717fc94 - <futures_util::stream::stream::FlatMap<St,U,F> as futures_core::stream::Stream>::poll_next::h317498fdf5ec7754
23: 0xaaaac749b570 - <vector_core::stream::partitioned_batcher::PartitionedBatcher<St,Prt,KT> as futures_core::stream::Stream>::poll_next::hcaafe2d3da438c2f
24: 0xaaaac7490a90 - <vector::sinks::s3_common::sink::S3Sink<Svc,RB> as vector_core::sink::StreamSink<vector_core::event::Event>>::run::{{closure}}::h62aa3ccbf71cb5c0
25: 0xaaaac748e77c - <vector_core::sink::EventStream<T> as vector_core::sink::StreamSink<vector_core::event::array::EventArray>>::run::{{closure}}::h8a41f58bf9c39efa
26: 0xaaaac8e4ff38 - vector::topology::builder::Builder::build_sinks::{{closure}}::{{closure}}::hb14f5ad14bc172c1
27: 0xaaaac8e132ec - <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll::hd774630cd39b8642
28: 0xaaaac8e12b88 - tokio::runtime::task::raw::poll::hc33b78532c44772c
29: 0xaaaac6fb00b4 - tokio::runtime::scheduler::multi_thread::worker::Context::run_task::hbfa7fdd8cfe3f82d
30: 0xaaaac6fb5000 - tokio::runtime::task::raw::poll::h585308cd792d3405
31: 0xaaaac6f9a6b8 - std::sys_common::backtrace::__rust_begin_short_backtrace::h5c4c394bb284ddb5
32: 0xaaaac6f9a3a4 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hc419499b44d1e34f
33: 0xaaaac6f6f0a8 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hd6a6c972c6220bcb
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/boxed.rs:1993:9
34: 0xaaaac6f6f0a8 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h3b759ee01954e7b0
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/boxed.rs:1993:9
35: 0xaaaac6f6f0a8 - std::sys::unix::thread::Thread::new::thread_start::hd2c1311433b59ac2
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys/unix/thread.rs:108:17
36: 0xffffbc6bee18 - <unknown>
37: 0xffffbc727e9c - <unknown>
38: 0x0 - <unknown>
2023-11-20T08:32:01.454452Z ERROR sink{component_kind="sink" component_id=block_s3 component_type=aws_s3 component_name=block_s3}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.
2023-11-20T08:32:01.454504Z INFO vector: Vector has stopped.
2023-11-20T08:32:01.455419Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="kafka_remap, kafka_json, block, kafka_block, seller_s3, buyer_s3, buyer, prometheus, seller" time_remaining="119 seconds left"
2023-11-20T08:32:02.276422Z ERROR sink{component_kind="sink" component_id=block_s3 component_type=aws_s3 component_name=block_s3}:sink{buffer_type="disk"}: vector_buffers::internal_events: Events dropped. count=18446744073708353028 intentional=false reason=corrupted_events stage=0
2023-11-20T08:32:06.455329Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="seller_s3, buyer_s3" time_remaining="114 seconds left"
2023-11-20T08:32:11.455850Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="buyer_s3" time_remaining="109 seconds left"
2023-11-20T08:32:16.455884Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="buyer_s3" time_remaining="104 seconds left"
2023-11-20T08:32:21.455404Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="buyer_s3" time_remaining="99 seconds left"
2023-11-20T08:32:26.455498Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="buyer_s3" time_remaining="94 seconds left"
2023-11-20T08:32:31.455543Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="buyer_s3" time_remaining="89 seconds left"
2023-11-20T08:32:36.455425Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="buyer_s3" time_remaining="84 seconds left"
2023-11-20T08:32:41.455380Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="buyer_s3" time_remaining="79 seconds left"
2023-11-20T08:32:46.456226Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="buyer_s3" time_remaining="74 seconds left"
2023-11-20T08:32:51.455299Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="buyer_s3" time_remaining="69 seconds left"
╭─liyanjun at liyanjun's MacBook Pro in ~/code/rust 23-11-22 - 10:31:21
╰─○ k get po
NAME READY STATUS RESTARTS AGE
report-geoedge-5c768796b6-8wgrc 0/2 CrashLoopBackOff 424 (3m51s ago) 17h
report-geoedge-5c768796b6-hmpk7 2/2 Running 0 17h
vector-s3-normal-0 1/1 Running 2 (43h ago) 5d19h
vector-s3-normal-1 1/1 Running 1 (41h ago) 5d7h
vector-s3-normal-2 1/1 Running 2 (43h ago) 5d19h
vector-s3-normal-3 1/1 Running 0 10h
vector-s3-normal-4 1/1 Running 1 (46h ago) 5d1h
vector-s3-primary-0 1/1 Running 1 (14h ago) 18h
vector-s3-primary-1 1/1 Running 0 10h
vector-s3-primary-2 1/1 Running 0 143m
vector-s3-primary-3 1/1 Running 0 19h
╭─liyanjun at liyanjun's MacBook Pro in ~/code/rust 23-11-22 - 10:31:47
╰─○ k logs vector-s3-primary-0 -p
2023-11-21T07:33:00.820662Z INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=info,rdkafka=info,buffers=info,lapin=info,kube=info"
2023-11-21T07:33:00.821223Z INFO vector::config::watcher: Creating configuration file watcher.
2023-11-21T07:33:00.821417Z INFO vector::config::watcher: Watching configuration files.
2023-11-21T07:33:00.821489Z INFO vector::app: Loading configs. paths=["/etc/vector"]
2023-11-21T07:33:01.921578Z INFO vector::topology::running: Running healthchecks.
2023-11-21T07:33:01.921740Z INFO vector: Vector has started. debug="false" version="0.33.0" arch="aarch64" revision="89605fb 2023-09-27 14:18:24.180809939"
2023-11-21T07:33:01.922879Z INFO vector::internal_events::api: API server running. address=127.0.0.1:8686 playground=off
2023-11-21T07:33:01.927194Z INFO vector::topology::builder: Healthcheck disabled.
2023-11-21T07:33:02.043988Z INFO vector::topology::builder: Healthcheck passed.
2023-11-21T07:33:02.047747Z INFO vector::topology::builder: Healthcheck passed.
thread 'vector-worker' panicked at 'Reader encountered unrecoverable error: Io { source: Os { code: 2, kind: NotFound, message: "No such file or directory" } }', /project/lib/vector-buffers/src/topology/channel/receiver.rs:59:33
stack backtrace:
0: 0xaaaad70594a0 - std::backtrace_rs::backtrace::libunwind::trace::hc366d13e3075cfeb
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
1: 0xaaaad70594a0 - std::backtrace_rs::backtrace::trace_unsynchronized::hb9a35e1f782586fe
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0xaaaad70594a0 - std::sys_common::backtrace::_print_fmt::hae0d6c377f414e8d
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:65:5
3: 0xaaaad70594a0 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h0f8916ea7cac138e
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:44:22
4: 0xaaaad6828860 - core::fmt::rt::Argument::fmt::hf491ec0da00c9b40
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/rt.rs:138:9
5: 0xaaaad6828860 - core::fmt::write::h075f235dea7f2ae2
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/mod.rs:1094:21
6: 0xaaaad702107c - std::io::Write::write_fmt::hed461f96db027bb0
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/io/mod.rs:1714:15
7: 0xaaaad705c104 - std::sys_common::backtrace::_print::h9f6cc87d40ba87f7
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:47:5
8: 0xaaaad705c104 - std::sys_common::backtrace::print::hc36324e2f472d029
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:34:9
9: 0xaaaad705bd04 - std::panicking::default_hook::{{closure}}::h8d8f8741b904046c
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:269:22
10: 0xaaaad705cc2c - std::panicking::default_hook::h6fd0090c1af86ae9
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:288:9
11: 0xaaaad705cc2c - std::panicking::rust_panic_with_hook::h25082e1f49b04873
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:705:13
12: 0xaaaad705c760 - std::panicking::begin_panic_handler::{{closure}}::hd64c14bd1869359d
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:597:13
13: 0xaaaad705c6d4 - std::sys_common::backtrace::__rust_end_short_backtrace::h8613664e493d4a0e
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:151:18
14: 0xaaaad705c6c8 - rust_begin_unwind
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:593:5
15: 0xaaaad626ff10 - core::panicking::panic_fmt::h354dad60a2e185d5
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/panicking.rs:67:14
16: 0xaaaad8befef4 - vector_buffers::topology::channel::receiver::ReceiverAdapter<T>::next::{{closure}}::hd01ea73e57558309
17: 0xaaaad8bef180 - vector_buffers::topology::channel::receiver::BufferReceiver<T>::next::{{closure}}::hb2fd72c713890c51
18: 0xaaaad8f23ae0 - vector_buffers::topology::channel::receiver::make_recv_future::{{closure}}::h8bde19c02063e358
19: 0xaaaad8f41bf4 - <vector_buffers::topology::channel::receiver::BufferReceiverStream<T> as futures_core::stream::Stream>::poll_next::hc22bfbb6df1b82e8
20: 0xaaaad8f412b8 - <vector::utilization::Utilization<S> as futures_core::stream::Stream>::poll_next::h797c2c8c6db0517e
21: 0xaaaad8f40d18 - <stream_cancel::combinator::TakeUntilIf<S,F> as futures_core::stream::Stream>::poll_next::h4d57577fca8bce3f
22: 0xaaaad726fc94 - <futures_util::stream::stream::FlatMap<St,U,F> as futures_core::stream::Stream>::poll_next::h317498fdf5ec7754
23: 0xaaaad758b570 - <vector_core::stream::partitioned_batcher::PartitionedBatcher<St,Prt,KT> as futures_core::stream::Stream>::poll_next::hcaafe2d3da438c2f
24: 0xaaaad7580a90 - <vector::sinks::s3_common::sink::S3Sink<Svc,RB> as vector_core::sink::StreamSink<vector_core::event::Event>>::run::{{closure}}::h62aa3ccbf71cb5c0
25: 0xaaaad757e77c - <vector_core::sink::EventStream<T> as vector_core::sink::StreamSink<vector_core::event::array::EventArray>>::run::{{closure}}::h8a41f58bf9c39efa
26: 0xaaaad8f3ff38 - vector::topology::builder::Builder::build_sinks::{{closure}}::{{closure}}::hb14f5ad14bc172c1
27: 0xaaaad8f032ec - <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll::hd774630cd39b8642
28: 0xaaaad8f02b88 - tokio::runtime::task::raw::poll::hc33b78532c44772c
29: 0xaaaad70a00b4 - tokio::runtime::scheduler::multi_thread::worker::Context::run_task::hbfa7fdd8cfe3f82d
30: 0xaaaad70a5000 - tokio::runtime::task::raw::poll::h585308cd792d3405
31: 0xaaaad708a6b8 - std::sys_common::backtrace::__rust_begin_short_backtrace::h5c4c394bb284ddb5
32: 0xaaaad708a3a4 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hc419499b44d1e34f
33: 0xaaaad705f0a8 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hd6a6c972c6220bcb
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/boxed.rs:1993:9
34: 0xaaaad705f0a8 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h3b759ee01954e7b0
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/boxed.rs:1993:9
35: 0xaaaad705f0a8 - std::sys::unix::thread::Thread::new::thread_start::hd2c1311433b59ac2
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys/unix/thread.rs:108:17
36: 0xffff88fbee18 - <unknown>
37: 0xffff89027e9c - <unknown>
38: 0x0 - <unknown>
2023-11-21T11:50:31.597101Z ERROR sink{component_kind="sink" component_id=request_s3 component_type=aws_s3 component_name=request_s3}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.
2023-11-21T11:50:31.597150Z INFO vector: Vector has stopped.
2023-11-21T11:50:31.599252Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="report, report_s3, kafka_remap, request, kafka_report, kafka_request" time_remaining="119 seconds left"
2023-11-21T11:50:31.857252Z ERROR sink{component_kind="sink" component_id=request_s3 component_type=aws_s3 component_name=request_s3}:sink{buffer_type="disk"}: vector_buffers::internal_events: Events dropped. count=18446744073709401251 intentional=false reason=corrupted_events stage=0
2023-11-21T11:50:36.598370Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="report_s3" time_remaining="114 seconds left"
2023-11-21T11:50:41.598278Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="report_s3" time_remaining="109 seconds left"
2023-11-21T11:50:46.597939Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="report_s3" time_remaining="104 seconds left"
@yanjun-li @Abhinav04 Is it possible that multiple Vector pods are sharing the same persistent volume?
@StephenWakely In our case it is not possible. We have only one pod per node and only one volume gets attached to the node. Hence volumes are not shared amongst pods.
@StephenWakely No, each pod has a different persistent volume.
We have also observed the same errors with the S3 sink while running Vector 0.35.0:
thread 'vector-worker' panicked at 'Reader encountered unrecoverable error: Io { source: Os { code: 2, kind: NotFound, message: "No such file or directory" } }', /project/lib/vector-buffers/src/topology/channel/receiver.rs:59:33
In our case, the data_dir is also pointing to a persistent volume in EKS, and each pod has its own dedicated volume since we use StatefulSets.
We also encountered this problem, and there was nothing abnormal in journald. It's a very strange issue and I have no idea why it happened. Could it have anything to do with the kernel or file system?
thread 'vector-worker' panicked at /project/lib/vector-buffers/src/topology/channel/receiver.rs:59:33:
Reader encountered unrecoverable error: Io { source: Os { code: 2, kind: NotFound, message: "No such file or directory" } }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2024-09-08T22:30:18.908466Z ERROR sink{component_kind="sink" component_id=gcp_vevct component_type=gcp_cloud_storage}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.
We are also having this issue (EKS, PV). Here's the stack trace I got before it errored. For context: this happened after we had a corrupted buffer, which I cleared manually because the pod was in a backoff loop; there were then a lot of events to send to S3, but Vector suddenly dies and ends up in another backoff loop.
Our pipeline is file source -> transform -> s3.
The only workaround I've found so far to get this working is deleting the buffer, the checkpoints, and all the files in the directory that the file source is watching.
2025-05-01T11:01:36.137142Z ERROR sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector_common::finalizer: FinalizerSet task ended prematurely. error=channel closed
thread 'vector-worker' panicked at /project/lib/vector-buffers/src/topology/channel/receiver.rs:59:33:
Reader encountered unrecoverable error: Io { source: Custom { kind: Other, error: "task was cancelled" } }
stack backtrace:
0: 0x5573633a66d0 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::hfc616348d9ad0abc
1: 0x5573627a7383 - core::fmt::write::h7ca648217bc79799
2: 0x5573633a5c02 - std::io::Write::write_fmt::h7960c58bfa5ccbcb
3: 0x5573633a6493 - std::sys::backtrace::BacktraceLock::print::h3fb349e80cbe0423
4: 0x5573633a5a00 - std::panicking::rust_panic_with_hook::h66e909d048c263a9
5: 0x5573633ed5f8 - std::panicking::begin_panic_handler::{{closure}}::h8d9aa8be7e8634cf
6: 0x5573633ed559 - std::sys::backtrace::__rust_end_short_backtrace::h7d7e47ef99abf6aa
7: 0x5573633f09cc - rust_begin_unwind
8: 0x557361d9cc5f - core::panicking::panic_fmt::hf8ffc7c15bfb58a0
9: 0x5573660a5722 - vector_buffers::topology::channel::receiver::ReceiverAdapter<T>::next::{{closure}}::h3da564a569227dea
10: 0x5573660a4971 - vector_buffers::topology::channel::receiver::BufferReceiver<T>::next::{{closure}}::hf652ec9c9f8e203c
11: 0x5573660a458b - vector_buffers::topology::channel::receiver::make_recv_future::{{closure}}::h220a329c70b5f3d2
12: 0x5573660ee1c9 - <vector_buffers::topology::channel::receiver::BufferReceiverStream<T> as futures_core::stream::Stream>::poll_next::h49341b308b496c89
13: 0x5573660ed61d - <vector::utilization::Utilization<S> as futures_core::stream::Stream>::poll_next::hffc2b01712b7cf85
14: 0x5573660ed09c - <stream_cancel::combinator::TakeUntilIf<S,F> as futures_core::stream::Stream>::poll_next::h823d85b19af45836
15: 0x55736367208d - <futures_util::stream::stream::FlatMap<St,U,F> as futures_core::stream::Stream>::poll_next::ha6304aaabfdd13c6
16: 0x557363aee935 - <futures_util::stream::stream::filter_map::FilterMap<St,Fut,F> as futures_core::stream::Stream>::poll_next::h8c4b6f234c98a4c7
17: 0x557363adf3d1 - vector_stream::driver::Driver<St,Svc>::run::{{closure}}::h1ba573fcfd3fe74c
18: 0x557363addab3 - <vector::sinks::s3_common::sink::S3Sink<Svc,RB,P> as vector_core::sink::StreamSink<vector_core::event::Event>>::run::{{closure}}::h9e7fc4e824272520
19: 0x557363adcb05 - <vector_core::sink::EventStream<T> as vector_core::sink::StreamSink<vector_core::event::array::EventArray>>::run::{{closure}}::h4d8bd5527e1df817
20: 0x5573660ebdf6 - vector::topology::builder::Builder::build_sinks::{{closure}}::{{closure}}::hc3a15dfaf978ce2a
21: 0x55736606d6c0 - <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll::hbee5bc54111bb986
22: 0x55736606c409 - tokio::runtime::task::raw::poll::h39de2473018cf16b
23: 0x55736345ea50 - tokio::runtime::scheduler::multi_thread::worker::Context::run_task::hdf61bbc159af48d6
24: 0x557363468482 - tokio::runtime::task::raw::poll::heba8bf734208621e
25: 0x55736343a969 - std::sys::backtrace::__rust_begin_short_backtrace::hd06e010e177c7d50
26: 0x557363442bd9 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hae7d1daa6693ad7c
27: 0x5573633f339b - std::sys::pal::unix::thread::Thread::new::thread_start::hc1bbaaa2227026bb
28: 0x7ffb154e21f5 - <unknown>
29: 0x7ffb1556289c - <unknown>
30: 0x0 - <unknown>
2025-05-01T11:01:36.434904Z ERROR sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.
To add more context, often the errors look like this and then Vector gracefully shuts down.
2025-05-01T11:29:40.733385Z ERROR transform{component_kind="transform" component_id=json_parser component_type=remap}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
2025-05-01T11:29:41.081633Z ERROR transform{component_kind="transform" component_id=json_parser component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: writer entered inconsistent state: failed to decode record immediately after encoding it.
2025-05-01T11:29:41.081698Z INFO vector: Vector has stopped.
2025-05-01T11:29:41.081701Z ERROR source{component_kind="source" component_id=tail component_type=file}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
2025-05-01T11:29:41.083055Z ERROR vector::internal_events::common: Failed to forward event(s), downstream is closed. error_code="stream_closed" error_type="writer_failed" stage="sending" internal_log_rate_limit=true
2025-05-01T11:29:41.083076Z ERROR vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=0 reason="Downstream is closed." internal_log_rate_limit=true
2025-05-01T11:29:41.087273Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="metrics, json_parser, dropped_files, filter_metrics, prometheus, s3, tail" time_remaining="59 seconds left"
2025-05-01T11:29:46.083372Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="54 seconds left"
Hey all,
Wanted to pop in and say I'm seeing this issue as well; it happened on 4 or 5 nodes twice in the last two days. I have a vector sink and the aws_s3 sink. This is also running on EC2 nodes (no containers) with mounted EBS volumes.
I wanted to add some more info here: since moving from the disk buffer to a memory buffer we have had no issues, with exactly the same config except the disk buffer block removed.
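In case it helps others, that swap only touches the sink's buffer block. For the cloud_storage sink in the original config it would look roughly like this (a sketch; max_events is illustrative, and deleting the buffer block entirely falls back to Vector's default in-memory buffer of 500 events):

sinks:
  cloud_storage:
    type: gcp_cloud_storage
    # ... rest of the sink unchanged ...
    buffer:
      # previously: type: disk, max_size: 4294967296, when_full: block
      type: memory
      max_events: 10000   # illustrative
      when_full: block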