
gcp_cloud_storage sink: thread 'vector-worker' panicked at 'Reader encountered unrecoverable error: Io' - Vector Aggregator

Open Abhinav04 opened this issue 2 years ago • 17 comments

Problem

The Vector aggregator intermittently panics and is restarted for the gcp_cloud_storage sink.

Vector Version

vector 0.26.0

Log

2023-11-03T08:16:57.126512Z  INFO vector::topology::builder: Healthcheck passed.
2023-11-03T08:16:57.144758Z  INFO vector::topology::builder: Healthcheck passed.
2023-11-03T08:16:57.160504Z  INFO vector::topology::builder: Healthcheck passed.
2023-11-03T13:47:06.246087Z ERROR sink{component_kind="sink" component_id=cloud_storage component_type=gcp_cloud_storage component_name=cloud_storage}:sink{buffer_type="disk"}: vector_buffers::internal_events: Events dropped. count=18446744073709487194 intentional=false reason=corrupted_events stage=0
thread 'vector-worker' panicked at 'Reader encountered unrecoverable error: Io { source: Os { code: 2, kind: NotFound, message: "No such file or directory" } }', /project/lib/vector-buffers/src/topology/channel/receiver.rs:59:33
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2023-11-03T13:47:08.979414Z ERROR sink{component_kind="sink" component_id=cloud_storage component_type=gcp_cloud_storage component_name=cloud_storage}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.
2023-11-03T13:47:08.979575Z  INFO vector: Vector has stopped.

Additional Context

Flow:

Kubernetes Pods --> Vector Agent --> Vector Aggregator --> Google Cloud Storage Bucket

The Vector aggregator runs fine for hours at a stretch, then fails intermittently with the above error and is restarted. The nodes running the aggregator pods have ample CPU and memory available.

Configuration

customConfig: 
  data_dir: /vector-data-dir
  api:
    enabled: true
    address: 0.0.0.0:8686
    playground: false
  sources:
    agent:
      address: 0.0.0.0:6000
      type: vector
    vector_aggregator_logs:
      type: "internal_logs"
  transforms:
    agent_remap_logs:
      type: remap
      inputs:
        - agent
      source: |-
        structured, err = parse_json(.message)
        if err == null {
          ., err = merge(., structured)
          if err == null {
            del(.message)
          }
        }
    application_logs:
      type: filter
      inputs:
        - agent_remap_logs
      condition:
        type: "vrl"
        source: " .source_type == \"kubernetes_logs\""
    vector_agent_logs:
      type: filter
      inputs:
        - application_logs
      condition:
        type: "vrl"
        source: " .source_type == \"internal_logs\""
  sinks:
    cloud_storage:
      type: gcp_cloud_storage
      inputs: [ application_logs ]
      compression: gzip
      buffer:
        max_size: 4294967296
        type: disk
        when_full: block
      batch:
        max_bytes: 209715200
        timeout_secs: 60
      request:
        concurrency: adaptive
        rate_limit_duration_secs: 60
      encoding:
        codec: json
      framing:
        method: "newline_delimited"
      metadata:
        Content-Encoding: ''
      bucket: abhinav-new-logs-bucket
      key_prefix: 'k8s/year=%Y/month=%m/day=%d/hour=%H/%M_'


Version

vector-0.26.0

Abhinav04 avatar Nov 03 '23 14:11 Abhinav04

The error is likely occurring because the buffer file that Vector is using ceases to exist. How is /vector-data-dir mounted to the pod?
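For reference, `Os { code: 2, kind: NotFound }` is exactly what the standard library reports when a file is opened after being deleted out from under the process. A minimal sketch (the buffer path below is a made-up example, not Vector's actual on-disk layout):

```rust
use std::fs::File;
use std::io::ErrorKind;

fn main() {
    // Hypothetical buffer data file that no longer exists on disk.
    let err = File::open("/vector-data-dir/buffer/cloud_storage/buffer-data-0.dat")
        .expect_err("file should not exist");
    // This is the same error kind and OS code seen in the panic message.
    assert_eq!(err.kind(), ErrorKind::NotFound);
    assert_eq!(err.raw_os_error(), Some(2)); // ENOENT: "No such file or directory"
}
```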

StephenWakely avatar Nov 03 '23 14:11 StephenWakely

The error is likely occurring because the buffer file that Vector is using ceases to exist. How is /vector-data-dir mounted to the pod?

@StephenWakely /vector-data-dir is mounted in the form of a persistent disk (Persistent Volume). Vector aggregator is deployed as a stateful set backed by persistent disks.

Abhinav04 avatar Nov 03 '23 15:11 Abhinav04

Can you think of any reason why Vector might lose access to that disk? When Vector is restarted, do you see any data loss?

StephenWakely avatar Nov 03 '23 15:11 StephenWakely

@StephenWakely We have a huge volume of data flowing through, so it is difficult for us to confirm whether data was lost during that particular restart. However, I assume the log line below hints at data loss: vector_buffers::internal_events: Events dropped. count=18446744073709487194 intentional=false reason=corrupted_events stage=0

As far as I understand, Vector should not lose access to the disk. It is a GCP-managed Persistent Volume, so there is minimal chance of it becoming unavailable.

Abhinav04 avatar Nov 03 '23 16:11 Abhinav04

The count is a bit suspicious. That's a lot of events to drop!
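Both reported counts sit just below u64::MAX, which is what a small negative value looks like after unsigned wraparound. A minimal sketch in plain Rust (this is arithmetic only, not Vector's actual event-accounting code):

```rust
fn main() {
    // Dropped-events count from the first log line above.
    let reported: u64 = 18_446_744_073_709_487_194;
    // Distance below 2^64, i.e. the magnitude of the wrapped negative value.
    let delta = (u64::MAX - reported) + 1;
    assert_eq!(delta, 64_422);
    // The same huge number falls out of wrapping (two's-complement) subtraction.
    assert_eq!(0u64.wrapping_sub(delta), reported);
    println!("reported count = -{delta} interpreted as u64");
}
```

So the "18 quintillion dropped events" plausibly corresponds to an underflow of roughly 64 thousand, not an astronomical drop.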

Are you able to confirm if this error occurs on the latest version of Vector (0.33.1)? 0.26 is quite old.

StephenWakely avatar Nov 07 '23 15:11 StephenWakely

@StephenWakely We have multiple other clusters running Vector 0.32.0, and a similar error is seen on most of them.

Here's a log snippet from one of the clusters:

2023-11-04T15:35:24.377344Z  INFO vector::app: Log level is enabled. level="INFO"
2023-11-04T15:35:24.377372Z  WARN vector::app: DEPRECATED The openssl legacy provider provides algorithms and key sizes no longer recommended for use.
2023-11-04T15:35:24.377676Z  INFO vector::app: Loaded openssl legacy provider.
2023-11-04T15:35:24.378435Z  INFO vector::config::watcher: Creating configuration file watcher.
2023-11-04T15:35:24.378790Z  INFO vector::config::watcher: Watching configuration files.
2023-11-04T15:35:24.378873Z  INFO vector::app: Loading configs. paths=["/etc/vector"]
2023-11-04T15:35:24.905010Z  INFO vector::topology::running: Running healthchecks.
2023-11-04T15:35:24.905209Z  INFO vector: Vector has started. debug="false" version="0.32.0" arch="x86_64" revision="1b403e1 2023-08-15 14:56:36.089460954"
2023-11-04T15:35:24.905270Z  INFO source{component_kind="source" component_id=agent component_type=vector component_name=agent}: vector::sources::util::grpc: Building gRPC server. address=0.0.0.0:6000
2023-11-04T15:35:24.907948Z  INFO vector::internal_events::api: API server running. address=0.0.0.0:8686 playground=off
2023-11-04T15:35:26.282392Z  INFO vector::topology::builder: Healthcheck passed.
2023-11-04T15:35:26.311784Z  INFO vector::topology::builder: Healthcheck passed.
2023-11-04T15:35:26.409646Z  INFO vector::topology::builder: Healthcheck passed.
2023-11-07T06:45:57.670473Z  WARN sink{component_kind="sink" component_id=cloud_storage component_type=gcp_cloud_storage component_name=cloud_storage}:request{request_id=122254}: vector::sinks::util::retries: Retrying after response. reason=500 Internal Server Error internal_log_rate_limit=true
thread 'vector-worker' panicked at 'Reader encountered unrecoverable error: Io { source: Os { code: 2, kind: NotFound, message: "No such file or directory" } }', /project/lib/vector-buffers/src/topology/channel/receiver.rs:59:33
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2023-11-07T09:57:20.565298Z ERROR sink{component_kind="sink" component_id=cloud_storage component_type=gcp_cloud_storage component_name=cloud_storage}:sink{buffer_type="disk"}: vector_buffers::internal_events: Events dropped. count=18446744073709508238 intentional=false reason=corrupted_events stage=0
2023-11-07T09:57:21.474346Z ERROR sink{component_kind="sink" component_id=cloud_storage component_type=gcp_cloud_storage component_name=cloud_storage}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.
2023-11-07T09:57:21.474457Z  INFO vector: Vector has stopped.

Abhinav04 avatar Nov 08 '23 04:11 Abhinav04

I also encountered this problem when using s3 sink. Any suggestions?

yanjun-li avatar Nov 21 '23 07:11 yanjun-li

I also encountered this problem when using s3 sink. Any suggestions?

The issue is not specific to any particular sink. It seems to occur when Vector loses access to the file it is storing the buffer in. Can you provide more details about the volume that data_dir points to?

StephenWakely avatar Nov 21 '23 13:11 StephenWakely

@StephenWakely The data_dir is pointing to a persistent volume in EKS.

Here is the log:

2023-11-16T19:34:55.434011Z  INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=info,rdkafka=info,buffers=info,lapin=info,kube=info"
2023-11-16T19:34:55.434634Z  INFO vector::config::watcher: Creating configuration file watcher.
2023-11-16T19:34:55.434823Z  INFO vector::config::watcher: Watching configuration files.
2023-11-16T19:34:55.434867Z  INFO vector::app: Loading configs. paths=["/etc/vector"]
2023-11-16T19:34:58.313122Z  INFO vector::topology::running: Running healthchecks.
2023-11-16T19:34:58.313284Z  INFO vector: Vector has started. debug="false" version="0.33.0" arch="aarch64" revision="89605fb 2023-09-27 14:18:24.180809939"
2023-11-16T19:34:58.314661Z  INFO vector::internal_events::api: API server running. address=127.0.0.1:8686 playground=off
2023-11-16T19:34:58.318700Z  INFO vector::topology::builder: Healthcheck disabled.
2023-11-16T19:34:58.486112Z  INFO vector::topology::builder: Healthcheck passed.
2023-11-16T19:34:58.486309Z  INFO vector::topology::builder: Healthcheck passed.
2023-11-16T19:34:58.493455Z  INFO vector::topology::builder: Healthcheck passed.
2023-11-17T06:48:24.361852Z  WARN sink{component_kind="sink" component_id=block_s3 component_type=aws_s3 component_name=block_s3}:request{request_id=2568}: vector::sinks::util::retries: Retrying after error. error=dispatch failure internal_log_rate_limit=true
thread 'vector-worker' panicked at 'Reader encountered unrecoverable error: Io { source: Os { code: 2, kind: NotFound, message: "No such file or directory" } }', /project/lib/vector-buffers/src/topology/channel/receiver.rs:59:33
stack backtrace:
   0:     0xaaaac6f694a0 - std::backtrace_rs::backtrace::libunwind::trace::hc366d13e3075cfeb
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0xaaaac6f694a0 - std::backtrace_rs::backtrace::trace_unsynchronized::hb9a35e1f782586fe
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0xaaaac6f694a0 - std::sys_common::backtrace::_print_fmt::hae0d6c377f414e8d
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:65:5
   3:     0xaaaac6f694a0 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h0f8916ea7cac138e
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:44:22
   4:     0xaaaac6738860 - core::fmt::rt::Argument::fmt::hf491ec0da00c9b40
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/rt.rs:138:9
   5:     0xaaaac6738860 - core::fmt::write::h075f235dea7f2ae2
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/mod.rs:1094:21
   6:     0xaaaac6f3107c - std::io::Write::write_fmt::hed461f96db027bb0
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/io/mod.rs:1714:15
   7:     0xaaaac6f6c104 - std::sys_common::backtrace::_print::h9f6cc87d40ba87f7
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:47:5
   8:     0xaaaac6f6c104 - std::sys_common::backtrace::print::hc36324e2f472d029
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:34:9
   9:     0xaaaac6f6bd04 - std::panicking::default_hook::{{closure}}::h8d8f8741b904046c
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:269:22
  10:     0xaaaac6f6cc2c - std::panicking::default_hook::h6fd0090c1af86ae9
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:288:9
  11:     0xaaaac6f6cc2c - std::panicking::rust_panic_with_hook::h25082e1f49b04873
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:705:13
  12:     0xaaaac6f6c760 - std::panicking::begin_panic_handler::{{closure}}::hd64c14bd1869359d
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:597:13
  13:     0xaaaac6f6c6d4 - std::sys_common::backtrace::__rust_end_short_backtrace::h8613664e493d4a0e
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:151:18
  14:     0xaaaac6f6c6c8 - rust_begin_unwind
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:593:5
  15:     0xaaaac617ff10 - core::panicking::panic_fmt::h354dad60a2e185d5
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/panicking.rs:67:14
  16:     0xaaaac8affef4 - vector_buffers::topology::channel::receiver::ReceiverAdapter<T>::next::{{closure}}::hd01ea73e57558309
  17:     0xaaaac8aff180 - vector_buffers::topology::channel::receiver::BufferReceiver<T>::next::{{closure}}::hb2fd72c713890c51
  18:     0xaaaac8e33ae0 - vector_buffers::topology::channel::receiver::make_recv_future::{{closure}}::h8bde19c02063e358
  19:     0xaaaac8e51bf4 - <vector_buffers::topology::channel::receiver::BufferReceiverStream<T> as futures_core::stream::Stream>::poll_next::hc22bfbb6df1b82e8
  20:     0xaaaac8e512b8 - <vector::utilization::Utilization<S> as futures_core::stream::Stream>::poll_next::h797c2c8c6db0517e
  21:     0xaaaac8e50d18 - <stream_cancel::combinator::TakeUntilIf<S,F> as futures_core::stream::Stream>::poll_next::h4d57577fca8bce3f
  22:     0xaaaac717fc94 - <futures_util::stream::stream::FlatMap<St,U,F> as futures_core::stream::Stream>::poll_next::h317498fdf5ec7754
  23:     0xaaaac749b570 - <vector_core::stream::partitioned_batcher::PartitionedBatcher<St,Prt,KT> as futures_core::stream::Stream>::poll_next::hcaafe2d3da438c2f
  24:     0xaaaac7490a90 - <vector::sinks::s3_common::sink::S3Sink<Svc,RB> as vector_core::sink::StreamSink<vector_core::event::Event>>::run::{{closure}}::h62aa3ccbf71cb5c0
  25:     0xaaaac748e77c - <vector_core::sink::EventStream<T> as vector_core::sink::StreamSink<vector_core::event::array::EventArray>>::run::{{closure}}::h8a41f58bf9c39efa
  26:     0xaaaac8e4ff38 - vector::topology::builder::Builder::build_sinks::{{closure}}::{{closure}}::hb14f5ad14bc172c1
  27:     0xaaaac8e132ec - <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll::hd774630cd39b8642
  28:     0xaaaac8e12b88 - tokio::runtime::task::raw::poll::hc33b78532c44772c
  29:     0xaaaac6fb00b4 - tokio::runtime::scheduler::multi_thread::worker::Context::run_task::hbfa7fdd8cfe3f82d
  30:     0xaaaac6fb5000 - tokio::runtime::task::raw::poll::h585308cd792d3405
  31:     0xaaaac6f9a6b8 - std::sys_common::backtrace::__rust_begin_short_backtrace::h5c4c394bb284ddb5
  32:     0xaaaac6f9a3a4 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hc419499b44d1e34f
  33:     0xaaaac6f6f0a8 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hd6a6c972c6220bcb
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/boxed.rs:1993:9
  34:     0xaaaac6f6f0a8 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h3b759ee01954e7b0
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/boxed.rs:1993:9
  35:     0xaaaac6f6f0a8 - std::sys::unix::thread::Thread::new::thread_start::hd2c1311433b59ac2
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys/unix/thread.rs:108:17
  36:     0xffffbc6bee18 - <unknown>
  37:     0xffffbc727e9c - <unknown>
  38:                0x0 - <unknown>
2023-11-20T08:32:01.454452Z ERROR sink{component_kind="sink" component_id=block_s3 component_type=aws_s3 component_name=block_s3}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.
2023-11-20T08:32:01.454504Z  INFO vector: Vector has stopped.
2023-11-20T08:32:01.455419Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="kafka_remap, kafka_json, block, kafka_block, seller_s3, buyer_s3, buyer, prometheus, seller" time_remaining="119 seconds left"
2023-11-20T08:32:02.276422Z ERROR sink{component_kind="sink" component_id=block_s3 component_type=aws_s3 component_name=block_s3}:sink{buffer_type="disk"}: vector_buffers::internal_events: Events dropped. count=18446744073708353028 intentional=false reason=corrupted_events stage=0
2023-11-20T08:32:06.455329Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="seller_s3, buyer_s3" time_remaining="114 seconds left"
2023-11-20T08:32:11.455850Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="buyer_s3" time_remaining="109 seconds left"
2023-11-20T08:32:16.455884Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="buyer_s3" time_remaining="104 seconds left"
2023-11-20T08:32:21.455404Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="buyer_s3" time_remaining="99 seconds left"
2023-11-20T08:32:26.455498Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="buyer_s3" time_remaining="94 seconds left"
2023-11-20T08:32:31.455543Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="buyer_s3" time_remaining="89 seconds left"
2023-11-20T08:32:36.455425Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="buyer_s3" time_remaining="84 seconds left"
2023-11-20T08:32:41.455380Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="buyer_s3" time_remaining="79 seconds left"
2023-11-20T08:32:46.456226Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="buyer_s3" time_remaining="74 seconds left"
2023-11-20T08:32:51.455299Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="buyer_s3" time_remaining="69 seconds left"
╭─liyanjun at liyanjun's MacBook Pro in ~/code/rust 23-11-22 - 10:31:21
╰─○ k get po
NAME                              READY   STATUS             RESTARTS          AGE
report-geoedge-5c768796b6-8wgrc   0/2     CrashLoopBackOff   424 (3m51s ago)   17h
report-geoedge-5c768796b6-hmpk7   2/2     Running            0                 17h
vector-s3-normal-0                1/1     Running            2 (43h ago)       5d19h
vector-s3-normal-1                1/1     Running            1 (41h ago)       5d7h
vector-s3-normal-2                1/1     Running            2 (43h ago)       5d19h
vector-s3-normal-3                1/1     Running            0                 10h
vector-s3-normal-4                1/1     Running            1 (46h ago)       5d1h
vector-s3-primary-0               1/1     Running            1 (14h ago)       18h
vector-s3-primary-1               1/1     Running            0                 10h
vector-s3-primary-2               1/1     Running            0                 143m
vector-s3-primary-3               1/1     Running            0                 19h
╭─liyanjun at liyanjun's MacBook Pro in ~/code/rust 23-11-22 - 10:31:47
╰─○ k logs vector-s3-primary-0 -p
2023-11-21T07:33:00.820662Z  INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=info,rdkafka=info,buffers=info,lapin=info,kube=info"
2023-11-21T07:33:00.821223Z  INFO vector::config::watcher: Creating configuration file watcher.
2023-11-21T07:33:00.821417Z  INFO vector::config::watcher: Watching configuration files.
2023-11-21T07:33:00.821489Z  INFO vector::app: Loading configs. paths=["/etc/vector"]
2023-11-21T07:33:01.921578Z  INFO vector::topology::running: Running healthchecks.
2023-11-21T07:33:01.921740Z  INFO vector: Vector has started. debug="false" version="0.33.0" arch="aarch64" revision="89605fb 2023-09-27 14:18:24.180809939"
2023-11-21T07:33:01.922879Z  INFO vector::internal_events::api: API server running. address=127.0.0.1:8686 playground=off
2023-11-21T07:33:01.927194Z  INFO vector::topology::builder: Healthcheck disabled.
2023-11-21T07:33:02.043988Z  INFO vector::topology::builder: Healthcheck passed.
2023-11-21T07:33:02.047747Z  INFO vector::topology::builder: Healthcheck passed.
thread 'vector-worker' panicked at 'Reader encountered unrecoverable error: Io { source: Os { code: 2, kind: NotFound, message: "No such file or directory" } }', /project/lib/vector-buffers/src/topology/channel/receiver.rs:59:33
stack backtrace:
   0:     0xaaaad70594a0 - std::backtrace_rs::backtrace::libunwind::trace::hc366d13e3075cfeb
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0xaaaad70594a0 - std::backtrace_rs::backtrace::trace_unsynchronized::hb9a35e1f782586fe
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0xaaaad70594a0 - std::sys_common::backtrace::_print_fmt::hae0d6c377f414e8d
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:65:5
   3:     0xaaaad70594a0 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h0f8916ea7cac138e
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:44:22
   4:     0xaaaad6828860 - core::fmt::rt::Argument::fmt::hf491ec0da00c9b40
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/rt.rs:138:9
   5:     0xaaaad6828860 - core::fmt::write::h075f235dea7f2ae2
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/mod.rs:1094:21
   6:     0xaaaad702107c - std::io::Write::write_fmt::hed461f96db027bb0
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/io/mod.rs:1714:15
   7:     0xaaaad705c104 - std::sys_common::backtrace::_print::h9f6cc87d40ba87f7
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:47:5
   8:     0xaaaad705c104 - std::sys_common::backtrace::print::hc36324e2f472d029
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:34:9
   9:     0xaaaad705bd04 - std::panicking::default_hook::{{closure}}::h8d8f8741b904046c
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:269:22
  10:     0xaaaad705cc2c - std::panicking::default_hook::h6fd0090c1af86ae9
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:288:9
  11:     0xaaaad705cc2c - std::panicking::rust_panic_with_hook::h25082e1f49b04873
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:705:13
  12:     0xaaaad705c760 - std::panicking::begin_panic_handler::{{closure}}::hd64c14bd1869359d
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:597:13
  13:     0xaaaad705c6d4 - std::sys_common::backtrace::__rust_end_short_backtrace::h8613664e493d4a0e
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:151:18
  14:     0xaaaad705c6c8 - rust_begin_unwind
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:593:5
  15:     0xaaaad626ff10 - core::panicking::panic_fmt::h354dad60a2e185d5
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/panicking.rs:67:14
  16:     0xaaaad8befef4 - vector_buffers::topology::channel::receiver::ReceiverAdapter<T>::next::{{closure}}::hd01ea73e57558309
  17:     0xaaaad8bef180 - vector_buffers::topology::channel::receiver::BufferReceiver<T>::next::{{closure}}::hb2fd72c713890c51
  18:     0xaaaad8f23ae0 - vector_buffers::topology::channel::receiver::make_recv_future::{{closure}}::h8bde19c02063e358
  19:     0xaaaad8f41bf4 - <vector_buffers::topology::channel::receiver::BufferReceiverStream<T> as futures_core::stream::Stream>::poll_next::hc22bfbb6df1b82e8
  20:     0xaaaad8f412b8 - <vector::utilization::Utilization<S> as futures_core::stream::Stream>::poll_next::h797c2c8c6db0517e
  21:     0xaaaad8f40d18 - <stream_cancel::combinator::TakeUntilIf<S,F> as futures_core::stream::Stream>::poll_next::h4d57577fca8bce3f
  22:     0xaaaad726fc94 - <futures_util::stream::stream::FlatMap<St,U,F> as futures_core::stream::Stream>::poll_next::h317498fdf5ec7754
  23:     0xaaaad758b570 - <vector_core::stream::partitioned_batcher::PartitionedBatcher<St,Prt,KT> as futures_core::stream::Stream>::poll_next::hcaafe2d3da438c2f
  24:     0xaaaad7580a90 - <vector::sinks::s3_common::sink::S3Sink<Svc,RB> as vector_core::sink::StreamSink<vector_core::event::Event>>::run::{{closure}}::h62aa3ccbf71cb5c0
  25:     0xaaaad757e77c - <vector_core::sink::EventStream<T> as vector_core::sink::StreamSink<vector_core::event::array::EventArray>>::run::{{closure}}::h8a41f58bf9c39efa
  26:     0xaaaad8f3ff38 - vector::topology::builder::Builder::build_sinks::{{closure}}::{{closure}}::hb14f5ad14bc172c1
  27:     0xaaaad8f032ec - <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll::hd774630cd39b8642
  28:     0xaaaad8f02b88 - tokio::runtime::task::raw::poll::hc33b78532c44772c
  29:     0xaaaad70a00b4 - tokio::runtime::scheduler::multi_thread::worker::Context::run_task::hbfa7fdd8cfe3f82d
  30:     0xaaaad70a5000 - tokio::runtime::task::raw::poll::h585308cd792d3405
  31:     0xaaaad708a6b8 - std::sys_common::backtrace::__rust_begin_short_backtrace::h5c4c394bb284ddb5
  32:     0xaaaad708a3a4 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hc419499b44d1e34f
  33:     0xaaaad705f0a8 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hd6a6c972c6220bcb
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/boxed.rs:1993:9
  34:     0xaaaad705f0a8 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h3b759ee01954e7b0
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/boxed.rs:1993:9
  35:     0xaaaad705f0a8 - std::sys::unix::thread::Thread::new::thread_start::hd2c1311433b59ac2
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys/unix/thread.rs:108:17
  36:     0xffff88fbee18 - <unknown>
  37:     0xffff89027e9c - <unknown>
  38:                0x0 - <unknown>
2023-11-21T11:50:31.597101Z ERROR sink{component_kind="sink" component_id=request_s3 component_type=aws_s3 component_name=request_s3}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.
2023-11-21T11:50:31.597150Z  INFO vector: Vector has stopped.
2023-11-21T11:50:31.599252Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="report, report_s3, kafka_remap, request, kafka_report, kafka_request" time_remaining="119 seconds left"
2023-11-21T11:50:31.857252Z ERROR sink{component_kind="sink" component_id=request_s3 component_type=aws_s3 component_name=request_s3}:sink{buffer_type="disk"}: vector_buffers::internal_events: Events dropped. count=18446744073709401251 intentional=false reason=corrupted_events stage=0
2023-11-21T11:50:36.598370Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="report_s3" time_remaining="114 seconds left"
2023-11-21T11:50:41.598278Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="report_s3" time_remaining="109 seconds left"
2023-11-21T11:50:46.597939Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="report_s3" time_remaining="104 seconds left"

yanjun-li avatar Nov 22 '23 02:11 yanjun-li

@yanjun-li @Abhinav04 Is it possible that multiple Vector pods are sharing the same persistent volume?

StephenWakely avatar Nov 23 '23 10:11 StephenWakely

@StephenWakely In our case that is not possible. We run one pod per node, and only one volume is attached to each node, so volumes are not shared among pods.

Abhinav04 avatar Nov 23 '23 12:11 Abhinav04

@StephenWakely No, each pod has a different persistent volume.

yanjun-li avatar Dec 04 '23 06:12 yanjun-li

We have also observed the same errors with the S3 sink while running Vector 0.35.0:

thread 'vector-worker' panicked at 'Reader encountered unrecoverable error: Io { source: Os { code: 2, kind: NotFound, message: "No such file or directory" } }', /project/lib/vector-buffers/src/topology/channel/receiver.rs:59:33

In our case, the data_dir is also pointing to a persistent volume in EKS, and each pod has its own dedicated volume since we use StatefulSets.

yalinglee avatar Jun 13 '24 04:06 yalinglee

We also encountered this problem, and there was nothing abnormal in journald. It is very strange; I have no idea why it happened. Could it be related to the kernel or the filesystem?

thread 'vector-worker' panicked at /project/lib/vector-buffers/src/topology/channel/receiver.rs:59:33:
Reader encountered unrecoverable error: Io { source: Os { code: 2, kind: NotFound, message: "No such file or directory" } }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2024-09-08T22:30:18.908466Z ERROR sink{component_kind="sink" component_id=gcp_vevct component_type=gcp_cloud_storage}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.

zaldnoay avatar Sep 09 '24 07:09 zaldnoay

We are also having this issue (EKS, PV). Here's the stack trace I got before it errored. For context: this happened after we had a corrupted buffer, which I manually cleared because the pod was in a backoff loop. There were then a lot of events to send to S3, but Vector suddenly died and entered another backoff loop.

Our pipeline is file source -> transform -> s3.

The only workaround I've found so far is deleting the buffer, the checkpoints, and all the files in the directory that the file source is watching.

2025-05-01T11:01:36.137142Z ERROR sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector_common::finalizer: FinalizerSet task ended prematurely. error=channel closed
thread 'vector-worker' panicked at /project/lib/vector-buffers/src/topology/channel/receiver.rs:59:33:
Reader encountered unrecoverable error: Io { source: Custom { kind: Other, error: "task was cancelled" } }
stack backtrace:
  0:     0x5573633a66d0 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::hfc616348d9ad0abc
  1:     0x5573627a7383 - core::fmt::write::h7ca648217bc79799                                                                                                                      
  2:     0x5573633a5c02 - std::io::Write::write_fmt::h7960c58bfa5ccbcb                                                                                                             
  3:     0x5573633a6493 - std::sys::backtrace::BacktraceLock::print::h3fb349e80cbe0423                                                                                             
  4:     0x5573633a5a00 - std::panicking::rust_panic_with_hook::h66e909d048c263a9                                                                                                  
  5:     0x5573633ed5f8 - std::panicking::begin_panic_handler::{{closure}}::h8d9aa8be7e8634cf                                                                                      
  6:     0x5573633ed559 - std::sys::backtrace::__rust_end_short_backtrace::h7d7e47ef99abf6aa                                                                                       
  7:     0x5573633f09cc - rust_begin_unwind                                                                                                                                        
  8:     0x557361d9cc5f - core::panicking::panic_fmt::hf8ffc7c15bfb58a0                                                                                                            
  9:     0x5573660a5722 - vector_buffers::topology::channel::receiver::ReceiverAdapter<T>::next::{{closure}}::h3da564a569227dea                                                   
  10:     0x5573660a4971 - vector_buffers::topology::channel::receiver::BufferReceiver<T>::next::{{closure}}::hf652ec9c9f8e203c                                                    
  11:     0x5573660a458b - vector_buffers::topology::channel::receiver::make_recv_future::{{closure}}::h220a329c70b5f3d2                                                           
  12:     0x5573660ee1c9 - <vector_buffers::topology::channel::receiver::BufferReceiverStream<T> as futures_core::stream::Stream>::poll_next::h49341b308b496c89                    
  13:     0x5573660ed61d - <vector::utilization::Utilization<S> as futures_core::stream::Stream>::poll_next::hffc2b01712b7cf85                                                     
  14:     0x5573660ed09c - <stream_cancel::combinator::TakeUntilIf<S,F> as futures_core::stream::Stream>::poll_next::h823d85b19af45836                                             
  15:     0x55736367208d - <futures_util::stream::stream::FlatMap<St,U,F> as futures_core::stream::Stream>::poll_next::ha6304aaabfdd13c6                                           
  16:     0x557363aee935 - <futures_util::stream::stream::filter_map::FilterMap<St,Fut,F> as futures_core::stream::Stream>::poll_next::h8c4b6f234c98a4c7                           
  17:     0x557363adf3d1 - vector_stream::driver::Driver<St,Svc>::run::{{closure}}::h1ba573fcfd3fe74c                                                                              
  18:     0x557363addab3 - <vector::sinks::s3_common::sink::S3Sink<Svc,RB,P> as vector_core::sink::StreamSink<vector_core::event::Event>>::run::{{closure}}::h9e7fc4e824272520     
  19:     0x557363adcb05 - <vector_core::sink::EventStream<T> as vector_core::sink::StreamSink<vector_core::event::array::EventArray>>::run::{{closure}}::h4d8bd5527e1df817        
  20:     0x5573660ebdf6 - vector::topology::builder::Builder::build_sinks::{{closure}}::{{closure}}::hc3a15dfaf978ce2a                                                            
  21:     0x55736606d6c0 - <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll::hbee5bc54111bb986                                                         
  22:     0x55736606c409 - tokio::runtime::task::raw::poll::h39de2473018cf16b                                                                                                      
  23:     0x55736345ea50 - tokio::runtime::scheduler::multi_thread::worker::Context::run_task::hdf61bbc159af48d6                                                                   
  24:     0x557363468482 - tokio::runtime::task::raw::poll::heba8bf734208621e                                                                                                      
  25:     0x55736343a969 - std::sys::backtrace::__rust_begin_short_backtrace::hd06e010e177c7d50                                                                                    
  26:     0x557363442bd9 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hae7d1daa6693ad7c                                                                                
  27:     0x5573633f339b - std::sys::pal::unix::thread::Thread::new::thread_start::hc1bbaaa2227026bb                                                                               
  28:     0x7ffb154e21f5 - <unknown>                                                                                                                                               
  29:     0x7ffb1556289c - <unknown>                                                                                                                                               
  30:                0x0 - <unknown>                                                                                                                                             
2025-05-01T11:01:36.434904Z ERROR sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted. 

To add more context: often the errors look like this, and then Vector gracefully shuts down.

2025-05-01T11:29:40.733385Z ERROR transform{component_kind="transform" component_id=json_parser component_type=remap}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
2025-05-01T11:29:41.081633Z ERROR transform{component_kind="transform" component_id=json_parser component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: writer entered inconsistent state: failed to decode record immediately after encoding it. 
2025-05-01T11:29:41.081698Z  INFO vector: Vector has stopped.
2025-05-01T11:29:41.081701Z ERROR source{component_kind="source" component_id=tail component_type=file}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
2025-05-01T11:29:41.083055Z ERROR vector::internal_events::common: Failed to forward event(s), downstream is closed. error_code="stream_closed" error_type="writer_failed" stage="sending" internal_log_rate_limit=true                                                                                                                                           
2025-05-01T11:29:41.083076Z ERROR vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=0 reason="Downstream is closed." internal_log_rate_limit=true                                                                                                                                                                   
2025-05-01T11:29:41.087273Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="metrics, json_parser, dropped_files, filter_metrics, prometheus, s3, tail" time_remaining="59 seconds left"                                                                                                                    
2025-05-01T11:29:46.083372Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="54 seconds left"

trahim avatar May 01 '25 11:05 trahim

Hey all,

Wanted to pop in and say I'm seeing this issue as well; it has happened on 4 or 5 nodes twice in the last two days. I have a vector sink and the aws_s3 sink. This is running on EC2 nodes (no containers) with mounted EBS volumes.

satellite-no avatar May 22 '25 21:05 satellite-no

I wanted to add some more info: since switching from the disk buffer to the memory buffer we have had no issues, with exactly the same config minus the disk buffer settings.
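For reference, that workaround amounts to a buffer stanza roughly like the one below (sink id and tuning values are illustrative; the memory buffer trades durability across restarts for stability):

```yaml
sinks:
  s3:
    type: aws_s3
    # ... bucket, auth, etc. unchanged ...
    buffer:
      type: memory      # was: type: disk (with max_size)
      max_events: 500   # Vector's default; raise for bursty traffic
      when_full: block  # apply back-pressure rather than dropping events
```

Note that with `type: memory`, any events still buffered when the process stops are lost, so this is a mitigation rather than a fix for the underlying disk-buffer panic.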

trahim avatar Jun 09 '25 09:06 trahim