Kafka sink failure with 0.35.0
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
We've upgraded our Vector pods from v0.30.0 (chart: 0.26.0) to v0.35.0 (chart: 0.30.0). Vector fails and the pods restart (CrashLoopBackOff).
It seems related to the kafka sink (see the logs below). I've tested v0.34.2 (chart: 0.29.3) successfully.
Configuration
---
log_schema:
  source_type: x-source_type
  timestamp_key: '@timestamp'
timezone: Etc/UTC
sources:
  gateway:
    type: logstash
    address: 0.0.0.0:5044
transforms:
  remap_source_gateway:
    type: remap
    inputs: [gateway]
    source: |
      %kafka_topic = "systemlogs"
sinks:
  kafka:
    type: kafka
    inputs: [remap_source_gateway]
    bootstrap_servers: ${KAFKA_HOST}
    topic: '{{ %kafka_topic }}'
    compression: none
    encoding:
      codec: json
    healthcheck:
      enabled: false
Version
0.35.0
Debug Output
2024-01-15T15:01:48.241263Z DEBUG source{component_kind="source" component_id=gateway component_type=logstash}:connection{peer_addr=10.89.18.95:56433}: vector::sources::util::net::tcp: Connection closed.
thread 'vector-worker' panicked at 'called `Option::unwrap()` on a `None` value', /cargo/registry/src/index.crates.io-6f17d22bba15001f/lockfree-object-pool-0.1.4/src/linear_page.rs:48:30
stack backtrace:
0: 0x128a36c - std::backtrace_rs::backtrace::libunwind::trace::hc7cb25c924fb6eba
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
1: 0x128a36c - std::backtrace_rs::backtrace::trace_unsynchronized::h5c4ce001d8eb6854
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0x128a36c - std::sys_common::backtrace::_print_fmt::h14d9ae8d6479967a
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:65:5
3: 0x128a36c - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h02c30a315588511a
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:44:22
4: 0xa13948 - core::fmt::rt::Argument::fmt::hd5ccf3711fc53a4f
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/rt.rs:138:9
5: 0xa13948 - core::fmt::write::h2876fd52ca55dbfa
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/mod.rs:1094:21
6: 0x12523e4 - std::io::Write::write_fmt::h3391f6b02b293028
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/io/mod.rs:1714:15
7: 0x128cfc4 - std::sys_common::backtrace::_print::h23980dcdfc2fe28a
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:47:5
8: 0x128cfc4 - std::sys_common::backtrace::print::h49b97a66db4af001
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:34:9
9: 0x128cbcc - std::panicking::default_hook::{{closure}}::h7138645749f92527
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:269:22
10: 0x128db04 - std::panicking::default_hook::h691846cb3f034db3
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:288:9
11: 0x128db04 - std::panicking::rust_panic_with_hook::hb3c04ea23b04721e
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:705:13
12: 0x128d604 - std::panicking::begin_panic_handler::{{closure}}::hc4edb1e6fd5a6b0f
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:595:13
13: 0x128d59c - std::sys_common::backtrace::__rust_end_short_backtrace::hb6a6277e20eb73e1
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:151:18
14: 0x128d590 - rust_begin_unwind
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:593:5
15: 0x424870 - core::panicking::panic_fmt::hfcaa20ee741d1cc5
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/panicking.rs:67:14
16: 0x42498c - core::panicking::panic::h31e7a12887ab7d97
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/panicking.rs:117:5
17: 0x3348b64 - <tracing_subscriber::layer::layered::Layered<L,S> as tracing_core::subscriber::Subscriber>::new_span::h9a00643ebd4f75b8
18: 0x334b124 - <tracing_subscriber::layer::layered::Layered<L,S> as tracing_core::subscriber::Subscriber>::new_span::h2c1f9010dda681f0
19: 0x13e1128 - tracing::span::Span::make_with::h0946854b8046d955
20: 0x1c5eda4 - <vector::sinks::kafka::sink::KafkaSink as vector_core::sink::StreamSink<vector_core::event::Event>>::run::{{closure}}::h21f35518df6934ca
21: 0x1c5c8e4 - <vector_core::sink::EventStream<T> as vector_core::sink::StreamSink<vector_core::event::array::EventArray>>::run::{{closure}}::ha31ca5ad7a9a3402
22: 0x320ecc8 - vector::topology::builder::Builder::build_sinks::{{closure}}::{{closure}}::h03ab451c62056c46
23: 0x31c5948 - <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll::heb356ec0acaec9ef
24: 0x31c53bc - tokio::runtime::task::raw::poll::hd3c4a0678998574d
25: 0x12d2128 - tokio::runtime::scheduler::multi_thread::worker::Context::run_task::hb43a1ab74e3daf92
26: 0x12d78e4 - tokio::runtime::task::raw::poll::h9fd75a3a647deed6
27: 0x12bba50 - std::sys_common::backtrace::__rust_begin_short_backtrace::h2b66fe48017742c2
28: 0x12bb690 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hedd6b07b50b25d09
29: 0x129069c - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h407d789be6f3d62c
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/boxed.rs:1993:9
30: 0x129069c - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hf45fc529abdd7c00
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/boxed.rs:1993:9
31: 0x129069c - std::sys::unix::thread::Thread::new::thread_start::hcd656e077ad34229
at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys/unix/thread.rs:108:17
2024-01-15T15:01:48.333995Z DEBUG source{component_kind="source" component_id=gateway component_type=logstash}:connection{peer_addr=10.89.54.188:36347}: vector::sources::util::net::tcp: Accepted a new connection. peer_addr=10.89.54.188:36347
2024-01-15T15:01:48.340134Z DEBUG source{component_kind="source" component_id=gateway component_type=logstash}:connection{peer_addr=10.89.54.188:36347}: vector::sources::util::net::tcp: Connection closed.
2024-01-15T15:01:48.453506Z ERROR sink{component_kind="sink" component_id=kafka component_type=kafka}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.
2024-01-15T15:01:48.453568Z INFO vector: Vector has stopped.
2024-01-15T15:01:48.453640Z DEBUG source{component_kind="source" component_id=gateway component_type=logstash}:connection{peer_addr=10.89.37.245:7823}: vector::sources::util::net::tcp: Start graceful shutdown.
2024-01-15T15:01:48.453652Z DEBUG source{component_kind="source" component_id=gateway component_type=logstash}:connection{peer_addr=10.89.62.209:14033}: vector::sources::util::net::tcp: Start graceful shutdown.
2024-01-15T15:01:48.453739Z DEBUG source{component_kind="source" component_id=gateway component_type=logstash}: vector::topology::builder: Source finished normally.
2024-01-15T15:01:48.454775Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="remap_source_gateway, gateway" time_remaining="59 seconds left"
2024-01-15T15:01:49.421134Z ERROR source{component_kind="source" component_id=gateway component_type=logstash}:connection{peer_addr=10.89.37.245:7823}: vector::internal_events::tcp: Error writing acknowledgement, dropping connection. error=Broken pipe (os error 32) error_code="ack_failed" error_type="writer_failed" stage="sending" internal_log_rate_limit=true
2024-01-15T15:01:49.421383Z DEBUG transform{component_kind="transform" component_id=remap_source_gateway component_type=remap}: vector::topology::builder: Synchronous transform finished with an error.
2024-01-15T15:01:49.421399Z ERROR transform{component_kind="transform" component_id=remap_source_gateway component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
2024-01-15T15:01:53.454051Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gateway" time_remaining="54 seconds left"
2024-01-15T15:01:58.454189Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gateway" time_remaining="49 seconds left"
2024-01-15T15:02:03.454918Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gateway" time_remaining="44 seconds left"
2024-01-15T15:02:08.454973Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gateway" time_remaining="39 seconds left"
2024-01-15T15:02:13.454684Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gateway" time_remaining="34 seconds left"
2024-01-15T15:02:18.454653Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gateway" time_remaining="29 seconds left"
2024-01-15T15:02:18.454660Z INFO source{component_kind="source" component_id=gateway component_type=logstash}:connection{peer_addr=10.89.62.209:14033}: vector::sources::util::net::tcp: Resetting connection (still open after seconds). seconds=30s
2024-01-15T15:02:18.454729Z DEBUG source{component_kind="source" component_id=gateway component_type=logstash}: vector::topology::builder: Source pump finished normally.
2024-01-15T15:02:18.454758Z DEBUG source{component_kind="source" component_id=gateway component_type=logstash}: vector::topology::builder: Source pump supervisor task finished normally.
Example Data
No response
Additional Context
No response
References
No response
Hi @michMartineau !
Apologies for the delay in reproducing this. I tried today but unfortunately wasn't able to reproduce this issue using a local Kafka instance and Logstash to send data. Do you run into it consistently? If so, could you try enabling Kafka debug logging to try to troubleshoot? This can be done by doing both of the following:
- Setting the environment variable VECTOR_LOG="librdkafka=trace,rdkafka::client=debug"
- Adding librdkafka_options.debug = "all" to the kafka sink
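Put together, the sink section of the configuration above with the suggested debug option would look roughly like this (a sketch based on the reporter's own configuration; only the librdkafka_options block is new):

```yaml
sinks:
  kafka:
    type: kafka
    inputs: [remap_source_gateway]
    bootstrap_servers: ${KAFKA_HOST}
    topic: '{{ %kafka_topic }}'
    encoding:
      codec: json
    # Forward librdkafka's internal debug output, as suggested above
    librdkafka_options:
      debug: "all"
```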
Hi @jszwedko, thanks for the tips, I will test that today. In our non-production environments, the issue appears randomly. In our production environments, where there is more activity, the issue was more constant, so we rolled back the deployment there. I didn't find a root cause.
Hi @jszwedko, I've reproduced the issue with the configuration you gave me. I've also added the environment variable RUST_BACKTRACE=full. Here is the log file: vector-kafka-sink-issue-20240207.log
I wonder whether this is more an issue with internal tracing than a Kafka issue.
Just a remark: VECTOR_LOG="librdkafka=trace,rdkafka::client=debug" does not seem to be a valid log level configuration, so the log file above only contains results from librdkafka_options.debug = "all".
Thanks for that @michMartineau ! That is interesting. To confirm, you are only seeing this issue on 0.35.0 and never saw it on 0.34.2?
Assuming that's true, two changes jump out to me that were in the 0.35.0 release:
- https://github.com/vectordotdev/vector/pull/19082
- https://github.com/vectordotdev/vector/pull/19090
The tracing changes seem like a more likely culprit based on the backtrace, but I'm not seeing any smoking guns.
Indeed, I've never seen this issue with 0.34.2.
Would you be willing to try out a custom image? I can create one that reverts the tracing changes to see if that fixes it for you.
Good idea, I can try.
@michMartineau I published a custom build that is v0.35.0 with the kafka tracing changes reverted. Could you try one of these images and let me know how it goes?
https://hub.docker.com/r/timberio/vector/tags?page=1&name=0.35.0.custom.059fb1b
Hi @jszwedko, I've tested it and I was wrong: it failed. My bad. Last time I tested 0.34.2 it was OK, but the issue seems related to activity load. So I've tested all versions from 0.33.0 to 0.35.0:
Chart | Binary | Result
---|---|---
0.30.0 | 0.35.0 | KO
0.29.1 | 0.34.2 | KO
0.29.0 | 0.34.1 | KO
0.28.0 | 0.34.0 | KO
0.27.0 | 0.33.1 | KO
0.26.0 | 0.33.0 | OK
So it seems due to a change introduced in 0.33.1. Here are the logs with version 0.33.1: vector-0.33.1.log
I wonder if it could be related to this commit https://github.com/vectordotdev/vector/commit/fa09de37c735bec57a67d78641b9db13c17097d8
Thanks for doing that analysis @michMartineau ! Let me spin up a custom build that is v0.33.1 without that commit and we can see how that looks. I agree it seems like the most suspicious.
Pushed another custom build. Can you try one of https://hub.docker.com/r/timberio/vector/tags?page=1&name=0.33.1.custom.a1b6913 ?
Yes, sure. Thanks again. I can't today, but I will do it next Monday.
Hi, I've tested version 0.33.1 and the custom version of 0.33.1 without this commit several times.
- CrashLoopBackOff each time 0.33.1 is deployed.
- Nothing with the custom version.
Thanks for testing! We'll take a closer look at that commit and see if anything obviously jumps out.
Looks like this has been showing up more frequently now. Thanks for pointing to the exact commit; awaiting more findings @jszwedko. Thanks @michMartineau!
Any luck with identifying and fixing this bug?
It seems to happen during rebalancing, or when some change on the Kafka side triggers a change in Vector. (This is just a guess based on closer debugging of the issue; I may be wrong, but this is my initial understanding.)
We think we may have found the issue, fixing it in https://github.com/vectordotdev/vector/pull/20001. I created another set of custom docker images to try: https://hub.docker.com/r/timberio/vector/tags?page=1&name=0.35.1.custom.34bc8a. @michMartineau (or others) do you mind giving those a try and letting us know if they fix the issue for you?
@jszwedko sure I will test that.
@jszwedko I did some tests yesterday and today. I've switched between 0.36.0 and the custom version. With 0.36.0-alpine, I saw some restarts; nothing with the custom version. So it looks good to me.
Thanks for verifying!