
Vector sink to Kafka doesn't retry

Open pratanini opened this issue 2 years ago • 4 comments

Vector Version

vector 0.14.0 (x86_64-unknown-linux-gnu 5f3a319 2021-06-03)

Vector Configuration File

# General
data_dir = "/vector/data/"

# Vector's API for introspection
[api]
enabled = true
address = "127.0.0.1:8686"

# Host-level logs
[sources.logs]
type = "docker_logs"

# Capture the metrics from the host.
[sources.host_metrics]
  collectors = ["cpu", "disk", "filesystem", "load", "host", "memory", "network"]
  namespace = "node"
  type = "host_metrics"
  [sources.host_metrics.filesystem]
    [sources.host_metrics.filesystem.devices]
      excludes = ["binfmt_misc"]
    [sources.host_metrics.filesystem.filesystems]
      excludes = ["binfmt_misc"]
    [sources.host_metrics.filesystem.mountpoints]
      excludes = ["*/proc/sys/fs/binfmt_misc"]


# Emit internal Vector metrics.
[sources.internal_metrics]
  type = "internal_metrics"


# Transforms
[transforms.enriched_internal_metrics]
  type = "remap"
  inputs = ["internal_metrics"]
  source = '''
  .tags.docker_host = del(.tags.host)
  '''

[transforms.enriched_host_metrics]
  type = "remap"
  inputs = ["host_metrics"]
  source = '''
  .tags.docker_host = del(.tags.host)
  '''

[transforms.enriched_logs]
  type = "remap"
  inputs = ["logs"]
  source = '''
  .labels.deployment = "test"
  '''

# Send logs to the Kafka.
[sinks.kafka]
  # General
  type = "kafka" 
  inputs = ["enriched_logs","enriched_host_metrics", "enriched_internal_metrics"]
  bootstrap_servers = "<KAFKA_URL>:443" 
  compression = "gzip" 
  topic = "topic-test" 

  # Encoding
  encoding.codec = "json" 

  # Healthcheck
  healthcheck.enabled = true 
  # Sasl
  sasl.enabled = true 
  sasl.mechanism = "SCRAM-SHA-512" 
  sasl.password = "test" 
  sasl.username = "test"

  # Buffer
  buffer.max_size = 1049000000
  buffer.type = "disk"
  buffer.when_full = "block"

  # tls
  tls.enabled = true
  tls.verify_certificate = true

  socket_timeout_ms = 300000
  message_timeout_ms = 10000

Debug Output

https://gist.github.com/santoshghhegde/6d560fad6a5b8328f88a165d15b7f359

Expected Behavior

Vector sends logs to Kafka

Actual Behavior

Vector stops sending logs to Kafka

Additional Context

  1. Kafka is deployed in Europe
  2. Vector (deployed in Docker) is sending data from APAC and Europe
  3. Kafka is behind AWS NLB

pratanini avatar Jul 27 '21 14:07 pratanini

@jszwedko Did you find time to look into this? It's blocking us from going to production.

pratanini avatar Jul 30 '21 13:07 pratanini

Hey @santoshghhegde !

Apologies for the delay. I took a look at this just now and I think I'm missing some context. In the debug output logs you shared, I'm not seeing any failures writing to Kafka. In the title, you mentioned that Vector isn't retrying; are you observing it fail to write somewhere? Or are you observing the kafka sink stop processing altogether, silently?

jszwedko avatar Aug 05 '21 22:08 jszwedko

Hi @jszwedko. Yes, in these logs it's failing silently, but I have also seen that Vector doesn't retry when network issues occur. That points to two issues, I guess, but I don't know whether they are related.

pratanini avatar Aug 06 '21 13:08 pratanini

So I dug into the source code and found that Vector does not handle kafka sink retries itself but instead leaves them to rdkafka's internal mechanisms. The kafka sink has a message_timeout_ms parameter, which translates to rdkafka's message.timeout.ms and defaults to 5 minutes.

From rdkafka docs:

Local message timeout. This value is only enforced locally and limits the time a produced message waits for successful delivery. A time of 0 is infinite. This is the maximum time librdkafka may use to deliver a message (including retries). Delivery error occurs when either the retry count or the message timeout are exceeded. The message timeout is automatically adjusted to transaction.timeout.ms if transactional.id is configured.

The retry count is set to its highest value by default, so this parameter is probably the only one that is relevant to retries.
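
To make this concrete, here is a minimal sketch of the sink section with the retry window widened. It reuses the names from the configuration in the original report. Two things are assumptions rather than facts from this thread: that the Vector version in use accepts a librdkafka_options table for passing settings through to librdkafka, and the backoff value, which is purely illustrative. retry.backoff.ms itself is a standard librdkafka property.

```toml
[sinks.kafka]
  type = "kafka"
  inputs = ["enriched_logs", "enriched_host_metrics", "enriched_internal_metrics"]
  bootstrap_servers = "<KAFKA_URL>:443"
  topic = "topic-test"
  encoding.codec = "json"

  # Upper bound on how long librdkafka may spend delivering one message,
  # retries included. The original report set this to 10000 (10 seconds);
  # 300000 (5 minutes) is the default mentioned above.
  message_timeout_ms = 300000

  # Assumption: this Vector version supports passing options straight
  # through to librdkafka. The value below is illustrative, not a recommendation.
  [sinks.kafka.librdkafka_options]
    "retry.backoff.ms" = "500"
```

With message_timeout_ms at 10000, as in the original report, librdkafka gives up on a message after 10 seconds of retrying, which may be too short to ride out a network interruption between regions.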

fpytloun avatar Jan 29 '24 12:01 fpytloun