`vector` refuses to start when connectivity to one/any external service is not working

Open james-stevens opened this issue 1 year ago • 1 comments

Problem

When running vector on RHEL9, the file /usr/lib/systemd/system/vector.service contains the line

 ExecStartPre=/usr/bin/vector validate

This means vector will refuse to start if an external connection is not currently available, instead of starting up and then retrying the connection, which is what it would do if the connection had gone down after a successful start.

From our config, I tried removing

healthchecks:
  require_healthy: true

and removing this from every sink

    healthcheck:
      enabled: true

but vector validate still fails, causing the service to refuse to start.

I would suggest that the ExecStartPre=/usr/bin/vector validate line in the vector.service file have either --no-environment or --skip-healthchecks added, so that vector starts up and retries the external connection once running, which is what it would do if the connection had failed during normal operation.
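
Until the packaged unit changes, the same effect can probably be had locally with a systemd drop-in rather than by editing /usr/lib/systemd/system/vector.service directly. The sketch below is only an illustration: the function name write_vector_dropin and the file name override.conf are made up for this example, and it assumes --no-environment makes validate skip both component checks and healthchecks on your Vector version. Note that --skip-healthchecks alone may not be enough for sinks that connect while the component is being built.

```shell
#!/bin/sh
# Hypothetical helper: writes a systemd drop-in that replaces the packaged
# ExecStartPre= line with one that skips environment checks.
write_vector_dropin() {
    dropin_dir="$1"   # e.g. /etc/systemd/system/vector.service.d
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
# An empty assignment clears the ExecStartPre= list inherited from the packaged unit.
ExecStartPre=
# Assumption: --no-environment skips component checks and healthchecks.
ExecStartPre=/usr/bin/vector validate --no-environment
EOF
}

# Usage (as root), not run automatically here:
#   write_vector_dropin /etc/systemd/system/vector.service.d
#   systemctl daemon-reload
#   systemctl restart vector
```

The trade-off is that validate then only checks the configuration itself, so problems such as an unreadable CA file or an unreachable peer will only show up in the logs after start-up.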

Because we use vector to run other data migration services, in this case aggregating metrics, having them all fail because one (or more) is not working is not a useful mode of operation.

Configuration

api:
  enabled: true
  address: 127.0.0.1:8686

expire_metrics_secs: 300

healthchecks:
  require_healthy: true

sources:
  vector_metrics:
    type: internal_metrics


  services_metrics:
    type: prometheus_scrape
    scrape_interval_secs: 15
    scrape_timeout_secs: 2
    endpoints:
      - "http://127.0.0.1:9200/metrics"
      - "http://127.0.0.1:9167/metrics"


  dnstap:
    type: dnstap
    socket_path: /var/lib/vector/dnstap.sock
    socket_file_mode: 0o777
    mode: unix
    multithreaded: true

  relay_blocks:
    type: vector
    address: 10.17.252.114:9001

sinks:
  output_my_prom:
    type: prometheus_exporter
    address: 172.17.252.114:9100
    inputs:
      - vector_metrics
      - services_metrics

  vector_dnstap:
    inputs: [ dnstap ]
    type: vector
    address: "<hostname>:9000"
    buffer:
      max_size: 2684354880
      type: "disk"
      when_full: "drop_newest"
    healthcheck:
      enabled: true
    tls:
      enabled: true
      ca_file: /etc/vector/pems/myCA.pem
      key_file: /etc/vector/pems/vector.pem
      crt_file: /etc/vector/pems/vector.pem
      key_pass: "****"
      verify_certificate: true
      verify_hostname: true

  vector_relay_blocks:
    inputs: [ relay_blocks ]
    type: vector
    address: "<hostname>:9001"
    buffer:
      max_size: 2684354880
      type: "disk"
      when_full: "drop_newest"
    healthcheck:
      enabled: true
    tls:
      enabled: true
      ca_file: /etc/vector/pems/myCA.pem
      key_file: /etc/vector/pems/vector.pem
      crt_file: /etc/vector/pems/vector.pem
      key_pass: "****"
      verify_certificate: true
      verify_hostname: true

Version

vector 0.37.0 (x86_64-unknown-linux-gnu c1da408 2024-03-26 13:41:34.870460047)

Debug Output

# vector validate
√ Loaded ["/etc/vector/vector.yaml"]
√ Component configuration
2024-04-26T11:02:06.910152Z ERROR vector::topology::builder: msg="Healthcheck failed." error=Request failed: status: Unavailable, message: "error trying to connect: error:0A000086:SSL routines:(unknown function):certificate verify failed:ssl/statem/statem_clnt.c:2092:: unable to get local issuer certificate", details: [], metadata: MetadataMap { headers: {} } component_kind="sink" component_type="vector" component_id=vector_relay_blocks
x Health check for "vector_relay_blocks" failed: Request failed: status: Unavailable, message: "error trying to connect: error:0A000086:SSL routines:(unknown function):certificate verify failed:ssl/statem/statem_clnt.c:2092:: unable to get local issuer certificate", details: [], metadata: MetadataMap { headers: {} }
2024-04-26T11:02:07.010816Z ERROR vector::topology::builder: msg="Healthcheck failed." error=Request failed: status: Unavailable, message: "error trying to connect: error:0A000086:SSL routines:(unknown function):certificate verify failed:ssl/statem/statem_clnt.c:2092:: unable to get local issuer certificate", details: [], metadata: MetadataMap { headers: {} } component_kind="sink" component_type="vector" component_id=vector_dnstap
x Health check for "vector_dnstap" failed: Request failed: status: Unavailable, message: "error trying to connect: error:0A000086:SSL routines:(unknown function):certificate verify failed:ssl/statem/statem_clnt.c:2092:: unable to get local issuer certificate", details: [], metadata: MetadataMap { headers: {} }
√ Health check "output_my_prom"

james-stevens • Apr 26 '24 11:04

You talk about vector validate, but with the amqp sink I observe that neither validate nor vector itself can start when the sink does not respond correctly on the port. I can see that vector validate behaves slightly differently with and without --skip-healthchecks when a dummy port is opened, but in both cases it fails with exit code 78.

$ . test-vector.sh
+ vector validate
√ Loaded ["/etc/vector/vector.yaml"]

Component errors
----------------
x Sink "rabbitmq": creating amqp producer failed: IO error: Connection refused (os error 111)

+ echo exited 78
exited 78
+ vector validate --skip-healthchecks
√ Loaded ["/etc/vector/vector.yaml"]

Component errors
----------------
x Sink "rabbitmq": creating amqp producer failed: IO error: Connection refused (os error 111)

+ echo exited 78
exited 78
+ vector validate
+ nc -lk -p 5673 127.0.0.1
√ Loaded ["/etc/vector/vector.yaml"]
2024-06-05T10:51:38.322879Z ERROR lapin::io_loop: Socket was readable but we read 0. This usually means that the connection is half closed this mark it as broken
2024-06-05T10:51:38.322964Z ERROR lapin::io_loop: error doing IO error=IOError(Kind(ConnectionAborted))
2024-06-05T10:51:38.323042Z ERROR lapin::channels: Connection error error=IO error: connection aborted
AMQP
Component errors
----------------
x Sink "rabbitmq": creating amqp producer failed: IO error: connection aborted

+ echo exited 78
exited 78
+ vector validate --skip-healthchecks
√ Loaded ["/etc/vector/vector.yaml"]

Component errors
----------------
x Sink "rabbitmq": creating amqp producer failed: IO error: Connection refused (os error 111)

+ echo exited 78
exited 78

brablc • Jun 05 '24 10:06