
Vector metric vector_open_files not showing correct data and missing description in documentation

Open ShahroZafar opened this issue 1 year ago • 3 comments

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We are using the file source to fetch pod logs and push them to Kafka via the kafka sink. We want a mechanism by which we can be certain that Vector is not losing data or lagging far behind. While investigating, we found a metric, vector_open_files, which exists but is not mentioned in the documentation. We assume this metric reports how many files Vector has open for reading at any given time. Our configuration is such that the Vector agent should be reading at most 2 files at once (the current file and the second file created on rotation, containing a copy of the first). However, in the graph we see the metric value reach 3 from time to time. Also, when a file is rotated and Vector detects it, it should ideally finish reading that file, after which vector_open_files should drop to 1.

Our main blocker in shifting to Vector is having a way to be absolutely sure that Vector is keeping up and not lagging far behind, as well as a mechanism that gives us insight that no data is being lost during file reading.
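One way to approximate "not losing or lagging" is to compare events received by the source against events sent by the sink, using Vector's internal component_received_events_total and component_sent_events_total metrics (exposed with the vector_ prefix by the prometheus_exporter sink). A minimal sketch that parses Prometheus text-format output — the sample payload and component IDs below are illustrative, not real output:

```python
# Sketch: estimate in-flight backlog by diffing the file source's received
# counter against the kafka sink's sent counter. In practice the text would
# be fetched from the prometheus_exporter address (e.g. 0.0.0.0:9090/metrics).

def parse_counters(text: str, metric: str) -> float:
    """Sum all samples of `metric` across label sets in Prometheus text format."""
    total = 0.0
    for line in text.splitlines():
        if line.startswith(metric):
            total += float(line.rsplit(" ", 1)[1])
    return total

# Illustrative scrape payload (not actual Vector output).
sample = """\
vector_component_received_events_total{component_id="logs"} 120000
vector_component_sent_events_total{component_id="kafka"} 119500
"""

received = parse_counters(sample, "vector_component_received_events_total")
sent = parse_counters(sample, "vector_component_sent_events_total")
print(f"backlog estimate: {received - sent:.0f} events")  # → backlog estimate: 500 events
```

A backlog that grows without bound over time would indicate the pipeline is falling behind; a bounded, oscillating backlog is normal buffering.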

Configuration

customConfig:
  data_dir: /vector-data-dir
  acknowledgements:
    enabled: true
  api:
    enabled: true
    address: 127.0.0.1:8686
    playground: true
  sources:
    logs:
      type: file
      oldest_first: true
      exclude:
        - /var/log/pods/particular-pod-directory-*/container_name/*.tmp
        - /var/log/pods/particular-pod-directory-*/container_name/*.gz
      include:
        - /var/log/pods/particular-pod-directory-*/container_name/*
    internal_metrics:
      type: internal_metrics
  sinks:
    prom_exporter:
      type: prometheus_exporter
      inputs: [internal_metrics]
      address: 0.0.0.0:9090
      buffer:
        type: disk
        when_full: block
        max_size: 10000000000
    kafka:
      type: kafka
      inputs:
        - logs
      bootstrap_servers: brokers:9092
      topic: test
      encoding:
        codec: json
      compression: zstd
      healthcheck:
        enabled: false
      librdkafka_options:
        request.required.acks: "1"
        message.timeout.ms: "0"
        batch.num.messages: "8192"
        linger.ms: "100"
        batch.size: "1000000"
      message_timeout_ms: 0
      buffer:
        type: disk
        when_full: block
        max_size: 10000000000

Currently we are testing Vector at 20k requests per second. Our actual application can produce logs at about 200k requests per second.

We haven't chosen the kubernetes_logs source for now since we don't want any enrichment.

Version

0.37.1

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

ShahroZafar avatar May 04 '24 19:05 ShahroZafar

Thanks @ShahroZafar . I do see that the metric is undocumented. It should, as you note, measure the number of files the file source has open.

Also when a file is rotated and vector detects it, ideally it should complete reading that file and the vector_open_files should drop to 1.

When the file is rotated, does it match one of the exclude patterns?

As an aside, you could try increasing max_read_bytes. Users often see better performance with a higher limit.
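A minimal sketch of what that would look like on the source (max_read_bytes caps how many bytes are read from one file before moving on to the next; the value below is an arbitrary example, not a recommendation):

```yaml
sources:
  logs:
    type: file
    oldest_first: true
    # Read up to 256 KiB from a file before switching to the next one.
    # Larger values trade per-file fairness for throughput.
    max_read_bytes: 262144
    include:
      - /var/log/pods/particular-pod-directory-*/container_name/*
```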

jszwedko avatar May 06 '24 18:05 jszwedko

When the file is rotated, does it match one of the exclude patterns?

No. The exclude patterns are limited to .tmp and .gz. The rotated file is not .gz; it's in the format 0.logs.{Timestamp}

As an aside, you could try increasing max_read_bytes. Users often see better performance with a higher limit.

We have oldest_first: true since these are Kubernetes logs and we want to read the old files as soon as possible, before they are further rotated to .gz. And I think, per the docs (please correct me if I am wrong), if oldest_first is set, max_read_bytes doesn't come into play

ShahroZafar avatar May 06 '24 18:05 ShahroZafar

When the file is rotated, does it match one of the exclude patterns?

No. The exclude patterns are limited to .tmp and .gz. The rotated file is not .gz; it's in the format 0.logs.{Timestamp}

As an aside, you could try increasing max_read_bytes. Users often see better performance with a higher limit.

We have oldest_first: true since these are Kubernetes logs and we want to read the old files as soon as possible, before they are further rotated to .gz. And I think, per the docs (please correct me if I am wrong), if oldest_first is set, max_read_bytes doesn't come into play

Ah, I missed that you had oldest_first; yes, that should cause it to read the oldest files first rather than round-robin balancing across them. My expectation would then match yours:

Also when a file is rotated and vector detects it, ideally it should complete reading that file and the vector_open_files should drop to 1.

However, I believe the file source will maintain open file handles to all matching files, even if it isn't actively reading them. Related: https://github.com/vectordotdev/vector/issues/10005
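The handle-retention behavior can be illustrated outside Vector: a tail-style reader that keeps its handle to a renamed (rotated) file while also opening the new file holds two descriptors for one logical log stream, so a second rotation before the first file is fully drained could push the count to 3. A minimal sketch of the generic POSIX behavior — plain Python, not Vector's actual implementation:

```python
# Sketch: renaming a file does not invalidate an already-open handle, so a
# reader holds handles to both the rotated file and the recreated live file.
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "0.log")

with open(path, "w") as w:
    w.write("line 1\n")

reader_old = open(path)          # reader follows the live file
os.rename(path, path + ".1")     # logrotate-style rotation: rename...
with open(path, "w") as w:       # ...and recreate the live file
    w.write("line 2\n")
reader_new = open(path)          # reader opens the new live file

# Both handles remain valid until explicitly closed.
open_handles = [reader_old, reader_new]
content = reader_old.read().strip()  # the rotated file is still drainable

print(len(open_handles), "handles for one logical log stream")  # → 2
print(content)  # → line 1

reader_old.close()
reader_new.close()
```

This assumes a POSIX filesystem (rename of an open file); it is a sketch of why open-file counts can exceed the number of "current" files, not a claim about the file source's internals.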

jszwedko avatar May 06 '24 18:05 jszwedko