Vector metric vector_open_files not showing correct data and missing description in documentation
Problem
We are using the file source to fetch pod logs and push them to Kafka using the kafka sink. We want a mechanism by which we can be certain that Vector is not losing data or lagging far behind.
While looking into this, we found that Vector exposes a metric, vector_open_files, which exists but is not mentioned in the documentation. We assume this metric reports, at any given time, how many files Vector has open for reading.
Our configuration is such that at any given time the Vector agent can be reading at most 2 files (the current file plus a second file created by rotation, containing a copy of the first). However, in the graph we see the metric value reach 3 from time to time. Also, when a file is rotated and Vector detects it, ideally it should complete reading that file and vector_open_files should drop to 1.
Our main blocker in shifting to Vector is finding a way to be absolutely sure that Vector is keeping up and not lagging far behind, along with a mechanism that gives us confidence that no data is being lost while reading files.
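To illustrate, the kind of check we have in mind is something like the following Prometheus alert rule. This is only a sketch: the alert name, threshold, and duration are placeholders for our setup, and it assumes the vector_component_received_events_total / vector_component_sent_events_total counters exposed via internal_metrics and the prometheus_exporter sink (component_id values match our config below).

```yaml
groups:
  - name: vector-lag
    rules:
      - alert: VectorKafkaSinkLagging
        # Rate of events read by the file source minus the rate of events
        # delivered by the kafka sink over the last 5 minutes. A sustained
        # positive gap suggests Vector is falling behind.
        expr: |
          sum(rate(vector_component_received_events_total{component_id="logs"}[5m]))
            - sum(rate(vector_component_sent_events_total{component_id="kafka"}[5m]))
          > 1000
        for: 10m
        labels:
          severity: warning
```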
Configuration
```yaml
customConfig:
  data_dir: /vector-data-dir
  acknowledgements:
    enabled: true
  api:
    enabled: true
    address: 127.0.0.1:8686
    playground: true
  sources:
    logs:
      type: file
      oldest_first: true
      exclude:
        - /var/log/pods/particular-pod-directory-*/container_name/*.tmp
        - /var/log/pods/particular-pod-directory-*/container_name/*.gz
      include:
        - /var/log/pods/particular-pod-directory-*/container_name/*
    internal_metrics:
      type: internal_metrics
  sinks:
    prom_exporter:
      type: prometheus_exporter
      inputs: [internal_metrics]
      address: 0.0.0.0:9090
      buffer:
        type: disk
        when_full: block
        max_size: 10000000000
    kafka:
      type: kafka
      inputs:
        - logs
      bootstrap_servers: brokers:9092
      topic: test
      encoding:
        codec: json
      compression: zstd
      healthcheck:
        enabled: false
      librdkafka_options:
        request.required.acks: "1"
        message.timeout.ms: "0"
        batch.num.messages: "8192"
        linger.ms: "100"
        batch.size: "1000000"
      message_timeout_ms: 0
      buffer:
        type: disk
        when_full: block
        max_size: 10000000000
```
Currently we are testing Vector at 20k requests per second. Our actual application can produce logs at about 200k requests per second.
We haven't chosen the kubernetes_logs source for now since we don't want any enrichment.
Version
0.37.1
Thanks @ShahroZafar. I do see that the metric is undocumented. It should, as you note, measure the number of files the file source has open.
> Also, when a file is rotated and Vector detects it, ideally it should complete reading that file and vector_open_files should drop to 1.
When the file is rotated, does it match one of the exclude patterns?
As an aside, you could try increasing max_read_bytes. Users often see better performance with a higher limit.
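For example (just a sketch; the value here is illustrative and worth benchmarking against your own workload):

```yaml
sources:
  logs:
    type: file
    # max_read_bytes caps roughly how many bytes are read from one file
    # before the source moves on to the next. The default is 2048; a larger
    # value such as 64 KiB reduces per-file switching overhead at high volume.
    max_read_bytes: 65536
```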
> When the file is rotated, does it match one of the exclude patterns?
No. The exclude pattern is limited to .tmp and .gz files. The rotated file is not .gz; it's in the format 0.logs.{Timestamp}.
> As an aside, you could try increasing max_read_bytes. Users often see better performance with a higher limit.
We have oldest_first: true since these are Kubernetes logs and we want to read the old files as soon as possible, before they are further rotated to .gz. And I think, per the docs (please correct me if I am wrong), if oldest_first is set, max_read_bytes doesn't come into play.
Ah, I missed that you had oldest_first; yes, that should cause it to read the oldest files first rather than round-robin balancing. My expectation would then match yours:
> Also, when a file is rotated and Vector detects it, ideally it should complete reading that file and vector_open_files should drop to 1.
However, I believe the file source will maintain open file handles to all matching files, even if it isn't actively reading them. Related: https://github.com/vectordotdev/vector/issues/10005
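If those held handles are what is inflating the gauge, one possible workaround, sketched below, is to also exclude the timestamped rotations (the *.logs.* pattern is hypothetical, based on the 0.logs.{Timestamp} format you described). Note the trade-off: an excluded file is never read again, so this is only safe if rotated files have already been fully consumed.

```yaml
sources:
  logs:
    type: file
    exclude:
      - /var/log/pods/particular-pod-directory-*/container_name/*.tmp
      - /var/log/pods/particular-pod-directory-*/container_name/*.gz
      # Hypothetical: drop timestamped rotations so their handles are released.
      # Anything matched here is no longer read, so unread data would be lost.
      - /var/log/pods/particular-pod-directory-*/container_name/*.logs.*
```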