File source stops reading a file after it is deleted and immediately recreated
### A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
### Problem
After a file is removed and then immediately recreated and written to again, Vector may keep the deleted file open and never sees any new input for that file.
Steps to reproduce the bug with the configuration attached below:
1. Go to the working directory:
   ```sh
   cd /tmp
   ```
2. Simulate rotation in bash. We're going to do it with file removal:
   ```sh
   for i in {1..101}; do echo "abc: $i" >> 1.txt; done; rm 1.txt; for i in {1..10100}; do echo "cde: $i" >> 1.txt; done;
   ```
3. Check whether the file has not been released by Vector:
   ```sh
   lsof -nP +L1
   ```
   ```
   COMMAND PID    USER FD TYPE DEVICE SIZE/OFF NLINK NODE    NAME
   vector  413099 root 9r REG  259,2  100795   0     8126906 /tmp/1.txt (deleted)
   ```
This is the case where Vector did not release the file.
If you see no held files, try again from step 2. It usually takes 2-3 attempts before Vector fails to release the file (a helper loop for this is sketched below).
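A minimal bash sketch, based on the commands above, that repeats the rotation until `lsof` reports the deleted file still being held. The two-second pause is an arbitrary assumption to give Vector time to pick up the new file:

```sh
# Repeat the rotation until some process still holds the deleted /tmp/1.txt.
# The path and line counts come from the reproduction steps above.
cd /tmp
attempt=0
until lsof -nP +L1 2>/dev/null | grep -q '/tmp/1.txt (deleted)'; do
  attempt=$((attempt + 1))
  for i in {1..101}; do echo "abc: $i" >> 1.txt; done
  rm 1.txt
  for i in {1..10100}; do echo "cde: $i" >> 1.txt; done
  sleep 2   # give Vector a moment to notice the new file
done
echo "Deleted file still held after ${attempt} attempt(s)"
```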
After this happens, Vector does not see any input from this file:
```sh
$ tail -n 1 vector-1234.log
{"file":"/tmp/1.txt","host":"host","message":"abc: 101","source_type":"file","timestamp":"2023-01-10T10:52:36.667782990Z"}
```
Note that this is the last line written before the file was dropped. No data from the new file appears in the output log (a quick check is sketched below).
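A quick way to confirm that none of the recreated file's lines reached the sink (the path is taken from the configuration below); a count of 0 confirms the behaviour:

```sh
# Count "cde:" lines from the recreated file in the sink output.
grep -c '"message":"cde' /tmp/vector-1234.log
```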
### Configuration
```yaml
sources:
  post_dummy_source:
    type: file
    include:
      - /tmp/*.txt
sinks:
  my_sink_id:
    type: file
    inputs:
      - post_*
    compression: none
    encoding:
      codec: json
    path: /tmp/vector-1234.log
```
### Version
vector 0.26.0 (x86_64-unknown-linux-gnu c6b5bc2 2022-12-05)
### Debug Output
_No response_
### Example Data
_No response_
### Additional Context
The issue happens both for Vector running locally and for Vector running on Kubernetes.
### References
_No response_
What timing. This morning we just experienced a "too many open files" OS error coming from Vector on a file-based pipeline that is also supposed to be deleting old files. We discovered that Vector indeed did not delete all the files it should have, and it exhausted the OS's open-file limit. The end effect is that lots of old files were left on the server and the disk started to fill up.
I can only reproduce this issue if I run the same `for i in {1..101}; do echo "abc: $i" >> 1.txt; done; rm 1.txt; for i in {1..10100}; do echo "cde: $i" >> 1.txt; done;` command multiple times.
If I run this command instead:
```sh
for i in {1..101}; do echo "abc: $(date +%s.%N)" >> 1.txt; done; rm 1.txt; for i in {1..101}; do echo "cde: $(date +%s.%N)" >> 1.txt; done;
```
the problem does not seem to occur, since no two files will ever have the same fingerprint.
It's likely the problem is caused because:
- Vector is tailing a file - first line `cde: 1`.
- We append `abc: ..` onto this file. This file is deleted.
- We create a new file - first line `cde: 1`.
Since the fingerprint hasn't changed, Vector doesn't realise that this is a new file, so it keeps the file handle to the deleted file open.
I'm not sure if there is a solution here. Real-world cases should not be creating files with the same fingerprint.
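For synthetic reproductions like the one above, one possible mitigation (a sketch, not something verified against this report) is to use the file source's `device_and_inode` fingerprint strategy, so a recreated file gets a new fingerprint even when its first lines are identical:

```yaml
sources:
  post_dummy_source:
    type: file
    include:
      - /tmp/*.txt
    fingerprint:
      # identify files by device + inode instead of a content checksum
      strategy: device_and_inode
```

The trade-off is that inode-based fingerprinting can misbehave with copy-based rotation or inode reuse, so this is only a sketch of a workaround rather than a general fix.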
> What timing. This morning we just experienced a "too many open files" OS error coming from Vector on a file-based pipeline that is also supposed to be deleting old files. […]
@hhromic Is it possible you are creating multiple files with the same fingerprint?
Hi @StephenWakely, thanks for investigating this! Your findings are quite interesting.
> Is it possible you are creating multiple files with the same fingerprint?
We are using the following configuration:
```yaml
fingerprint:
  strategy: checksum
  lines: 7
  ignored_header_bytes: 5
```
However, we are processing gzipped files and, as I reported in #13193, the checksum fingerprinter has issues with compressed data. After what you reported, I'm now thinking that this issue (for us) might be related to the one I reported for gzipped data.
Is there a way for me to verify that Vector is seeing the same fingerprints to answer your question?
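Lacking a built-in way to dump fingerprints, here is a rough shell approximation of a lines-based checksum fingerprint for comparing two files by hand. This is not Vector's exact algorithm, only an illustration of the idea; the 5-byte header skip and 7-line count come from the configuration above, and the file paths are placeholders:

```sh
# Approximate a lines-based checksum fingerprint: skip the ignored header
# bytes, take the first N newline-delimited "lines" of the raw (possibly
# still-compressed) bytes, and hash them. Files that collide here are a
# strong hint that a checksum over the same byte range would also collide.
approx_fingerprint() {
  tail -c +6 "$1" | head -n 7 | sha256sum | awk '{print $1}'
}

approx_fingerprint /var/log/app/file-a.log.gz   # placeholder path
approx_fingerprint /var/log/app/file-b.log.gz   # placeholder path
```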
I just realised that the entity deleting the files in our case is Vector itself (configured to do so in the file source).
In the issue reported here and the reproduction case being used, the entity deleting the files is external to Vector.
Therefore, it seems our issue is actually different from the one in this report :(.
This issue does indeed happen due to the fingerprint.
We have a similar issue in our production environment. We will check whether it happens due to checksum equality; if not, we will try to create a new ticket simulating our production environment issue.
Thank you @StephenWakely!