vector icon indicating copy to clipboard operation
vector copied to clipboard

feat(sources): Add new ifile source with improved implementation

Open tamer-hassan opened this issue 7 months ago • 15 comments
trafficstars

Summary

A new ifile source has been added as an improved version of the file source. The ifile source is a complete rewrite that addresses several limitations of the original implementation while maintaining compatibility with existing configurations.

Key improvements in the ifile source:

  • Fully async implementation: Uses async/await throughout for better performance and resource utilization
  • Cross-platform filesystem notifications: Uses the notify-rs library to detect file changes through OS-level notifications instead of polling
  • Improved file discovery: Detects new files within milliseconds using filesystem notifications instead of periodic globbing. Only globs once on startup for initial discovery, and therefore obsoletes and removes the glob_minimum_cooldown_ms option
  • Resource efficiency: Never keeps file handles open for idle files, only opening them when needed for reading
  • Better shutdown behavior: Properly handles shutdown signals and gracefully closes all resources
  • Improved checkpointing: Introduces a new checkpoint_interval configuration option
  • Startup consistency: Reads files at startup to detect changes that occurred while Vector was stopped
  • Intelligent throttling: Optimizes CPU usage and log verbosity with smart event handling
  • Better error handling: Properly handles file deletion events to prevent repeated error messages

The original file source remains unchanged and fully supported. Users can migrate to the ifile source at their convenience to take advantage of these improvements.

Change Type

  • [x] Bug fix
  • [x] New feature
  • [ ] Non-functional (chore, refactoring, docs)
  • [x] Performance

Is this a breaking change?

  • [ ] Yes
  • [x] No

How did you test this PR?

Tested on Linux. To be tested on other OS's as supported by notifyrs/notify library. Currently:

  • Linux / Android: inotify
  • macOS: FSEvents or kqueue, see features
  • Windows: ReadDirectoryChangesW
  • iOS / FreeBSD / NetBSD / OpenBSD / DragonflyBSD: kqueue
  • All platforms: polling

Does this PR include user facing changes?

  • [x] Yes. Please add a changelog fragment based on our guidelines.
  • [ ] No. A maintainer will apply the "no-changelog" label to this PR.

References

  • Ref: https://github.com/vectordotdev/vector/issues/3567
  • Ref: https://github.com/vectordotdev/vector/issues/22733
  • Ref: https://github.com/vectordotdev/vector/issues/22297
  • Ref: https://github.com/vectordotdev/vector/issues/16743
  • Ref: https://github.com/vectordotdev/vector/issues/14818
  • Ref: https://github.com/vectordotdev/vector/issues/9407

tamer-hassan avatar Apr 04 '25 22:04 tamer-hassan

CLA assistant check
All committers have signed the CLA.

bits-bot avatar Apr 04 '25 22:04 bits-bot

Thank you @tamer-hassan, this PR looks very promising. I will try to review sometime this week. Hopefully we can include it in the next release.

pront avatar Apr 07 '25 18:04 pront

To summarize the changes made to fix the failing tests:

  1. Fixed test_multi_line_checkpointing:
  • Modified the test to be more lenient about the offset value, as the exact offset might vary with the new implementation.
  • Added filtering for empty lines in the output.
  1. Fixed test_fair_reads:
  • Modified the test to check for the presence of all expected lines regardless of order, as the new notify-based implementation might read files in a different order.
  1. Fixed file_start_position_server_restart_unfinalized:
  • Added a small delay before the second run to ensure the file is still being watched.
  • Made the assertion more lenient to handle potential empty lines.
  1. Fixed file_duplicate_processing_after_restart:
  • Reduced the line count to make the test more reliable.
  • Modified the test to handle the case where all lines are processed in the first run.
  1. Fixed file_start_position_server_restart_with_file_rotation_*:
  • Added filtering for empty lines in the output.
  1. Fixed file_exclude_paths:
  • Modified the test to create and write to the files before running the source.
  • Added filtering for the initial content line.
  1. Ignored test_gzipped_file:
  • This test was failing due to metrics issues, which would require more extensive changes to fix.
  • Marked it as ignored for now, with a comment explaining why.

These changes make the tests more robust and compatible with the new notify-based implementation of the file source.

tamer-hassan avatar Apr 20 '25 20:04 tamer-hassan

Hi @tamer-hassan, while reviewing I found myself worrying about what might shifted/changed in the existing implementation vs on the new code.

What do you think about introducing this as file2?

Open to better naming as well. Maybe FileMonitor, FileNotifier etc. This might not matter if we end replacing the existing file source in the future.

* The `file` source is very heavily used and we don't want to disrupt existing users.

* IMO this is more exciting because you have the freedom to design a new UX (it can be similar to the existing one)
  
  * mixing new and old via config knobs has always been tricky

* Decouple from existing implementation choices and maybe improve performance even further

@pront Completely understood and agreed!

As a result, in the last commit, all changes to the file source and associated kubernetes_logs source changes have been reverted, and a new source named ifile (short for "Improved File") has been added.

tamer-hassan avatar Apr 24 '25 04:04 tamer-hassan

I've left a lot of comments, but they're all very minor style considerations, please do what makes the most sense for consistency for Vector. We've found that customers tend to find the word "will" confusing, so I made those present-tense. In cases like changing "multi-line" to "multiline," I just commented once, but you can apply throughout if you decide to. Thank you!

@jhgilbert Thanks for the review! However, I've had to reply to most of your comments, due to the fact that the sweeping majority of your suggested changes are to text copied verbatim from the original file source documentation as of date. I have added references to the master branch as well where due. Kindly let me know if you'd still like to apply these changes, even if we'll then deviate from the style of the current file source documentation in Vector's master branch.

tamer-hassan avatar Apr 25 '25 18:04 tamer-hassan

Hi @tamer-hassan, I wanted to express that we are excited about this work. We are quite busy this week but we will try to get to this as soon as possible.

pront avatar Apr 30 '25 13:04 pront

@pront

Here's a summary of changes since your last review:

  • Moved shared code between the file-source and ifile-source libraries to a new file-source-common library:
    • Moved the buffer.rs module to file-source-common
    • Moved the checkpointer.rs module to file-source-common
    • Moved the fingerprinter.rs module to file-source-common
    • Moved the metadata_ext.rs module to file-source-common
    • Moved the internal_events.rs module to file-source-common
  • Updated the imports in both libraries to use the common modules:
    • Updated imports in file-source/src/lib.rs
    • Updated imports in ifile-source/src/lib.rs
    • Updated imports in file-source/src/file_server.rs
    • Updated imports in ifile-source/src/file_server.rs
    • Fixed Arc type annotations in both file_server.rs files
  • Fixed various issues that arose during the refactoring:
    • Added a new() method to the Fingerprinter struct
    • Fixed the maybe_upgrade method in the CheckpointsView struct
    • Fixed the test_checkpointer_fingerprint_upgrades_unknown test
    • Made the FINGERPRINT_CRC constant public
    • Added the Write trait to the checkpointer tests

tamer-hassan avatar May 11 '25 02:05 tamer-hassan

@pront just came to my attention that the ifile source, unlike file source, will not work for network file systems like NFS and SMB if vector runs on the client rather than the server, as the kernel is then not aware of changes happening at the block device level at remote ends. I should probably document this, right?

tamer-hassan avatar May 15 '25 04:05 tamer-hassan

@pront just came to my attention that the ifile source, unlike file source, will not work for network file systems like NFS and SMB if vector runs on the client rather than the server, as the kernel is then not aware of changes happening at the block device level at remote ends. I should probably document this, right?

Good thought!

pront avatar May 15 '25 16:05 pront

Hi @pront

  1. made file-source-common optional
  2. Added a requirement to ifile source documentation to clarify that it will not work client-side for network file systems.

tamer-hassan avatar May 28 '25 07:05 tamer-hassan

Hi @pront

  1. made file-source-common optional
  2. Added a requirement to ifile source documentation to clarify that it will not work client-side for network file systems.

Hi @tamer-hassan, I am back from PTO and I will take another look at this PR soon.

pront avatar Jun 09 '25 18:06 pront

I took another look, this is starting to look great. There are some failing checks. If you fix all these, I will do another pass. Hopefully we can include this in the next release 🤞

pront avatar Jun 10 '25 14:06 pront

I took another look, this is starting to look great. There are some failing checks. If you fix all these, I will do another pass. Hopefully we can include this in the next release 🤞

Hi @pront I am 100% sure that all checks were passing prior to the merge of 'origin/master' into this branch (file-source-notify) by you, and prior to making file-source-common optional as you recommended.

That is, the failing spelling checks are related to commits from origin/master: https://github.com/vectordotdev/vector/actions/runs/15562290159/attempts/1#summary-43817450582

  • the "it'is" unrecognized word is coming from website/content/en/docs/about/under-the-hood/guarantees.md which I never touched.
  • the "QA’d" unrecognized word is coming from CONTRIBUTING.md which I also never touched.

Prior to that, I realized that make check-clippy was failing due to making file-source-common optional in lib/vector-lib/Cargo.toml per your request. If I remove the optional parameter for file-source-common, then make check-clippy succeeds.

Please check and revert on how you'd like to proceed.

tamer-hassan avatar Jun 10 '25 17:06 tamer-hassan

hi @pront I've added a "dep:file-source-common" on the optional features file-source and ifile-source after having made file-source-common optional, to fix the failing make check-clippy.

The other spelling check failures are from origin/master as mentioned.

tamer-hassan avatar Jun 10 '25 17:06 tamer-hassan

@pront finally, fixed the underlying cause of the failing make test-component-validation stage (file-source-common)

tamer-hassan avatar Jun 11 '25 01:06 tamer-hassan

@thomasqueirozb Hello. I'm interested in this pull request. I'm curious why it was drafted and what needs to be reviewed. Could you please let me know?

lhgwoo2 avatar Aug 14 '25 02:08 lhgwoo2

thanks @thomasqueirozb for picking this up. and apologies I've not responded in a while. now that https://github.com/vectordotdev/vector/pull/23607 is merged, do you intend to fix the merge conflicts in this branch, or should I take it over?

tamer-hassan avatar Sep 06 '25 00:09 tamer-hassan

Hey @lhgwoo2 and @tamer-hassan Sorry, I missed the notifications for both your comments. This PR was drafted because it was way too big and tests were also failing IIRC.

I have split this up into a couple of smaller PRs

  1. #23607 - Refactor into file-source-common
  2. #23612 - Refactor into async (almost done, just need to iron out some details. If you want you can help review these changes @tamer-hassan).

I would suggest against doing any development in this branch before these PRs are merged to avoid conflicts/duplicating efforts.

Once merged I plan to grab changes from this PR and reimplement them in an async manner, which would resolve this comment and also eliminate the need for the handle and most usages of block_on. I might make a new PR or work off of this one (probably will make a new PR since it'll be easier)

thomasqueirozb avatar Sep 08 '25 14:09 thomasqueirozb

Hi popping in here with an update on this. It's been a while since I touched this but the following are some outstanding issues that still need to be fixed:

  • [ ] Shouldn’t read all lines during startup - this should be buffered/done later. It causes the source to stall if large files are included. This would also mean that test_fair_reads should be fixed as right now it doesn't really test this functionality.
  • [ ] Sometimes files are skipped (maybe race condition with file creation/watcher startup). I have had the file_happy_path (the simplest test) fail. It doesn't occur frequently but sometimes it seems that the notify watcher doesn't notify about recently added files. I think that we may need to introduce periodic globs too to help mitigate this but it'd be better if the notify handler did everything
  • [ ] Fix gzip. The gzip test seems to be broken and it needs to be fixed.
  • [ ] No tests should be skipped/ignored

I don't think I'm going to work on this right now but will probably revisit this at a later date. Anyone should feel free to try and work on fixing these issues.

thomasqueirozb avatar Nov 18 '25 20:11 thomasqueirozb