[WIP/running tests] POC: filestream growing fingerprint identity
Add a proof-of-concept for a new "growing_fingerprint" file identity mode that addresses the limitation where files smaller than the fingerprint size (default 1024 bytes) cannot be tracked.
Key changes:
- Add growingFingerprintIdentifier that stores raw bytes (hex-encoded) instead of a hash, allowing the fingerprint to grow as the file grows
- Files can be tracked immediately regardless of size (no minimum threshold)
- Implement prefix matching: when a file grows, match the old (shorter) fingerprint as a prefix of the new (longer) fingerprint
- Add IterateOnPrefix() and UpdateKey() to the store for registry key migration
- Support in-place key updates without interrupting running harvesters
- Default max_length is 1000 bytes (matching OTEL's filelog receiver)
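The fingerprinting and prefix-matching idea above can be sketched as follows. This is a minimal illustration, not the PR's code: `growingFingerprint` and `grewFrom` are hypothetical names, and treating an empty fingerprint as "never matches" is an assumption of this sketch.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// maxLength mirrors the default fingerprint.max_length from the description.
const maxLength = 1000

// growingFingerprint hex-encodes at most maxLen leading bytes of content.
// Hypothetical helper; the PR's growingFingerprintIdentifier may differ.
func growingFingerprint(content []byte, maxLen int) string {
	if len(content) > maxLen {
		content = content[:maxLen]
	}
	return hex.EncodeToString(content)
}

// grewFrom reports whether newFP is a strictly longer fingerprint that
// starts with oldFP. Because every byte maps to exactly two hex characters,
// a prefix match on the hex strings is a prefix match on the raw bytes.
// Assumption of this sketch: an empty old fingerprint never matches.
func grewFrom(oldFP, newFP string) bool {
	return oldFP != "" && len(newFP) > len(oldFP) && strings.HasPrefix(newFP, oldFP)
}

func main() {
	before := growingFingerprint([]byte("line 1\n"), maxLength)
	after := growingFingerprint([]byte("line 1\nline 2\n"), maxLength)
	other := growingFingerprint([]byte("other\n"), maxLength)

	fmt.Println(grewFrom(before, after)) // true: same file, grown
	fmt.Println(grewFrom(other, after))  // false: different content
}
```

Once a file reaches max_length bytes, the fingerprint stops growing and behaves like a fixed identity.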
Configuration:
prospector.scanner:
  fingerprint.growing: true
  fingerprint.max_length: 1000 # optional, default 1000
file_identity.growing_fingerprint: ~
This enables tracking small files that share initial content (e.g., common headers) by allowing their fingerprints to diverge as they grow with unique content.
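To illustrate the divergence described above, here is a small sketch with two files that start with the same header. The helper name `cappedFingerprint` and the sample contents are made up for the example. It also shows a consequence of the cap: files whose shared content is longer than max_length would keep identical fingerprints.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// cappedFingerprint hex-encodes at most maxLen leading bytes of content.
// Hypothetical helper used only for this illustration.
func cappedFingerprint(content []byte, maxLen int) string {
	if len(content) > maxLen {
		content = content[:maxLen]
	}
	return hex.EncodeToString(content)
}

func main() {
	header := "timestamp,level,message\n"
	fileA, fileB := header, header

	// Both files contain only the shared header: identical fingerprints.
	fmt.Println(cappedFingerprint([]byte(fileA), 1000) == cappedFingerprint([]byte(fileB), 1000)) // true

	// Each file receives its own first record: the fingerprints diverge.
	fileA += "1,info,service A started\n"
	fileB += "1,info,service B started\n"
	fmt.Println(cappedFingerprint([]byte(fileA), 1000) == cappedFingerprint([]byte(fileB), 1000)) // false

	// With a cap shorter than the shared header, they never diverge.
	fmt.Println(cappedFingerprint([]byte(fileA), 8) == cappedFingerprint([]byte(fileB), 8)) // true
}
```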
Includes integration tests covering:
- Small files tracked immediately
- Files with identical initial content differentiated as they grow
- Fingerprint migration on file growth
- Restart scenarios
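The migration scenario can be pictured with a toy in-memory registry. Everything here is an assumption made for illustration: the `registry` type, the lowercase `iterateOnPrefix` and `updateKey` methods, their signatures, and the key namespace are not taken from the PR, and the real store's IterateOnPrefix() and UpdateKey() may look different. The sketch is single-goroutine and ignores locking.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// registry is a toy stand-in for the store: key -> read offset.
type registry map[string]int64

// iterateOnPrefix calls fn for every key that starts with prefix.
// fn returns false to stop the iteration early.
func (r registry) iterateOnPrefix(prefix string, fn func(key string, offset int64) bool) {
	for k, off := range r {
		if strings.HasPrefix(k, prefix) && !fn(k, off) {
			return
		}
	}
}

// updateKey renames oldKey to newKey and keeps the stored offset.
func (r registry) updateKey(oldKey, newKey string) error {
	off, ok := r[oldKey]
	if !ok {
		return errors.New("old key not found")
	}
	if _, exists := r[newKey]; exists {
		return errors.New("new key already exists")
	}
	delete(r, oldKey)
	r[newKey] = off
	return nil
}

// migrate finds the longest stored fingerprint under ns that is a proper
// prefix of newFP and renames its key to the new fingerprint.
func (r registry) migrate(ns, newFP string) (bool, error) {
	best := ""
	r.iterateOnPrefix(ns, func(key string, _ int64) bool {
		oldFP := strings.TrimPrefix(key, ns)
		if oldFP != "" && len(oldFP) < len(newFP) &&
			strings.HasPrefix(newFP, oldFP) && len(oldFP) > len(best) {
			best = oldFP
		}
		return true
	})
	if best == "" {
		return false, nil
	}
	return true, r.updateKey(ns+best, ns+newFP)
}

func main() {
	const ns = "growing_fingerprint::" // hypothetical key namespace
	r := registry{ns + "6c696e65": 4}  // hex of "line", 4 bytes already read

	migrated, err := r.migrate(ns, "6c696e6520310a") // hex of "line 1\n"
	fmt.Println(migrated, err, r[ns+"6c696e6520310a"]) // true <nil> 4
}
```

Picking the longest matching prefix is a design choice of this sketch: it prefers the most specific stored entry when several shorter fingerprints match.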
Open question/issue
Using the raw bytes (hex-encoded) as the fingerprint makes prefix matching easy to implement, but it increases memory consumption: each in-memory instance of a fingerprint holds up to 1000 bytes of file content per file, which is 2000 characters once hex-encoded. It also increases the storage used by the registry on disk.
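A rough way to see the size difference is to compare the hex-encoded raw bytes with a fixed-size digest. The sketch below assumes SHA-256 as the hashed alternative; `keySizes` is a hypothetical helper and only measures string lengths, not actual registry overhead.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// keySizes returns the length of the hex-encoded raw content and the length
// of a hex-encoded SHA-256 digest of the same content.
func keySizes(content []byte) (rawHex, hashed int) {
	rawHex = len(hex.EncodeToString(content))
	sum := sha256.Sum256(content)
	hashed = len(hex.EncodeToString(sum[:]))
	return rawHex, hashed
}

func main() {
	// A file that already fills the default max_length of 1000 bytes.
	rawHex, hashed := keySizes(make([]byte, 1000))
	fmt.Printf("raw hex: %d chars, sha256 hex: %d chars\n", rawHex, hashed)
	// raw hex: 2000 chars, sha256 hex: 64 chars
}
```

The digest has a fixed size but cannot be prefix-matched, which is the trade-off raised in the open question.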
Checklist
- [ ] My code follows the style guidelines of this project
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] I have made corresponding changes to the default configuration files
- [ ] I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the stresstest.sh script to run them under stress conditions and race detector to verify their stability.
- [ ] I have added an entry in ./changelog/fragments using the changelog tool.
This pull request now has conflicts. Could you fix it? 🙏 To fix up this pull request, you can check it out locally. See the documentation: https://help.github.com/articles/checking-out-pull-requests-locally/
git fetch upstream
git checkout -b 44780-growing-fingerprint upstream/44780-growing-fingerprint
git merge upstream/main
git push upstream 44780-growing-fingerprint