[WIP/running tests] POC: filestream growing fingerprint identity

Open AndersonQ opened this issue 3 weeks ago • 1 comment

Add a proof-of-concept for a new "growing_fingerprint" file identity mode that addresses the limitation where files smaller than the fingerprint size (default 1024 bytes) cannot be tracked.

Key changes:

  • Add growingFingerprintIdentifier that stores raw bytes (hex-encoded) instead of a hash, allowing the fingerprint to grow as the file grows
  • Files can be tracked immediately regardless of size (no minimum threshold)
  • Implement prefix matching: when a file grows, match the old (shorter) fingerprint as a prefix of the new (longer) fingerprint
  • Add IterateOnPrefix() and UpdateKey() to the store for registry key migration
  • Support in-place key updates without interrupting running harvesters
  • Default max_length is 1000 bytes (matching OTEL's filelog receiver)
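
As a rough illustration of the prefix-matching idea in the list above, the sketch below hex-encodes the leading bytes of a file and checks whether an older fingerprint is a strict prefix of a newer one. The names `growingFingerprint` and `isGrowthOf` are hypothetical and are not taken from the PR; treating an empty fingerprint as "no match" is also an assumption.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// maxLength mirrors the default max_length from the description (1000 bytes).
const maxLength = 1000

// growingFingerprint is a hypothetical helper: it hex-encodes up to
// maxLength leading bytes of the content. Unlike a hash, the result for a
// longer file starts with the result for its earlier, shorter state.
func growingFingerprint(content []byte) string {
	if len(content) > maxLength {
		content = content[:maxLength]
	}
	return hex.EncodeToString(content)
}

// isGrowthOf reports whether newFP could belong to the same file as oldFP
// after data was appended, i.e. oldFP is a non-empty strict prefix of newFP.
func isGrowthOf(oldFP, newFP string) bool {
	return len(oldFP) > 0 &&
		len(newFP) > len(oldFP) &&
		strings.HasPrefix(newFP, oldFP)
}

func main() {
	before := growingFingerprint([]byte("line 1\n"))
	after := growingFingerprint([]byte("line 1\nline 2\n"))
	other := growingFingerprint([]byte("other\n"))

	fmt.Println(isGrowthOf(before, after)) // true
	fmt.Println(isGrowthOf(before, other)) // false
}
```

Because each input byte maps to exactly two hex characters, a byte-level prefix relation carries over to the encoded strings, which is what makes the comparison a plain string-prefix check.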

Configuration:

prospector.scanner:
  fingerprint.growing: true
  fingerprint.max_length: 1000  # optional, default 1000
  file_identity.growing_fingerprint: ~

This enables tracking small files that share initial content (e.g., common headers) by allowing their fingerprints to diverge as they grow with unique content.
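
The divergence described above can be sketched as follows. The helpers `fingerprint` and `sameIdentity` are hypothetical, and the sample file contents are made up for illustration. The last case shows a consequence that follows from the length cap: files that share more leading bytes than the configured maximum stay indistinguishable.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// fingerprint hex-encodes at most maxLength leading bytes of content.
// Hypothetical helper, not the PR's implementation.
func fingerprint(content string, maxLength int) string {
	b := []byte(content)
	if len(b) > maxLength {
		b = b[:maxLength]
	}
	return hex.EncodeToString(b)
}

// sameIdentity reports whether two files are currently indistinguishable
// by their fingerprints.
func sameIdentity(a, b string, maxLength int) bool {
	return fingerprint(a, maxLength) == fingerprint(b, maxLength)
}

func main() {
	header := "timestamp,level,message\n"
	fileA, fileB := header, header

	// Only the shared header has been written: same fingerprint.
	fmt.Println(sameIdentity(fileA, fileB, 1000)) // true

	// Unique content arrives and the fingerprints diverge.
	fileA += "2025-12-10,info,service A started\n"
	fileB += "2025-12-10,info,service B started\n"
	fmt.Println(sameIdentity(fileA, fileB, 1000)) // false

	// With a cap shorter than the shared header the files never diverge.
	fmt.Println(sameIdentity(fileA, fileB, 10)) // true
}
```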

Includes integration tests covering:

  • Small files tracked immediately
  • Files with identical initial content differentiated as they grow
  • Fingerprint migration on file growth
  • Restart scenarios
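
To make the "fingerprint migration on file growth" scenario concrete, here is a minimal in-memory sketch. `IterateOnPrefix` and `UpdateKey` are named in the description, but their signatures here are assumptions, as are the `store` type, the `migrate` helper, the `namespace + fingerprint` key layout, and the choice of the longest matching prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// store is a hypothetical stand-in for the registry; values are read offsets.
type store struct {
	entries map[string]int64
}

// IterateOnPrefix calls fn for every key that starts with prefix, stopping
// early when fn returns false. (Assumed signature.)
func (s *store) IterateOnPrefix(prefix string, fn func(key string, offset int64) bool) {
	for k, v := range s.entries {
		if strings.HasPrefix(k, prefix) && !fn(k, v) {
			return
		}
	}
}

// UpdateKey moves the state under oldKey to newKey, keeping the value, so
// the read offset survives the fingerprint change. (Assumed signature.)
func (s *store) UpdateKey(oldKey, newKey string) error {
	v, ok := s.entries[oldKey]
	if !ok {
		return fmt.Errorf("key %q not found", oldKey)
	}
	if _, exists := s.entries[newKey]; exists {
		return fmt.Errorf("key %q already exists", newKey)
	}
	delete(s.entries, oldKey)
	s.entries[newKey] = v
	return nil
}

// migrate looks for the stored fingerprint that is the longest strict prefix
// of newFP and rekeys that entry. It returns the old fingerprint, if any.
func migrate(s *store, namespace, newFP string) (string, bool) {
	best := ""
	s.IterateOnPrefix(namespace, func(key string, _ int64) bool {
		oldFP := strings.TrimPrefix(key, namespace)
		if len(oldFP) > len(best) && len(oldFP) < len(newFP) &&
			strings.HasPrefix(newFP, oldFP) {
			best = oldFP
		}
		return true
	})
	if best == "" {
		return "", false
	}
	// The key is updated only after iteration has finished.
	if err := s.UpdateKey(namespace+best, namespace+newFP); err != nil {
		return "", false
	}
	return best, true
}

func main() {
	s := &store{entries: map[string]int64{"input-1::6c6f67": 3}}
	old, ok := migrate(s, "input-1::", "6c6f670a6d6f7265")
	fmt.Println(old, ok, s.entries["input-1::6c6f670a6d6f7265"]) // 6c6f67 true 3
}
```

In this sketch the offset stays attached to the entry, so a reader holding the state would keep its position across the rekeying. How the PR coordinates this with running harvesters is not shown here.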

Open question/issue

Using the raw bytes (hex-encoded) as the fingerprint makes prefix matching straightforward, but it increases memory consumption: up to 1000 bytes per file for each in-memory copy of the fingerprint. It also increases the storage the registry uses on disk.
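
For a sense of scale, the sketch below compares the length of a hex-encoded raw fingerprint with a hex-encoded SHA-256 digest of the same bytes. Using SHA-256 as the hashed baseline is an assumption made for comparison, and both helper names are hypothetical. Note that hex encoding uses two characters per byte, so 1000 raw bytes become a 2000-character string if the encoded form is what is kept.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// rawHexFingerprint hex-encodes the content itself (growing-fingerprint style).
func rawHexFingerprint(content []byte) string {
	return hex.EncodeToString(content)
}

// hashedFingerprint hex-encodes a SHA-256 digest of the content. Its length
// is constant regardless of how many bytes were hashed.
func hashedFingerprint(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

func main() {
	content := make([]byte, 1000) // a full max_length worth of bytes

	fmt.Println(len(rawHexFingerprint(content))) // 2000
	fmt.Println(len(hashedFingerprint(content))) // 64
}
```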

Checklist

  • [ ] My code follows the style guidelines of this project
  • [ ] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the stresstest.sh script to run them under stress conditions and race detector to verify their stability.
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool.

AndersonQ, Dec 10 '25 15:12

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

github-actions[bot], Dec 10 '25 15:12

This pull request now has conflicts. Could you fix it? 🙏 To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b 44780-growing-fingerprint upstream/44780-growing-fingerprint
git merge upstream/main
git push upstream 44780-growing-fingerprint

mergify[bot], Dec 15 '25 20:12