[Filebeat] Filestream: Remove the limitation of not ingesting files smaller than 1kB
Current situation
Since v9.0, Filebeat's Filestream input has switched the default file_identity mode from native to fingerprint (see v8 docs, v9 docs). From the v9.0.0 release notes:
Filestream inputs now start ingesting files only if they are 1024 bytes or larger because the default file identity has changed from native to fingerprint.
The problem
The limitation of not ingesting files smaller than 1kB has already proved surprising to users. Some users have files smaller than 1kB that they expect to be ingested without delay. Some of those files will grow over time beyond 1kB and will be ingested then (after a delay); other files may never grow beyond 1kB and will never be ingested.
Some users upgrade from v8 to v9 without reading the breaking changes list carefully enough. Others start using Filebeat at v9, so they cannot be expected to read the breaking changes list at all. The limitation is described in the docs, but it still catches many users by surprise, making them think "Filebeat doesn't work".
What is in scope for this issue
One way to improve the situation is to make it easier for users to identify the problem when it happens and to point them to a solution (for example, changing the fingerprint size). That, however, is out of scope for this enhancement proposal.
This enhancement proposal focuses on potential ways to remove the file size limitation when using the fingerprint mode.
How others do it
File Log receiver in OpenTelemetry Collector
The File Log receiver also relies on fingerprinting files for identity (in fact, it's the only available mode). The default fingerprint size is also 1kB, but the File Log receiver reads files of any size and uses whatever content the file has for the fingerprint: if the file is just one byte, the fingerprint is 1 byte. Fingerprints smaller than the default are compared by prefix, so when the file is appended to, the receiver still recognizes it as the same file.
Comparing fingerprints by prefix is easy in this implementation because the original file content is used as the fingerprint, as opposed to the Filestream implementation, where the fingerprint is a hash of the file contents.
Tail plugin in Fluent Bit
The Tail plugin does not support identifying files by their contents.
File source in Vector
The File source's default identity mechanism is also fingerprinting, and it also stores only checksums of the fingerprint. By default, it takes the first line of the file; this can be configured to more lines and can be offset by a number of bytes from the beginning of the file, the same as Filestream's fingerprint.offset setting.
Possible solutions
In the spirit of brainstorming, below is a list of all possible solutions and their drawbacks and advantages. Even if a solution seems like a no-go, it might point us to a different solution.
Decrease the default fingerprint size to a value lower than 1kB
Advantages:
- easy to apply, no coding required
Drawbacks:
- 1kB is the default for a reason; decreasing it increases the risk of mistakenly treating a file as a duplicate of another
- the minimum possible size is 64 bytes, so files smaller than that will still not be ingested. Much less of a problem than the current situation, but still not ideal.
- decreasing the default fingerprint size is potentially a breaking change, as it might make two files that were previously different look like duplicates
Disable hashing of fingerprints and compare prefixes like the File Log receiver
Apply the solution that the File Log receiver uses: compare fingerprints based on prefix. This would require removing the hashing of the file contents for the fingerprint.
Advantages:
- Allows ingesting files as small as 1 byte
Disadvantages:
- Requires significant code changes
Store fingerprints of smaller files and compare them with longer
Assuming that we're only dealing with growing files (if a file decreased in size, we must assume it's a different file), we should always be able to calculate a hash of a substring. For example, if we previously fingerprinted file A of size ALen and created fingerprint AFp, and are currently looking at file B of size BLen and want to know if B is the same file as A, we can assume BLen >= ALen, so we can take ALen prefix of B, calculate its hash and compare to AFp. If and only if those hashes are the same, B is the same file as A. Note that we would need to store the length of each calculated fingerprint for that.
Advantages:
- Allows ingesting smaller files (1 byte or 64 bytes? depends on implementation?)
Disadvantages:
- Requires significant code changes
- May result in performance hit, as there's more work to do in calculating additional hashes (needs verification)
Allow to configure fingerprint.lines instead of fingerprint.length
This sounds like a reasonable and smart thing to do, as log files are split into lines anyway. As long as there's a line (or more) written, it won't change.
Advantages:
- Allows ingesting files as small as 1 byte (maybe 2 bytes actually, as we need the newline character)
Disadvantages:
- If a file does not have a newline, it won't be ingested? (subject to verification, that's what happens in Vector's implementation)
- Requires significant code changes
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
I like fingerprint.lines approach, but I suspect it may not be enough on its own.
We can probably combine fingerprint.lines with inode related information to get a robust identifier.
Something to keep in mind: the 1KB default value was derived from the typical logging data that we sampled from our support.
We do have environments where we have multiple files with identical 1000 characters in the beginning.
If we start identifying files based on anything shorter than 1000 characters, the identifier would not be unique, no matter what approach we take. Many services log identical information on startup; multiple instances started at the same time might even log the same timestamps if they're quick enough.
Introducing an offset by default (we already support it by the way) would only delay file ingestion even further.
The way filestream currently works, it needs a stable file identifier that never changes during the lifetime of the file, so we have to wait until the file grows to a size from which we can create the fingerprint hash.
We still support the previous file identity and a few other options, see https://www.elastic.co/docs/reference/beats/filebeat/file-identity
I'm open to ideas on how we can change our ingestion code, but I don't think what filelog is doing would work for our customers. It does not change the semantics much; it's just a variable header size for hashing.
> I'm open to ideas on how we can change our ingestion code, but I don't think what filelog is doing would work for our customers.
We are already encouraging every customer that wants an OTel-native solution to use filelog, and it's the only choice for this case. We are going to be anchored to comparisons with filelog indefinitely because of this.
What filelog is doing seems like a much better experience when getting started, without any seemingly arbitrary limits (from the user's perspective; I understand why we picked this value), and it doesn't risk data loss from unexpectedly short files.
> The way filestream currently works, it needs a stable file identifier that never changes during the lifetime of the file, so we have to wait until the file grows to a size from which we can create the fingerprint hash.
That filestream can't support a file identifier that changes seems like the main limitation to using filelog's prefix-based approach. Is it possible for us to change this?
At some point we will be tasked with supporting migration from filestream to filelog; getting their file identities to converge will, I suspect, be a prerequisite for this.
This is affecting agent's own self-monitoring because the log files from the Elastic Agent watcher are frequently smaller than 1024 bytes. See an example in the diagnostics in https://github.com/elastic/elastic-agent/issues/9534, the logs are full of:
{"log.level":"warn","@timestamp":"2025-08-22T07:37:16.915Z","message":"3 files are too small to be ingested, files need to be at least 1024 in size for ingestion to start. To change this behaviour set 'prospector.scanner.fingerprint.length' and 'prospector.scanner.fingerprint.offset'. Enable debug logging to see all file names.","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"filestream_id":"filestream-monitoring-agent","ecs.version":"1.6.0","log.logger":"input.scanner","log.origin":{"file.line":440,"file.name":"filestream/fswatch.go","function":"github.com/elastic/beats/v7/filebeat/input/filestream.(*fileScanner).GetFiles"},"service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2025-08-22T07:37:26.915Z","message":"3 files are too small to be ingested, files need to be at least 1024 in size for ingestion to start. To change this behaviour set 'prospector.scanner.fingerprint.length' and 'prospector.scanner.fingerprint.offset'. Enable debug logging to see all file names.","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"service.name":"filebeat","filestream_id":"filestream-monitoring-agent","ecs.version":"1.6.0","log.logger":"input.scanner","log.origin":{"file.line":440,"file.name":"filestream/fswatch.go","function":"github.com/elastic/beats/v7/filebeat/input/filestream.(*fileScanner).GetFiles"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2025-08-22T07:37:36.915Z","message":"3 files are too small to be ingested, files need to be at least 1024 in size for ingestion to start. To change this behaviour set 'prospector.scanner.fingerprint.length' and 'prospector.scanner.fingerprint.offset'. Enable debug logging to see all file names.","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"filestream_id":"filestream-monitoring-agent","ecs.version":"1.6.0","log.logger":"input.scanner","log.origin":{"file.line":440,"file.name":"filestream/fswatch.go","function":"github.com/elastic/beats/v7/filebeat/input/filestream.(*fileScanner).GetFiles"},"service.name":"filebeat","ecs.version":"1.6.0"}
The watcher files are all 660 bytes:
~/Downloads/elastic-agent-diagnostics-2025-08-22T07-38-23Z-00
❯ ls -l logs/elastic-agent-9.2.0-SNAPSHOT-bce00e
total 2544
-rw-rw-r--@ 1 cmackenzie staff 991231 Aug 22 03:34 elastic-agent-20250822-1.ndjson
-rw-rw-r--@ 1 cmackenzie staff 280196 Aug 22 03:38 elastic-agent-20250822-2.ndjson
-rw-rw-r--@ 1 cmackenzie staff 14763 Aug 22 03:06 elastic-agent-20250822.ndjson
-rw-rw-r--@ 1 cmackenzie staff 660 Aug 22 03:06 elastic-agent-watcher-20250822-1.ndjson
-rw-rw-r--@ 1 cmackenzie staff 660 Aug 22 03:34 elastic-agent-watcher-20250822-2.ndjson
-rw-rw-r--@ 1 cmackenzie staff 660 Aug 22 03:06 elastic-agent-watcher-20250822.ndjson
The log content itself is two log lines:
{"log.level":"info","@timestamp":"2025-08-22T07:06:54.758Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/cmd.watchCmd","file.name":"cmd/watch.go","file.line":77},"message":"Upgrade Watcher started","process.pid":5366,"agent.version":"9.2.0","config":{"grace_period":600000000000,"error_check":{"interval":30000000000}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-08-22T07:06:54.760Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/cmd.watchCmd","file.name":"cmd/watch.go","file.line":86},"message":"update marker not present at '/opt/Elastic/Agent/data'","ecs.version":"1.6.0"}
These are normal log lines that will occur at every agent boot so we are adding quite a bit of log noise and we are unable to read our own log files by default. I think this increases the priority for addressing this.
hey all,
I started to look at this. After a quick look, @cmacknz, it seems you'd like us to offer a file identity that works just like fileLog. Did I understand it right?
It seems to me any solution allowing for a short fingerprint would need to allow it to grow to avoid considering different files with the same header as the same.
> That filestream can't support a file identifier that changes seems like the main limitation to using filelog's prefix-based approach. Is it possible for us to change this?
I was giving it some thought, and it should be possible. We'd need to account for that when loading states from the store and comparing them to the files we find. The store itself is another issue: a naive approach would be, if the fingerprint changes, to mark the old entry as a deleted file and create a new entry for the updated fingerprint.
The ack and persistence of the current state might be the tricky part, as any new event generated while the fingerprint is growing would be associated with the "old" fingerprint.
As each harvester handles a single file, it should be possible for it to know the fingerprint is changing and make the ack work.
What about updating the fingerprint in-place? Could it help preserve state and avoid unnecessary re-ingestion?
It would potentially introduce risks around file identity ambiguity and registry consistency but it might be feasible with additional safeguards. WDYT?
what do you mean by "updating the fingerprint in-place". Updating it in-place where?
In the registry
It isn't possible; the registry is append-only. That's the issue.
Have we thought about introducing an "identity change" registry operation that allows remapping files to new identities without having to re-ingest them?
That would work with an append only registry.
Not really; if I'm not mistaken there are plans to change the registry, so I was not considering changes to it. Besides, the disk store is generic, just a key-value store. I can give it some thought, but at first glance it feels rather hacky. Also, the file identity is part of the key, so it'd basically be a "change key" operation.
There is an issue that we should evaluate changes to the on disk format yes, that's https://github.com/elastic/beats/issues/46939. There are no conclusions yet but I would agree if we do plan to change it, we should do that first and consider how to use those changes to address those problem.
hey, I think it's doable. It'd be a new file identity, and the matching would have an extra step, but doable. There is still the issue of the store size if the fingerprint is the raw bytes, but we can address that later.
Here it is in more detail:
Proposed Solution
Introduce a new file identity type: growing_fingerprint
Key Design Decisions
- Fingerprint = raw bytes (hex-encoded), not a hash
  - Allows prefix matching
  - A hash would produce completely different values when the file size changes, and "prefix" matching on hashes would be much more expensive and complex.
- Registry key format: growing_fingerprint::<raw_hex_bytes>
  - Registry otherwise unchanged
  - New entries as the key changes while the file grows
  - Migration (copy state to the new key, delete the old key)
- Lazy migration
  - When a file event comes in and the exact key match fails, we scan for prefix matches
  - If one is found, migrate the entry to the new key
  - No special startup logic needed
- Collision accepted
  - If two files have identical content at the beginning, they will share a registry entry until they diverge
  - Once the content diverges, each gets its own entry
  - This is the same trade-off OTEL/fileLog makes
- max_length configurable
  - Same as the current fingerprint length configuration
How It Works
File Tracking Flow
1. Scanner reads first min(file_size, max_length) bytes
2. Hex-encode as fingerprint → becomes part of registry key
3. On lookup:
a. Try exact key match
b. If no match, scan for entries where stored fingerprint is a prefix of current
c. If prefix match found → same file grew → migrate to new key
d. If no match → new file
Example: File Growth
Time 1: File is 500 bytes
→ Fingerprint: "48656c6c6f..." (500 bytes = 1000 hex chars)
→ Key: growing_fingerprint::48656c6c6f...
→ Cursor: offset=500
Time 2: File grows to 1000 bytes
→ New fingerprint: "48656c6c6f...7a7a7a" (longer)
→ Exact key match fails
→ Prefix search finds old key (old fingerprint is prefix of new)
→ Migrate: copy state to new key, delete old key
→ Continue from offset=500
Example: File Truncation/Rotation
Time 1: File has fingerprint "aaabbbccc"
→ Key: growing_fingerprint::aaabbbccc
Time 2: File truncated and rewritten with different content
→ New fingerprint: "xxxyyyz"
→ No exact match, no prefix match
→ Treated as new file, start from offset=0
Example: Collision
File A (500 bytes): "Hello World\n..." → fingerprint "48656c6c6f..."
File B (500 bytes): "Hello World\n..." → same fingerprint
Both share the same registry entry until one grows differently.
File A grows: "Hello World\nAAAA..." → fingerprint "48656c6c6f...41414141"
File B grows: "Hello World\nBBBB..." → fingerprint "48656c6c6f...42424242"
Now they have different keys and are tracked separately.
There is also the memory usage of 1kB per file kept in memory. It can be addressed by using a hash; it just makes the "prefix" comparison more expensive, as a different hash would need to be calculated for each file against each stored fingerprint length to try to match the entries in the registry. It can be done as a later optimisation or from the beginning.
Just for the record: I'm working on a POC for a growing fingerprint. Integrating a growing fingerprint into filestream is, I believe, the most delicate part. With that we can better assess what a final version would look like, especially if we want to go for a hash instead of the raw bytes.
Also, perhaps we might consider testing whether the filelog receiver addresses the edge cases (several files with the same header which grow and then diverge from one another) to understand how our version might differ from it.
cc: @cmacknz, @nimarezainia
I put up the POC I did:
- https://github.com/elastic/beats/pull/48025
I think the most interesting part are the tests so you can see what works.
The CI still needs to run to see if it didn't break anything