Beats panic if there are more than 32767 pipeline clients
This has been tested with Filebeat, but the bug is in libbeat, so it likely affects all Beats. The OS tested was Linux, but the issue is not dependent on the OS.
Description
When Filebeat, using the filestream input, is configured to harvest more than 32767 files at once, it panics (other inputs are likely affected as well; the log input is not). This happens because for each file, two elements are added to a slice of channels by Pipeline.runSignalPropagation; once this slice exceeds 65536 elements, a reflect.Select on this slice causes Filebeat to panic. The panic happens here: https://github.com/elastic/beats/blob/ae312c5b288c8377ed1a87adda80e153c363cea5/libbeat/publisher/pipeline/pipeline.go#L324.
For every new client, this infinite for loop adds two elements to the slice of channels: https://github.com/elastic/beats/blob/ae312c5b288c8377ed1a87adda80e153c363cea5/libbeat/publisher/pipeline/pipeline.go#L332-L346
Once the slice holds more than 65536 elements, Filebeat panics with a message/stack trace like this:
panic: reflect.Select: too many cases (max 65536)
goroutine 277 [running]:
reflect.Select({0xc26e47e000?, 0xc360100000?, 0xc024488f10?})
reflect/value.go:2873 +0x79a
github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*Pipeline).runSignalPropagation(0xc02f6b47b8?)
github.com/elastic/beats/v7/libbeat/publisher/pipeline/pipeline.go:330 +0x1d8
created by github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*Pipeline).registerSignalPropagation.func1
github.com/elastic/beats/v7/libbeat/publisher/pipeline/pipeline.go:315 +0x97
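The limit is easy to reproduce outside of Beats. The following standalone program (a minimal sketch, not Beats code) builds one receive case per channel, mirroring how the pipeline accumulates two cases per client, and panics with the same message:

package main

import "reflect"

func main() {
    // reflect.Select refuses more than 65536 cases. Build one more
    // than that, one receive case per channel.
    cases := make([]reflect.SelectCase, 0, 65537)
    for i := 0; i < 65537; i++ {
        cases = append(cases, reflect.SelectCase{
            Dir:  reflect.SelectRecv,
            Chan: reflect.ValueOf(make(chan struct{})),
        })
    }

    // Panics before blocking:
    // panic: reflect.Select: too many cases (max 65536)
    reflect.Select(cases)
}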
How to reproduce
The easiest way to reproduce this issue is to create 33000 small log files and then start Filebeat to harvest them. While testing, I ran into some OS limitations when trying to keep all files constantly updated.
You can use anything to generate the files; I used the following shell script and flog:
#!/bin/bash
# Generate $1 small log files in /tmp, 100 lines each, using flog.
for i in $(seq 1 "$1")
do
    path=$(printf "/tmp/log%010d.log" "$i")
    flog -s 1 -n 100 > "$path"
    echo "$path"
done
Save it as gen-logs.sh, make it executable, and run ./gen-logs.sh 33000.
Use the file output for simplicity and for easy validation that all logs have been ingested. The output and logging configuration below ensure there will be a single log file and a single output file.
filebeat.inputs:
  - type: filestream
    id: my-unique-id-per-filebeat-process
    paths:
      - /tmp/log*.log
    #harvester_limit: 32000

output:
  file:
    enabled: true
    path: ${path.home}/output
    filename: out
    rotate_every_kb: 100000000
    rotate_on_startup: false

logging:
  level: debug
  selectors:
    - crawler
    - file_watcher
    - input
    - input.filestream
    - input.harvester
    - registrar
    - scanner
    - service
  files:
    rotateeverybytes: 10485760000
    rotateonstartup: false
Start Filebeat and wait until it panics.
If you uncomment the #harvester_limit: 32000 line, Filebeat will work without issues and ingest all files. If you use the script provided, there should be 3300000 events in the output file; you can verify that with:

wc -l output/*
Workaround
One workaround is to set harvester_limit to a number smaller than 32000 when using a single input. When using multiple affected inputs, every input should limit the number of pipeline clients it creates so that a running Filebeat process never has more than 32000 pipeline clients running concurrently.
For the filestream input, here is an example configuration:
filebeat.inputs:
  - type: filestream
    id: my-unique-id-per-filebeat-process
    paths:
      - /tmp/log*.log
    harvester_limit: 32000
> One workaround is to set the harvester_limit to a number smaller than 32000 if using a single input.
How is this a workaround? Do the additional harvesters still execute after waiting for some undefined period of time? Or do they possibly never execute? If the latter, this seems more like converting an obvious failure into a silent failure.
Of the 32K active harvesters, how many of them are actually sending data concurrently? I wonder if it would be more effective to put the beat.Clients into something like a sync.Pool so that they are only actually kept around if they are used?
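For illustration only, the pooling idea might look roughly like this; pipelineClient, Event, and publishWithPooledClient are hypothetical stand-ins for the real libbeat types, not actual Beats APIs:

package main

import (
    "fmt"
    "sync"
)

// Hypothetical stand-ins for beat.Event and beat.Client.
type Event struct{ Msg string }

type pipelineClient struct{ mu sync.Mutex }

func (c *pipelineClient) Publish(e Event) {
    c.mu.Lock()
    defer c.mu.Unlock()
    fmt.Println("published:", e.Msg)
}

// Idle clients sit in the pool (and may be garbage collected) instead
// of permanently occupying slots in the pipeline's signal slice.
var clientPool = sync.Pool{
    New: func() any { return new(pipelineClient) },
}

func publishWithPooledClient(e Event) {
    c := clientPool.Get().(*pipelineClient)
    defer clientPool.Put(c)
    c.Publish(e)
}

func main() {
    publishWithPooledClient(Event{Msg: "hello"})
}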
Is there a way to tell how many of the pipeline clients are idle and for how long?
Another workaround at the input level would be to figure out a way to have multiple inputs harvest the set of files, essentially sharding the input workload.
We could also shard or change the structure of the select cases; it looks like we are only waiting on the done signals. We could just create another select once we go past the limit on how many cases a single select can handle:
https://github.com/elastic/beats/blob/ae312c5b288c8377ed1a87adda80e153c363cea5/libbeat/publisher/pipeline/pipeline.go#L332-L342
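A rough sketch of that sharding idea (shardedSelect is a hypothetical helper, not existing pipeline code): split the cases into chunks below the reflect.Select limit and wait on each chunk in its own goroutine. A real implementation would also need to cancel the losing shards, which this sketch leaves blocked:

package pipeline

import "reflect"

const maxSelectCases = 65536

// shardedSelect waits on an arbitrary number of cases by running one
// reflect.Select per chunk and returning whichever case fires first.
func shardedSelect(cases []reflect.SelectCase) (int, reflect.Value) {
    type result struct {
        chosen int
        recv   reflect.Value
    }
    results := make(chan result, 1)

    for offset := 0; offset < len(cases); offset += maxSelectCases {
        end := offset + maxSelectCases
        if end > len(cases) {
            end = len(cases)
        }
        go func(offset int, chunk []reflect.SelectCase) {
            i, v, _ := reflect.Select(chunk)
            select {
            case results <- result{offset + i, v}:
            default: // another shard already delivered a result
            }
        }(offset, cases[offset:end])
    }

    r := <-results
    return r.chosen, r.recv
}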
> How is this a workaround? Do the additional harvesters still execute after waiting for some undefined period of time? Or do they possibly never execute? If the latter, this seems more like converting an obvious failure into a silent failure.
The harvester_limit setting limits the number of concurrent harvesters; once the limit is reached, new harvesters are not created until old ones are closed.
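For intuition, those semantics are that of a counting semaphore, sketched below with a buffered channel; harvesterLimit, runHarvester, and harvest are hypothetical names, not the actual filestream implementation:

package main

import "fmt"

const harvesterLimit = 32000

// A buffered channel used as a counting semaphore: starting a
// harvester blocks while harvesterLimit are already running, and
// closing one frees a slot for the next.
var slots = make(chan struct{}, harvesterLimit)

func runHarvester(path string) {
    slots <- struct{}{}        // acquire; blocks at the limit
    defer func() { <-slots }() // release when the harvester closes
    harvest(path)
}

func harvest(path string) { fmt.Println("harvesting", path) }

func main() {
    runHarvester("/tmp/log0000000001.log")
}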
> Of the 32K active harvesters, how many of them are actually sending data concurrently?
In my tests, not many, as the files were not being updated. In theory all files could be updated live, which would also cause other issues, such as the harvesters that are not running being starved. That is definitely a scale edge case we do not cover well.
> I wonder if it would be more effective to put the beat.Clients into something like a sync.Pool so that they are only actually kept around if they are used?
I did not investigate the code to see how we could mitigate it that way. At least in filestream there is an infinite loop that reads from the file and then publishes the line/event read. Each beat.Client carries some configuration for close and ACK handling; off the top of my head I don't know how reusable they are.
The affected inputs are the ones that call pipeline.ConnectWith (https://github.com/elastic/beats/blob/ae312c5b288c8377ed1a87adda80e153c363cea5/libbeat/publisher/pipeline/pipeline.go#L211-L215), and as an argument they pass the configuration they want, which includes a processors list :cry:
https://github.com/elastic/beats/blob/ae312c5b288c8377ed1a87adda80e153c363cea5/libbeat/beat/pipeline.go#L47-L50
https://github.com/elastic/beats/blob/ae312c5b288c8377ed1a87adda80e153c363cea5/libbeat/beat/pipeline.go#L103-L137
We probably could reuse the same beat.Client for all harvesters belonging to the same filestream input. Other inputs would have to implement their own solution.
> Is there a way to tell how many of the pipeline clients are idle and for how long?
I did not look into this. In my test case I had 33000 files with 100 lines each, so most of the harvesters/clients were certainly idle.
> We could just create another select once we go past the limit on how many cases a single select can handle
That sounds like a pretty good idea! I can easily see that working.
> We probably could reuse the same beat.Client for all harvesters belonging to the same filestream input.
We likely wouldn't want to do this, because the Publish method of the client executes the input processors while holding a single mutex. In effect this means only one event can be published at a time: the beat.Client is not concurrent, so if you share clients you'll limit concurrency significantly.
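A minimal sketch of that constraint, assuming a simplified client (sharedClient and runProcessors are hypothetical, not the libbeat types): every goroutine sharing the client queues up behind the same mutex, so publishing is fully serialized:

package main

import (
    "fmt"
    "sync"
    "time"
)

// Simplified model of a client whose Publish runs processors under
// one mutex: N goroutines sharing it publish one event at a time.
type sharedClient struct{ mu sync.Mutex }

func (c *sharedClient) Publish(event string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    runProcessors(event) // all concurrent publishers wait here
}

func runProcessors(event string) {
    time.Sleep(10 * time.Millisecond) // stand-in for processor work
    fmt.Println("published:", event)
}

func main() {
    var c sharedClient
    var wg sync.WaitGroup
    start := time.Now()
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            c.Publish(fmt.Sprintf("event-%d", i))
        }(i)
    }
    wg.Wait()
    // With 10 publishers and ~10ms of processor work each, total time
    // is ~100ms: the shared mutex serializes all publishes.
    fmt.Println("elapsed:", time.Since(start))
}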
> We could just create another select once we go past the limit on how many cases a single select can handle

> That sounds like a pretty good idea! I can easily see that working.
Agreed, I think this is the idea to pursue first.