[DOCS] Document side-effects of the loadbalance setting
I think that on the loadbalancing page, or wherever the loadbalance
option is mentioned, we should warn about the following: if one of the Logstash workers is slow (as opposed to just down), it has the effect of slowing down all the other Logstash workers.
A few more details on how things work with loadbalance and without loadbalance would also be very helpful, I think (perhaps on a separate page).
From what I understand:
With loadbalance:
- Filebeat reads batches of events and sends each batch to one Logstash worker (round-robin).
- If a connection is dropped, Filebeat takes the Logstash worker out of its pool.
- It retries with exponential back-off to reconnect. If it succeeds, it re-adds the worker to the pool.
- If one of the Logstash nodes is alive but very slow, Filebeat won't move on to new batches until the current in-flight batch is ACKed. This causes Filebeat to pause sending batches to the fast Logstash nodes until the slow node answers, which slows down the whole ingestion (a minimal sketch of this effect follows below).
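A minimal sketch of that effect, assuming round-robin dispatch and an in-flight window of 2 (the delays, window size, and two-node setup are illustrative assumptions, not taken from the Beats code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Per-node simulated processing+ACK latency: one fast node, one slow one.
	delays := []time.Duration{10 * time.Millisecond, 500 * time.Millisecond}
	const window = 2 // max batches in flight before the sender must wait

	var inflight []chan struct{}
	start := time.Now()
	for batch := 0; batch < 10; batch++ {
		if len(inflight) == window {
			<-inflight[0] // block until the oldest in-flight batch is ACKed
			inflight = inflight[1:]
		}
		done := make(chan struct{})
		go func(d time.Duration) {
			time.Sleep(d) // simulate send + Logstash processing + ACK
			close(done)
		}(delays[batch%2]) // round-robin over the two nodes
		inflight = append(inflight, done)
	}
	for _, done := range inflight {
		<-done
	}
	// With both nodes fast this finishes in ~50ms; with one slow node,
	// dispatch repeatedly stalls waiting on its ACKs (~2.5s total).
	fmt.Println("elapsed:", time.Since(start))
}
```

Even though the fast node could accept more work, the bounded window forces the sender to wait on the slow node's ACKs, so throughput converges toward the slowest member of the pool.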
Without loadbalance (the default):
- Filebeat picks a random Logstash host and sends batches to that one. Due to the random selection, if you have many more Filebeat instances than Logstash nodes, the load on the Logstash nodes should be roughly equal.
- In case of errors, Filebeat picks another Logstash node, also at random. The failed host is only retried if there are errors on the new connection.
- This means the Logstash nodes are more independent, each serving a distinct set of Filebeat clients (see the sketch after this list).
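A corresponding sketch of the non-load-balanced mode as described above (again illustrative: the host list, error rate, and fail-over-to-a-random-host logic are assumptions based on this description, not the actual Beats implementation):

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	hosts := []string{"ls1:5044", "ls2:5044", "ls3:5044"}
	current := hosts[rand.Intn(len(hosts))] // each client sticks to one random host

	for batch := 1; batch <= 8; batch++ {
		if rand.Float64() < 0.25 { // simulate an occasional connection error
			next := hosts[rand.Intn(len(hosts))] // fail over to another random host
			fmt.Printf("error on %s, switching to %s\n", current, next)
			current = next
		}
		fmt.Printf("batch %d -> %s\n", batch, current)
	}
}
```

Across many Filebeat instances the random choice spreads clients roughly evenly over the Logstash nodes, and a slow node only affects the clients currently pinned to it.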
@urso might want to check the above ^ :)
Pinging @dedemorton.
@tsg Thanks!
Some more notes:
- LB is not round-robin, but "dynamic", based on a work queue shared between the outputs. We have host × worker outputs if LB is enabled.
- If Logstash is slow but "healthy", it sends a keep-alive signal until the full batch is processed. This blocks Filebeat from processing until the ACK is received. A long-running grok statement can stall Filebeat.
- The reason for this behavior is Filebeat. In the presence of load balancing and IO errors, Filebeat needs to fix up the order of ACKs, such that it can forward ACKs to the registry file in the same order events have been published. This requires Filebeat to keep all events in memory until after the ACK happened. Memory usage is restricted by the queue size. (A small sketch of this in-order ACK bookkeeping follows below.)
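A small sketch of that in-order ACK bookkeeping (illustrative only, not the libbeat implementation; batch IDs and the ACK arrival order are made up): ACKs can arrive out of order from different connections, but the registry is only advanced over a contiguous prefix of the publish order, so a slow batch holds everything after it in memory.

```go
package main

import "fmt"

func main() {
	published := []int{1, 2, 3, 4, 5} // batch IDs in publish order
	acked := map[int]bool{}           // out-of-order ACKs received so far
	next := 0                         // index of the oldest un-forwarded batch

	forward := func(id int) { fmt.Println("registry advanced past batch", id) }

	for _, id := range []int{2, 3, 1, 5, 4} { // simulated ACK arrival order
		acked[id] = true
		// Forward only the contiguous prefix starting at the oldest batch;
		// until batch 1 is ACKed, batches 2 and 3 must stay in memory.
		for next < len(published) && acked[published[next]] {
			forward(published[next])
			next++
		}
	}
}
```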
@dedemorton @tsg It seems this got lost in the documentation update. Would it be possible to add a warning to the Filebeat documentation for customers who enable load balancing for the output?
Yes, I'm sorry. It's been in our backlog for a while. I'll talk to the team to see if we can bump the priority.
Pinging @elastic/obs-docs (Team:Docs)
We have observed this in the scenario of the Elasticsearch output with the load balance option. If one of the coordinating nodes in the hosts array is slow, the result was not a 50% but an approx. 90% impairment of the overall throughput.
Added the needs-input label because we need a subject matter expert to confirm that the details in this issue are still correct and identify any gaps in the information.
Just out of curiosity, I've tested Beats' load balancing against 2x Logstash with 8.5.2; the result appears to confirm that the side effect is (still) observable.
Pipeline: yes command over UDP -> Filebeat with loadbalance enabled -> 2x Logstash -> /dev/null
# docker-compose.yml
version: "3.9"
services:
  logstash1:
    image: docker.elastic.co/logstash/logstash:8.5.2
    command: --quiet -e "input{beats{port=>5044}} output{file{path=>'/dev/null'}}"
    environment:
      - XPACK_MONITORING_ENABLED=false
    deploy:
      resources:
        limits:
          cpus: 1.0
  logstash2:
    image: docker.elastic.co/logstash/logstash:8.5.2
    command: --quiet -e "input{beats{port=>5044}} output{file{path=>'/dev/null'}}"
    environment:
      - XPACK_MONITORING_ENABLED=false
    deploy:
      resources:
        limits:
          cpus: 1.0
          # cpus: 0.1 # Intentional bottleneck
  filebeat:
    image: docker.elastic.co/beats/filebeat:8.5.2
    command: >
      -e
      -E filebeat.inputs.0.type=udp
      -E filebeat.inputs.0.host="0.0.0.0:12345"
      -E output.elasticsearch.enabled=false
      -E output.logstash.enabled=true
      -E output.logstash.loadbalance=true
      -E output.logstash.hosts=["logstash1:5044","logstash2:5044"]
      -E monitoring.enabled=true
      -E monitoring.cloud.id=changeme
      -E monitoring.cloud.auth=changeme
  busybox:
    image: busybox
    command: sh -c 'sleep 60 && yes | nc -u filebeat 12345'
Equal CPU resources on backend Logstash
With cpus=1.0 on both logstash1 and logstash2, the pipeline processes approx. 6500 events per second.
Weighted CPU resources on backend Logstash
With cpus=1.0 on logstash1 alone, the pipeline processes approx. 4500 events per second. But once logstash2 with cpus=0.1 joins, performance drops to approx. 1000 events per second.
Bump. If I understand the issue correctly, we should add more warnings to our docs against using the loadbalance: true option when a slow Logstash node may be in the pool.
Btw, the original post mentioned that:
"Without loadbalance (the default):"
As of 8.10, the default has been set to true, i.e. with loadbalance.
The documentation might be out of sync with the code here, since our code has the default for Logstash set to LoadBalance: false:
https://github.com/elastic/beats/blob/e322104a8cc25b420f7764a50a9f801e7cba6aa0/libbeat/outputs/logstash/config.go#L53-L55
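For reference, a reduced paraphrase of what those linked lines express (not a verbatim copy of config.go; the struct is trimmed to the single relevant field, and the tag is an assumption):

```go
package main

import "fmt"

// Config is a trimmed stand-in for the Logstash output settings;
// only the field relevant to this issue is shown.
type Config struct {
	LoadBalance bool `config:"loadbalance"`
}

// defaultConfig mirrors the linked default: load balancing off.
func defaultConfig() Config {
	return Config{LoadBalance: false}
}

func main() { fmt.Printf("%+v\n", defaultConfig()) }
```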
"As of 8.10, the default has been set to true, i.e. with loadbalance"
Looks like this was changed in https://github.com/elastic/beats/commit/046954da6283dedaa3d424e8c9905932db7a783f
CC @kilfoyle I suspect this might be a copy-paste error (or an assumption that things were consistent between outputs) from your commit above, since Logstash defaults to false but Elasticsearch and Redis both default to true.
Thanks @cmacknz! Here's a PR to get that fixed.