[DOCS] Document side-effects of the loadbalance setting

Open tsg opened this issue 5 years ago • 5 comments

I think that on the load balancing page, or wherever the loadbalance option is mentioned, we should warn about the following: if one of the Logstash workers is slow (as opposed to just down), it has the effect of slowing down all the other Logstash workers.

A few more details on how things work with loadbalance and without loadbalance would also be very helpful, I think (perhaps on a separate page).

From what I understand:

With loadbalance:

  • Filebeat reads batches of events and sends each batch to one Logstash worker (round-robin).
  • If a connection is dropped, Filebeat takes the Logstash worker out of its pool.
  • Filebeat retries reconnecting with exponential back-off. If it succeeds, it re-adds the worker to the pool.
  • If one of the Logstash nodes is alive but very slow, Filebeat won't move on to new batches until the current in-flight batch is ACKed. This causes Filebeat to pause sending batches to the fast LS nodes until the slow LS node answers, which slows down the whole ingestion pipeline.
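
For reference, a minimal filebeat.yml sketch of this mode might look as follows (a sketch only; host names are placeholders):

# filebeat.yml (sketch; host names are placeholders)
output.logstash:
  hosts: ["ls-node-1:5044", "ls-node-2:5044"]
  loadbalance: true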

Without loadbalance (the default):

  • Filebeat picks a random Logstash host and sends batches to that one. Due to the random selection, if you have many more Filebeat instances than LS nodes, the load on the LS nodes should be roughly equal.
  • In case of errors, Filebeat picks another LS node, also at random. The failed host is only retried if there are errors on the new connection.
  • This means that the LS nodes are more independent, each serving a distinct set of Filebeat clients.
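
And a sketch of the default, where load balancing is left disabled (hosts are again placeholders):

# filebeat.yml (sketch)
output.logstash:
  hosts: ["ls-node-1:5044", "ls-node-2:5044"]
  loadbalance: false  # the default: one host is picked at random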

@urso might want to check the above ^ :)

Pinging @dedemorton.

tsg avatar Jun 17 '19 10:06 tsg

@tsg Thanks!

Some more notes:

  • LB is not round-robin, but 'dynamic', based on a work queue shared between the outputs. We have host*worker outputs if LB is enabled.
  • If Logstash is slow but 'healthy', it sends a keep-alive signal until the full batch is processed. This blocks FB from processing until the ACK is received. A long-running grok statement can stall FB.
  • The reason for this behavior is Filebeat. In the presence of load balancing and IO errors, Filebeat needs to fix up the order of ACKs so that it can forward ACKs to the registry file in the same order the events were published. This requires Filebeat to keep all events in memory until after the ACK has happened. Memory usage is restricted by the queue size.
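
To illustrate the host*worker and queue-size points, a hedged filebeat.yml sketch (values are illustrative, not recommendations):

# filebeat.yml (sketch)
output.logstash:
  hosts: ["ls-node-1:5044", "ls-node-2:5044"]
  loadbalance: true
  worker: 2            # 2 hosts * 2 workers = 4 outputs sharing the work queue
queue.mem:
  events: 4096         # bounds how many un-ACKed events Filebeat holds in memory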

urso avatar Jun 17 '19 10:06 urso

@dedemorton @tsg It seems this was lost track of in the documentation update. Would it be possible to add a warning to the Filebeat documentation for customers who enable load balancing on the output?

sophiaxu8 avatar Mar 17 '21 22:03 sophiaxu8

Yes, I'm sorry. It's been in our backlog for a while. I'll talk to the team to see if we can bump the priority.

dedemorton avatar Mar 18 '21 04:03 dedemorton

Pinging @elastic/obs-docs (Team:Docs)

elasticmachine avatar Mar 18 '21 04:03 elasticmachine

We have observed this in the scenario of the Elasticsearch output with the load balance option. If one of the coordinating nodes in the host array is slow, the result was not a 50% but an approx. 90% impairment of the overall throughput.
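
For context, a minimal sketch of that scenario (host names are placeholders):

# filebeat.yml (sketch)
output.elasticsearch:
  hosts: ["coord-node-1:9200", "coord-node-2:9200"]
  loadbalance: true  # defaults to true for the Elasticsearch output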

ppf2 avatar Apr 26 '22 00:04 ppf2

Added the needs-input label because we need a subject matter expert to confirm that the details in this issue are still correct and identify any gaps in the information.

dedemorton avatar Nov 02 '22 18:11 dedemorton

Just out of curiosity, I've tested Beats' load balancing against 2x Logstash with 8.5.2; the result appears to show the side effect is (still) observable.

Pipeline: yes command over UDP -> Filebeat with loadbalance enabled -> 2x Logstash -> /dev/null
# docker-compose.yml

version: "3.9"
services:
  logstash1:
    image: docker.elastic.co/logstash/logstash:8.5.2
    command: --quiet -e "input{beats{port=>5044}} output{file{path=>'/dev/null'}}"
    environment:
      - XPACK_MONITORING_ENABLED=false
    deploy:
      resources:
        limits:
          cpus: 1.0
  logstash2:
    image: docker.elastic.co/logstash/logstash:8.5.2
    command: --quiet -e "input{beats{port=>5044}} output{file{path=>'/dev/null'}}"
    environment:
      - XPACK_MONITORING_ENABLED=false
    deploy:
      resources:
        limits:
          cpus: 1.0
          # cpus: 0.1  # Intentional bottleneck
  filebeat:
    image: docker.elastic.co/beats/filebeat:8.5.2
    command: >
      -e
      -E filebeat.inputs.0.type=udp
      -E filebeat.inputs.0.host="0.0.0.0:12345"
      -E output.elasticsearch.enabled=false
      -E output.logstash.enabled=true
      -E output.logstash.loadbalance=true
      -E output.logstash.hosts=["logstash1:5044","logstash2:5044"]
      -E monitoring.enabled=true
      -E monitoring.cloud.id=changeme
      -E monitoring.cloud.auth=changeme
  busybox:
    image: busybox
    # Wait for Filebeat to start, then flood it with 'y' lines over UDP
    command: sh -c 'sleep 60 && yes | nc -u filebeat 12345'

Equal CPU resources on backend Logstash

With cpus=1.0 on both logstash1 and logstash2, the pipeline processes approx. 6500 events per second.

Weighted CPU resources on backend Logstash

With cpus=1.0 on logstash1, the pipeline processes approx. 4500 events per second. But once logstash2 with cpus=0.1 joins, the throughput drops to approx. 1000 events per second.

sakurai-youhei avatar Dec 03 '22 11:12 sakurai-youhei

Bump. If I understand the issue correctly, we should add more warnings in our docs to avoid using the loadbalance: true option when a slow LS node is in the pool.

Btw, the original post mentioned that:

Without loadbalance (the default):

As of 8.10, the default has been set to true, i.e. with loadbalance enabled.
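
Given the change in default, it may be safest to set the option explicitly rather than relying on it, e.g.:

# filebeat.yml (sketch)
output.logstash:
  hosts: ["ls-node-1:5044", "ls-node-2:5044"]
  loadbalance: false  # set explicitly instead of relying on the default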

Leaf-Lin avatar Oct 18 '23 00:10 Leaf-Lin

The documentation might be out of sync with the code here, since our code has the default for Logstash set to LoadBalance: false:

https://github.com/elastic/beats/blob/e322104a8cc25b420f7764a50a9f801e7cba6aa0/libbeat/outputs/logstash/config.go#L53-L55

As of 8.10, the default has been set to true, ie with loadbalance

Looks like this was changed in https://github.com/elastic/beats/commit/046954da6283dedaa3d424e8c9905932db7a783f

CC @kilfoyle I suspect this might be a copy-paste error (or an assumption that things were consistent between outputs) in your commit above, since Logstash defaults to false but Elasticsearch and Redis both default to true.

cmacknz avatar Oct 18 '23 19:10 cmacknz

CC @kilfoyle I suspect this might be a copy-paste error (or an assumption that things were consistent between outputs) in your commit above, since Logstash defaults to false but Elasticsearch and Redis both default to true.

Thanks @cmacknz! Here's a PR to get that fixed.

kilfoyle avatar Oct 18 '23 20:10 kilfoyle