graylog2-server
Optimizing indices after index rotation blocks master node's ingestion
Optimizing indices after index rotation blocks the master node's ingestion until completed. This is not a scalable behavior.
If the Graylog deployment has a high ingestion volume and index rotation is set to P1D (daily), the Graylog master node can stop processing messages for hours. The Graylog master does not communicate this to the load balancer, and as a result its buffers fill up.
Expected Behavior
The Graylog master node can continue to process messages during Elasticsearch index optimization.
Current Behavior
The Graylog master node cannot process messages during Elasticsearch index optimization.
Possible Solution
Allow the master node to continue ingesting during index rotation. Shift force merge requests to a different thread pool that does not block ingestion.
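A minimal sketch of that idea (class and method names are hypothetical, not Graylog's actual code): dispatch force-merge calls onto a small, dedicated executor so the thread that drives message processing never waits on a merge.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not Graylog's real class names: a dedicated, bounded pool
// for index optimization so force merges never block the thread that drives
// message processing and index rotation.
public class IndexOptimizationDispatcher {

    // Pool size would correspond to something like elasticsearch_index_optimization_jobs,
    // deliberately smaller than elasticsearch_max_total_connections_per_route.
    private final ExecutorService optimizationPool = Executors.newFixedThreadPool(10);

    /** Submit the (potentially hours-long) force merge and return immediately. */
    public Future<?> optimizeAsync(Runnable forceMergeCall) {
        // The merge still takes as long as it takes on the Elasticsearch side,
        // but the caller is no longer blocked while it runs.
        return optimizationPool.submit(forceMergeCall);
    }
}
```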
- Graylog Version: 4.2.x
- Elasticsearch Version: 7.10
See customer ticket HS-666683796 for example.
Notes and ideas:
- The force-merge operation runs in one thread only on the Elasticsearch side ("For force merge operations, the thread pool type is fixed with a size of 1 and an unbounded queue size.")
- Force-merge blocks the client thread during a force merge. Calls to this API block until the merge is complete. If the client connection is lost before completion then the force merge process will continue in the background. Any new requests to force merge the same indices will also block until the ongoing force merge is complete.
- `elasticsearch_max_total_connections_per_route = 20` (graylog.conf): does the default number of connections per route (roughly per server) cause trouble when many force-merge requests are triggered? How many indices are force-merged for the customer?
- Possibility to force-merge several indices in one request, which would not block more client threads (see the sketch below).
- Triggering the force-merge request as an async call.
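A hedged sketch of the last two ideas combined, using the Elasticsearch 7.x high-level REST client (Graylog's own client wrapper may differ; the method below is illustrative only). Several indices go into one `ForceMergeRequest`, and the async variant returns immediately, so no client thread sits blocked. Note that the underlying HTTP connection is still held until the merge finishes, so it still counts against `elasticsearch_max_total_connections_per_route`.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.admin.indices.forcemerge.ForceMergeRequest;
import org.elasticsearch.action.admin.indices.forcemerge.ForceMergeResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class AsyncForceMergeExample {

    /** Force-merge several indices in a single, non-blocking request. */
    static void forceMergeAsync(RestHighLevelClient client, String... indices) {
        // One request covers all indices, so only one client connection is used.
        ForceMergeRequest request = new ForceMergeRequest(indices);
        request.maxNumSegments(1);

        // Async call: the listener fires when the merge completes; the calling
        // thread is free to continue processing messages in the meantime.
        client.indices().forcemergeAsync(request, RequestOptions.DEFAULT,
                new ActionListener<ForceMergeResponse>() {
                    @Override
                    public void onResponse(ForceMergeResponse response) {
                        System.out.println("Force merge finished: " + response.getStatus());
                    }

                    @Override
                    public void onFailure(Exception e) {
                        e.printStackTrace();
                    }
                });
    }

    public static void main(String[] args) {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));
        forceMergeAsync(client, "graylog_42", "graylog_43"); // index names are placeholders
    }
}
```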
The `elasticsearch_index_optimization_jobs` setting is by default set to 20, the same as `elasticsearch_max_total_connections_per_route`. This means that, by default, the optimization jobs are allowed to consume all of the Elasticsearch client threads.
:point_up: related to the `elasticsearch_max_total_connections_per_route` configuration.
@todvora In the graphs you posted, does "10 threads, forcemerge enabled" mean `elasticsearch_index_optimization_jobs = 10` while `elasticsearch_max_total_connections_per_route = 20`?
@boosty I am experimenting with one index only, so one optimization job is running every few minutes. The threads count means `elasticsearch_max_total_connections_per_route`, sorry for the confusion!
I haven't changed the `elasticsearch_index_optimization_jobs` value, since I am only testing on one index and this value refers to the number of concurrently running optimizations.
@boosty @todvora do we have any update on this ticket? I have another customer experiencing the same behavior.
The solution is to set `elasticsearch_index_optimization_jobs` lower than `elasticsearch_max_total_connections_per_route`. This can be done in the server config. For example:
- `elasticsearch_max_total_connections = 200` (Graylog's default value)
- `elasticsearch_max_total_connections_per_route = 20` (Graylog's default value)
- `elasticsearch_index_optimization_jobs = 10` (Graylog's default is 20, but we reduce this so the optimization jobs do not block other ES calls)
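Put together, the relevant part of graylog.conf would then look like this (values taken from the example above):

```
elasticsearch_max_total_connections = 200
elasticsearch_max_total_connections_per_route = 20
elasticsearch_index_optimization_jobs = 10
```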
Hi guys, following up here: in later versions of Graylog, will we change this default? `elasticsearch_index_optimization_jobs = 10` seems like a solid fix.
@tellistone Yes, I think the default for `elasticsearch_index_optimization_jobs` should be changed to 10.
@todvora Since you are assigned to this ticket, could you take care of this?