graylog2-server
Optimizing indices after index rotation blocks master node's ingestion
Optimizing indices after index rotation blocks the master node's ingestion until completed. This is not a scalable behavior.
If the Graylog deployment has a high ingestion volume and index rotation is set to P1D (daily), the Graylog master node can stop processing messages for hours. The Graylog master does not communicate this to the load balancer, and as a result its buffers fill up.
Expected Behavior
The Graylog master node can continue to process messages during Elasticsearch index optimization.
Current Behavior
The Graylog master node cannot process messages during Elasticsearch index optimization.
Possible Solution
Allow the master node to continue ingesting during index rotation. Shift force merge requests to a different thread pool that does not block ingestion.
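A minimal sketch of that idea (class and method names are hypothetical, not Graylog's actual code): dispatch force-merge calls onto a small, dedicated executor so the thread that drives message processing never waits on a merge.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not Graylog's real class names: a dedicated, bounded pool
// for index optimization so force merges never block the thread that drives
// message processing and index rotation.
public class IndexOptimizationDispatcher {

    // Pool size would correspond to something like elasticsearch_index_optimization_jobs,
    // deliberately smaller than elasticsearch_max_total_connections_per_route.
    private final ExecutorService optimizationPool = Executors.newFixedThreadPool(10);

    /** Submit the (potentially hours-long) force merge and return immediately. */
    public Future<?> optimizeAsync(Runnable forceMergeCall) {
        // The merge still takes as long as it takes on the Elasticsearch side,
        // but the caller is no longer blocked while it runs.
        return optimizationPool.submit(forceMergeCall);
    }
}
```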
- Graylog Version: 4.2.x
- Elasticsearch Version: 7.10
See customer ticket HS-666683796 for example.
Notes and ideas:
- The force-merge operation runs in one thread only on the Elasticsearch side ("For force merge operations, the thread pool type is fixed with a size of 1 and an unbounded queue size.")
- Force-merge blocks the client thread during a force merge. Calls to this API block until the merge is complete. If the client connection is lost before completion then the force merge process will continue in the background. Any new requests to force merge the same indices will also block until the ongoing force merge is complete.
- `elasticsearch_max_total_connections_per_route = 20` (graylog.conf): does the default number of connections per route (roughly per server) cause trouble when many force-merge requests are triggered? How many indices are force-merged for the customer?
- Possibility to force-merge several indices in one request, which would not block more client threads (see the sketch below).
- Triggering the force-merge request as an async call.
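A hedged sketch of the last two ideas combined, using the Elasticsearch 7.x high-level REST client (Graylog's own client wrapper may differ; the method below is illustrative only). Several indices go into one `ForceMergeRequest`, and the async variant returns immediately, so no client thread sits blocked. Note that the underlying HTTP connection is still held until the merge finishes, so it still counts against `elasticsearch_max_total_connections_per_route`.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.admin.indices.forcemerge.ForceMergeRequest;
import org.elasticsearch.action.admin.indices.forcemerge.ForceMergeResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class AsyncForceMergeExample {

    /** Force-merge several indices in a single, non-blocking request. */
    static void forceMergeAsync(RestHighLevelClient client, String... indices) {
        // One request covers all indices, so only one client connection is used.
        ForceMergeRequest request = new ForceMergeRequest(indices);
        request.maxNumSegments(1);

        // Async call: the listener fires when the merge completes; the calling
        // thread is free to continue processing messages in the meantime.
        client.indices().forcemergeAsync(request, RequestOptions.DEFAULT,
                new ActionListener<ForceMergeResponse>() {
                    @Override
                    public void onResponse(ForceMergeResponse response) {
                        System.out.println("Force merge finished: " + response.getStatus());
                    }

                    @Override
                    public void onFailure(Exception e) {
                        e.printStackTrace();
                    }
                });
    }

    public static void main(String[] args) {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));
        forceMergeAsync(client, "graylog_42", "graylog_43"); // index names are placeholders
    }
}
```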
The `elasticsearch_index_optimization_jobs` setting is by default set to 20, the same as `elasticsearch_max_total_connections_per_route`. This means that, by default, the optimization jobs are allowed to consume all of the Elasticsearch client threads.
:point_up: related to the `elasticsearch_max_total_connections_per_route` configuration.
@todvora In the graphs you posted, does "10 threads, forcemerge enabled" mean `elasticsearch_index_optimization_jobs = 10` while `elasticsearch_max_total_connections_per_route = 20`?
@boosty I am experimenting with one index only, so one optimization job is running every few minutes. The threads count means `elasticsearch_max_total_connections_per_route`, sorry for the confusion!
I haven't changed the `elasticsearch_index_optimization_jobs` value, since I am only testing on one index and this value refers to the number of concurrently running optimizations.
@boosty @todvora do we have any update on this ticket? I have another customer experiencing the same behavior.
The solution is to set `elasticsearch_index_optimization_jobs` lower than `elasticsearch_max_total_connections_per_route`. This can be done in the server config. For example:
- `elasticsearch_max_total_connections = 200` (Graylog's default value)
- `elasticsearch_max_total_connections_per_route = 20` (Graylog's default value)
- `elasticsearch_index_optimization_jobs = 10` (Graylog's default is 20, but we reduce this so the optimization jobs do not block other ES calls)
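Put together, the relevant part of graylog.conf would then look like this (values taken from the example above):

```
elasticsearch_max_total_connections = 200
elasticsearch_max_total_connections_per_route = 20
elasticsearch_index_optimization_jobs = 10
```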
Hi guys, following up here: in later versions of Graylog, will we change this default? `elasticsearch_index_optimization_jobs = 10` seems like a solid fix.
@tellistone Yes, I think the default for `elasticsearch_index_optimization_jobs` should be changed to 10.
@todvora Since you are assigned to this ticket, could you take care of this?