Operations to system indices should always use system threadpools
System threadpools are meant to be used for operations on system indices. For example, the system_critical_write threadpool should be used for writing to the .security and .security-tokens indices. However, the threadpool switching happens at the shard level. At the index (coordinating) level, these operations still share the same threadpool as operations on non-system indices.
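For illustration, here is a minimal sketch of that intended mapping, assuming a hypothetical selectWriteExecutor helper; the pool names are the real constants from org.elasticsearch.threadpool.ThreadPool.Names, but this is not the actual ExecutorSelector logic:

```java
// Minimal sketch of the intended per-index pool mapping; selectWriteExecutor
// is a hypothetical helper, not the actual ExecutorSelector API.
import org.elasticsearch.threadpool.ThreadPool;

class WritePoolSelection {
    static String selectWriteExecutor(boolean isSystemIndex, boolean isCritical) {
        if (isSystemIndex == false) {
            return ThreadPool.Names.WRITE;               // ordinary data indices
        }
        return isCritical
            ? ThreadPool.Names.SYSTEM_CRITICAL_WRITE     // e.g. .security, .security-tokens
            : ThreadPool.Names.SYSTEM_WRITE;             // other system indices
    }
}
```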
For heavy ingestion use cases, e.g. Fleet, if the write threadpool gets saturated, it leads to 429 rejection errors for system-critical writes. For example, when the write threadpool is saturated, users won't be able to create or invalidate API keys or OAuth2 tokens. A sample rejection error is as follows:
```
[es_rejected_execution_exception: [es_rejected_execution_exception] Reason: rejected execution of org.elasticsearch.action.bulk.TransportBulkAction$1/org.elasticsearch.action.ActionListener$RunBeforeActionListener/org.elasticsearch.action.ActionListener$DelegatingFailureActionListener/org.elasticsearch.action.support.ContextPreservingActionListener/org.elasticsearch.tasks.TaskManager$1{SafelyWrappedActionListener[listener=WrappedActionListener{org.elasticsearch.action.bulk.TransportSingleItemBulkWriteAction$$Lambda$8846/0x00000008020a4f58@50921e21}{org.elasticsearch.action.bulk.TransportSingleItemBulkWriteAction$$Lambda$8849/0x00000008020a5378@63f4b3b9}]}{Task{id=1055264, type='transport', action='indices:data/write/bulk', description='requests[1], indices[.security-tokens]', parentTask=unset, startTime=1658399095698, startTimeNanos=21774904411368219}}/org.elasticsearch.xpack.security.action.filter.SecurityActionFilter$$Lambda$6237/0x0000000801d90e58@61953e36/org.elasticsearch.action.bulk.TransportBulkAction$$Lambda$8017/0x0000000801f9d000@26a0d988 on EsThreadPoolExecutor[name = instance-0000000001/write, queue capacity = 10000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6ed3e9a8[Running, pool size = 3, active threads = 3, queued tasks = 10000, completed tasks = 429077]]]
```
This should not happen, because the system critical threadpool was introduced precisely to avoid it. Though the above example is about the .security-tokens index and the system_critical_write threadpool, it is reasonable to believe this issue applies to all system indices and system threadpools.
Pinging @elastic/es-core-infra (Team:Core/Infra)
I wonder if the fix in the transport bulk action should be to fork conditionally based on the current thread (i.e. don't fork if we're already executing on system_critical_write), with the expectation that the code calling this would fork before calling the transport bulk action. That feels less expensive than iterating through all the bulk requests and comparing them to the known system indices to decide which threadpool to fork to, and it keeps the forking logic closer to the actual usage.
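A rough sketch of what that conditional fork could look like, under the assumption that the current pool can be recognized from the thread name (Elasticsearch worker threads are named like "elasticsearch[node][system_critical_write][T#1]"); dispatch and executeBulk are stand-ins, not the real transport code:

```java
// Rough sketch, not the real TransportBulkAction fix: skip the fork when we
// are already on the critical system pool. Detecting the pool via the thread
// name is a simplification for illustration.
import org.elasticsearch.threadpool.ThreadPool;

class ConditionalFork {
    static void dispatch(ThreadPool threadPool, Runnable executeBulk) {
        boolean onCriticalPool = Thread.currentThread().getName()
            .contains("[" + ThreadPool.Names.SYSTEM_CRITICAL_WRITE + "]");
        if (onCriticalPool) {
            // The caller already forked before invoking the bulk action.
            executeBulk.run();
        } else {
            threadPool.executor(ThreadPool.Names.WRITE).execute(executeBulk);
        }
    }
}
```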
We talked about this issue at the core/infra sync, and our naive thought was that this was an oversight. Looking at the code, though, I can see why using an ExecutorSelector here would get messy. We already look up indices for the request in a sorted map and find whether they are system indices or not. If we wanted to select between the system_write and system_critical_write thread pools, we would have to take the names of those system indices and look up which thread pool each index is supposed to use, then decide which thread pool to use for the bulk request. If we followed the current behavior, we'd choose the "least critical" thread pool.
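To make the messiness concrete, a hypothetical "least critical" reduction over the indices of a bulk request might look like the sketch below, where criticalityOf stands in for the per-index SystemIndexDescriptor lookup; none of these helpers are the real ExecutorSelector API:

```java
// Illustrative only: reduce the indices of a bulk request to the least
// critical write pool among them. criticalityOf is a hypothetical stand-in
// for the descriptor lookup described above.
import java.util.List;
import org.elasticsearch.threadpool.ThreadPool;

class LeastCriticalPool {
    // Ordered from least to most critical.
    private static final List<String> ORDER = List.of(
        ThreadPool.Names.WRITE,
        ThreadPool.Names.SYSTEM_WRITE,
        ThreadPool.Names.SYSTEM_CRITICAL_WRITE);

    static String leastCriticalWritePool(List<String> indices) {
        String result = ThreadPool.Names.SYSTEM_CRITICAL_WRITE;
        for (String index : indices) {
            String pool = criticalityOf(index);
            if (ORDER.indexOf(pool) < ORDER.indexOf(result)) {
                result = pool; // downgrade to the least critical pool seen
            }
        }
        return result;
    }

    // Hypothetical stand-in for looking up the descriptor's thread pool.
    static String criticalityOf(String index) {
        if (index.startsWith(".security")) return ThreadPool.Names.SYSTEM_CRITICAL_WRITE;
        if (index.startsWith(".")) return ThreadPool.Names.SYSTEM_WRITE;
        return ThreadPool.Names.WRITE;
    }
}
```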
If we were forking the thread, that would happen when a system feature creates a bulk transport request directly, right? For example, ApiKeyService#createApiKeyAndIndexIt would fork the thread before calling executeAsyncWithOrigin?
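Something like this hedged sketch, where forkThenIndex is a hypothetical helper and the Runnable would wrap the ClientHelper#executeAsyncWithOrigin call:

```java
// Hedged sketch, not the real ApiKeyService code: fork onto the critical
// system pool first, then run the write. forkThenIndex is hypothetical.
import org.elasticsearch.threadpool.ThreadPool;

class ForkBeforeWrite {
    static void forkThenIndex(ThreadPool threadPool, Runnable bulkWriteWithOrigin) {
        threadPool.executor(ThreadPool.Names.SYSTEM_CRITICAL_WRITE).execute(bulkWriteWithOrigin);
    }
}
```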
If we did that, then any external request to the bulk endpoint would not be able to use system threadpools, right? I don't know exactly how Fleet behaves, but, for example, if Fleet writes to its system indices with a REST call to the bulk endpoint, would everything go to the WRITE threadpool?
> for example, if Fleet writes to its system indices with a REST call to the bulk endpoint
Why would Fleet (or any other external client) write to system indices directly via the normal write path? Shouldn't any request that results in writes to system indices go through dedicated endpoints? A dedicated endpoint would allow us to disambiguate (or pre-fork) which threadpool to use.
Fleet and Kibana both use normal APIs to access their system resources. We call this kind of system index an "external" system index. From the Javadoc on SystemIndexDescriptor.Type:
> System indices can also belong to features outside of Elasticsearch that may be part of other Elastic stack components. These are external system indices as the intent is for these to be accessed via normal APIs with a special value.
We can detect this case, albeit unreliably, if the ThreadContext is passed around correctly: there's a ThreadContext header called _external_system_index_access_origin that contains the product origin, and we could use that to divert to system thread pools. However, this header can be faked in an HTTP request with X-elastic-product-origin.
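As a hedged sketch of that idea (the header name is copied verbatim from above; selectPool is a hypothetical decision point, while ThreadContext#getHeader is the real API):

```java
// Sketch only: divert to a system pool based on the origin header. The value
// ultimately derives from the client-supplied X-elastic-product-origin HTTP
// header, which can be faked, so this is unreliable by construction.
import org.elasticsearch.common.util.concurrent.ThreadContext;
import org.elasticsearch.threadpool.ThreadPool;

class OriginBasedSelection {
    static String selectPool(ThreadContext threadContext) {
        String origin = threadContext.getHeader("_external_system_index_access_origin");
        return origin != null ? ThreadPool.Names.SYSTEM_WRITE : ThreadPool.Names.WRITE;
    }
}
```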
Closing this since it should be fixed by #106150. I do not fully follow the conversation here, but I assume it is because this refers to older versions of the code. Please reopen if you think it is not fully solved by the fix.