elasticsearch
Uneven distribution of docs across shards, even with auto-generated ids
Elasticsearch Version
7.17.15
Installed Plugins
No response
Java Version
bundled
OS Version
Ubuntu 20.04.6 LTS
Problem Description
We've been running several parallel processes that all send bulk-index requests to the same index. The documents from a single process now seem to be very unevenly distributed across our shards.
Looking at one part, I find that GET indexname/_count?preference=_shards:<n> (one request per shard number n) gives results ranging from 2215 to 143810 documents on a single shard.
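The per-shard check above can be scripted. The following is a minimal sketch, not the exact procedure used here: it only builds the request paths (shard numbers start at 0) and summarizes counts that have already been collected. The helper names `shard_count_paths` and `summarize_skew` are hypothetical, and the two counts in the usage note are just the extremes quoted above, not the full per-shard list.

```python
from typing import Dict, List


def shard_count_paths(index: str, num_shards: int) -> List[str]:
    """Build one _count request path per primary shard (shards are numbered from 0)."""
    return [f"/{index}/_count?preference=_shards:{n}" for n in range(num_shards)]


def summarize_skew(counts: Dict[int, int]) -> Dict[str, float]:
    """Summarize how far the per-shard document counts are from an even split."""
    values = list(counts.values())
    total = sum(values)
    smallest = min(values)
    largest = max(values)
    return {
        "total": total,
        "mean": total / len(values),
        "min": smallest,
        "max": largest,
        # Ratio between the fullest and emptiest shard; 1.0 means perfectly even.
        "max_over_min": largest / smallest if smallest else float("inf"),
    }


if __name__ == "__main__":
    for path in shard_count_paths("indexname", 20):
        print("GET", path)
    # Only the two extremes reported in this issue, keyed by placeholder shard numbers.
    print(summarize_skew({0: 2215, 1: 143810}))
```

With the two extremes from this report, the fullest shard holds roughly 65 times as many of these documents as the emptiest one.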
Steps to Reproduce
Index creation
PUT myindex
{
  "settings": {
    "number_of_shards": 20,
    "number_of_replicas": 1,
    "refresh_interval": "300s",
    "routing": {
      "allocation": {
        "include": {
          "_tier_preference": "data_warm,data_hot"
        }
      }
    }
  }
}
Bulk indexing
Spin up 6 different .NET processes, all of which use the NEST client to bulk-index documents:
ElasticClient = new ElasticClient(connectionSettings);
var results = ElasticClient.BulkAll(objects, b => b
    .Index(myindex)
    .BufferToBulk((descriptor, list) =>
    {
        foreach (var obj in list)
            descriptor.Index(i => i.Document(obj)); // no explicit id, so it is auto-generated
    })
    .RefreshOnCompleted(false)
    .MaxDegreeOfParallelism(4)
    .Size(10));
Expected behavior
An even distribution of all documents across the shards, which also means that the documents from process 1 are evenly spread, the documents from process 2 are evenly spread, etc.
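To make this expectation concrete: the Elasticsearch documentation describes routing as shard_num = (hash(_routing) % num_routing_shards) / routing_factor, where routing_factor = num_routing_shards / num_primary_shards and _routing defaults to the document _id. The sketch below applies that formula under loud assumptions: MD5 stands in for the real hash (Elasticsearch uses Murmur3), the ids are made-up strings rather than real auto-generated ids, and 640 routing shards is an assumed value for a 20-shard index. It only shows what an even spread would look like with a well-mixing hash; it does not reproduce the behavior reported here.

```python
import hashlib
from collections import Counter


def shard_for(routing: str, num_primary_shards: int, num_routing_shards: int) -> int:
    """Apply the documented routing formula with MD5 as a stand-in hash (not Murmur3)."""
    digest = hashlib.md5(routing.encode("utf-8")).digest()
    hashed = int.from_bytes(digest[:4], "big")  # non-negative 32-bit value
    routing_factor = num_routing_shards // num_primary_shards
    return (hashed % num_routing_shards) // routing_factor


def simulate(num_docs: int, num_primary_shards: int = 20,
             num_routing_shards: int = 640) -> Counter:
    """Route hypothetical ids 'doc-0', 'doc-1', ... and count documents per shard."""
    counts: Counter = Counter()
    for i in range(num_docs):
        counts[shard_for(f"doc-{i}", num_primary_shards, num_routing_shards)] += 1
    return counts


if __name__ == "__main__":
    counts = simulate(20000)
    for shard in sorted(counts):
        print(shard, counts[shard])
```

With 20000 hypothetical documents and 20 shards, each shard should receive close to 1000 documents, which is the kind of spread expected for each individual process as well.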
Observed behavior
When looking at all documents together, the spread is reasonable, but when looking only at the documents from a single process, they disproportionately end up on a few shards.
Logs (if relevant)
No response