Uneven distribution of docs across shards, even with auto-generated ids

Open EmilBode opened this issue 4 months ago • 2 comments

Elasticsearch Version

7.17.15

Installed Plugins

No response

Java Version

bundled

OS Version

Ubuntu 20.04.6 LTS

Problem Description

We've been running several parallel processes that all send bulk index requests to the same index. The documents from a single process now seem to be very unevenly distributed across our shards. Looking at one subset of the documents, GET indexname/_count?preference=_shards:<n>, run once per shard number <n>, gives results ranging from 2215 to 143810 documents on a single shard.
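The per-shard check described above can be scripted. The following is only a sketch using the Python standard library: the base URL, the helper names, and the optional query body (for example a filter on whatever field identifies the indexing process, which is not shown in this report) are assumptions for illustration.

```python
import json
import urllib.request


def shard_count_url(base_url, index, shard):
    """Build the _count URL restricted to one shard via the preference parameter."""
    return f"{base_url}/{index}/_count?preference=_shards:{shard}"


def fetch_shard_counts(base_url, index, num_shards, body=None):
    """Return one document count per shard number 0..num_shards-1.

    body is an optional _count request body (hypothetical here), e.g. a query
    that selects only the documents written by one process.
    """
    counts = []
    data = json.dumps(body).encode() if body is not None else None
    for shard in range(num_shards):
        request = urllib.request.Request(
            shard_count_url(base_url, index, shard),
            data=data,  # with a body, urllib sends POST, which _count accepts
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            counts.append(json.load(response)["count"])
    return counts


def skew_ratio(counts):
    """Ratio of the fullest shard to the emptiest one (inf if a shard is empty)."""
    lowest = min(counts)
    return float("inf") if lowest == 0 else max(counts) / lowest
```

For the two extremes reported here, skew_ratio([2215, 143810]) comes out at roughly 65, i.e. the fullest shard holds about 65 times as many of these documents as the emptiest one.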

Steps to Reproduce

Index creation

PUT myindex
{
  "settings": {
    "number_of_shards": 20,
    "number_of_replicas": 1,
    "refresh_interval": "300s",
    "routing": {
      "allocation": {
        "include": {
          "_tier_preference": "data_warm,data_hot"
        }
      }
    }
  }
}

Bulk indexing

Spin up 6 different .NET projects that all use the NEST client to bulk-index documents:

// MyDocument is a placeholder for the type of the indexed objects
ElasticClient = new ElasticClient(connectionSettings);
var results = ElasticClient.BulkAll(objects, b => b
    .Index(myindex)
    .BufferToBulk((descriptor, list) =>
    {
        foreach (var obj in list)
            descriptor.Index<MyDocument>(i => i.Document(obj));
    })
    .RefreshOnCompleted(false)
    .MaxDegreeOfParallelism(4)
    .Size(10));

Expected behavior

Even distribution of all documents, also meaning the documents from process 1 are evenly spread, docs from process 2 are evenly spread, etc.

Observed behavior

Looking at all documents together, the spread is reasonably even, but the documents from a single process disproportionately end up on a few shards.

Logs (if relevant)

No response

EmilBode, Feb 16 '24 12:02