
storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: There is no transfer for shard X from N to M

yasha-dev1 opened this issue 1 year ago • 6 comments

Some time after deployment, there appears to be a bug with Raft consensus in Qdrant. I am using custom sharding in Qdrant with 50 shards, a replication factor of 3, and 3 nodes. After indexing about 700K vectors, the nodes seem unable to sync. I observe the following errors:

qdrant 2024-02-17T08:49:33.620466Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: There is no transfer for shard 23 from 163633349341480 to 5558157984746553                               

qdrant 2024-02-17T08:49:33.688112Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: There is no transfer for shard 5 from 163633349341480 to 5558157984746553                                   

qdrant 2024-02-17T08:49:33.740899Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: There is no transfer for shard 49 from 163633349341480 to 5558157984746553                                  

qdrant 2024-02-17T08:49:33.811285Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: There is no transfer for shard 18 from 163633349341480 to 5558157984746553                                  

 qdrant 2024-02-17T08:49:33.869499Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: There is no transfer for shard 4 from 163633349341480 to 5558157984746553                                   

qdrant 2024-02-17T08:49:33.934505Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: There is no transfer for shard 30 from 163633349341480 to 5558157984746553                                  

qdrant 2024-02-17T08:49:33.986275Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: Shard 26 is already involved in transfer 163633349341480 -> 7098510233670791                                

qdrant 2024-02-17T08:49:34.038138Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: There is no transfer for shard 40 from 163633349341480 to 5558157984746553                                  

qdrant 2024-02-17T08:49:34.092200Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: There is no transfer for shard 23 from 163633349341480 to 5558157984746553                                  

The above error can be seen on one of the nodes, and the other nodes seem unable to synchronize with the aforementioned node, hence the following error on the other nodes:

2024-02-17T08:52:24.668625Z ERROR qdrant::consensus: Failed to forward message Message { msg_type: MsgAppend, to: 163633349341480, from: 7098510233670791, term: 1088, log_term: 992, index: 237286, entries: [Entry { entry_type: EntryNormal, term: 992, index: 237287, data: [161, 110, 67, 111, 108, 108, 101, 99, 116, 105, 111, 110, 77, 101, 116, 97, 161, 110, 116, 114, 97, 110, 115, 102, 101, 114, 95, 115, 104, 97, 114, 100, 130, 107, 117, 114, 108, 115, 108, 97, 98, 95, 98, 111, 116, 161, 101, 65, 98, 111, 114, 116, 162, 104, 116, 114, 97, 110, 115, 102, 101, 114, 163, 104, 115, 104, 97, 114, 100, 95, 105, 100, 24, 30, 100, 102, 114, 111, 109, 27, 0, 0, 148, 210, 219, 169, 49, 40, 98, 116, 111, 27, 0, 19, 191, 29, 128, 73, 76, 57, 102, 114, 101, 97, 115, 111, 110, 111, 116, 114, 97, 110, 115, 102, 101, 114, 32, 102, 97, 105, 108, 101, 100], context: [], sync_log: false }], commit: 338981, commit_term: 0, snapshot: None, request_snapshot: 0, reject: false, reject_hint: 0, context: [], deprecated_priority: 0, priority: 0 } to message sender task 163633349341480: message sender task queue is full. Message will be dropped.    

Current Behavior

The cluster nodes cannot synchronise after a certain amount of indexing.

Steps to Reproduce

  1. Create a collection with custom sharding and 50 shards
  2. Index 700K or more vectors
  3. The nodes cannot synchronise and the above warnings are thrown; however, the pods in K8s don't die

The following is my collection config:

{
  "params": {
    "vectors": {
      "size": 768,
      "distance": "Cosine"
    },
    "shard_number": 1,
    "sharding_method": "custom",
    "replication_factor": 3,
    "write_consistency_factor": 1,
    "on_disk_payload": true
  },
  "hnsw_config": {
    "m": 16,
    "ef_construct": 100,
    "full_scan_threshold": 10000,
    "max_indexing_threads": 0,
    "on_disk": false
  },
  "optimizer_config": {
    "deleted_threshold": 0.2,
    "vacuum_min_vector_number": 1000,
    "default_segment_number": 0,
    "max_segment_size": null,
    "memmap_threshold": 20000,
    "indexing_threshold": 20000,
    "flush_interval_sec": 5,
    "max_optimization_threads": 1
  },
  "wal_config": {
    "wal_capacity_mb": 32,
    "wal_segments_ahead": 0
  },
  "quantization_config": {
    "scalar": {
      "type": "int8",
      "always_ram": true
    }
  }
}

and the output of /cluster:

{
  "result": {
    "status": "enabled",
    "peer_id": 7098510233670791,
    "peers": {
      "7098510233670791": {
        "uri": "http://qdrant-0.qdrant-headless:6335/"
      },
      "5558157984746553": {
        "uri": "http://qdrant-1.qdrant-headless:6335/"
      },
      "163633349341480": {
        "uri": "http://qdrant-2.qdrant-headless:6335/"
      }
    },
    "raft_info": {
      "term": 1088,
      "commit": 339032,
      "pending_operations": 0,
      "leader": 7098510233670791,
      "role": "Leader",
      "is_voter": true
    },
    "consensus_thread_status": {
      "consensus_thread_status": "working",
      "last_update": "2024-02-17T08:55:57.214602900Z"
    },
    "message_send_failures": {}
  },
  "status": "ok",
  "time": 0.000007711
}

Expected Behavior

Consensus works without errors.

Context (Environment)

Using Helm chart with v1.7.4 qdrant image

yasha-dev1 avatar Feb 17 '24 08:02 yasha-dev1

Might it be the case that 50 shards is a lot for 3 Qdrant nodes to handle? If so, how many shards are recommended? As stated in the docs, it's really hard to add new shards later, since they have to be moved manually, so we wanted to create just the needed number of shards at the start to avoid adding new shards in the near future.

yasha-dev1 avatar Feb 17 '24 09:02 yasha-dev1

Yes, it seems like 150 shards (50 shards × replication factor 3) is too much.

generall avatar Feb 17 '24 10:02 generall

Thanks for the quick reply. What is the recommended number of shards for 3 nodes and 10M vectors? Maybe it's also worth having a section about this in the docs. I will start with 10 shards (so 30 overall, counting the replication factor) and will update this issue.

yasha-dev1 avatar Feb 17 '24 10:02 yasha-dev1

For 10M vectors, 1 shard is enough. Sharding is designed to address the scaling problem. Having 2 shards on one machine would allow you to scale to 2x the machines in the future relatively easily.

generall avatar Feb 17 '24 11:02 generall

But you are also using custom sharding, which means that you need to explicitly place your data into one or another shard. May I ask which criteria you are using for this placement?

generall avatar Feb 17 '24 11:02 generall

There is a crawler which indexes documents associated with domains and URLs. The sharding is done by hash_fn(domainHost) % TOTAL_NUMBER_OF_SHARDS, so it is guaranteed that a constant number of shards exists.
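For clarity, a minimal sketch of that placement scheme (the function name shard_for and the digest choice are illustrative; the actual crawler's hash_fn is not shown in this thread):

import hashlib

# Fixed shard count chosen at collection creation time (50 in this setup).
TOTAL_NUMBER_OF_SHARDS = 50

def shard_for(domain_host: str) -> int:
    """Map a domain host to a stable shard number.

    A deterministic digest (here SHA-1) is used instead of Python's built-in
    hash(), which is randomized per process and would break the mapping
    between crawler restarts.
    """
    digest = hashlib.sha1(domain_host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % TOTAL_NUMBER_OF_SHARDS

# All documents of one domain always land in the same shard.
assert shard_for("example.com") == shard_for("example.com")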

yasha-dev1 avatar Feb 18 '24 08:02 yasha-dev1

Qdrant crashes with this message at least once a day. When this event happens we see the following: 2 of 3 nodes are at 100% CPU and overloaded (even logs are not shown), and one node is accessible but overloaded; we see messages like these: Screenshot from 2024-02-20 05-46-30

This is what performance looks like on the crashed nodes: Screenshot from 2024-02-20 06-02-57 Screenshot from 2024-02-20 06-02-52 Screenshot from 2024-02-20 06-02-46

Our server setup: 3 pods, 8 GB RAM, 4 CPUs

The cluster never recovers after the crash; it is not enough to restart a single pod or restart them one by one - we always need to drop all pods and restart them at the same time. After the restart it takes many minutes (20+) to recover.

Collection:

  • 20 shards with 3 replicas (each server should hold one copy of each shard)
  • vector length 768
  • about 800k documents
  • we store just 10 metadata fields with string data shorter than 200 characters (no file contents or similar)
  • we upsert documents and delete old documents
  • custom sharding; we hash the domain name into shards (the number of documents per shard is not equal)

vzeman avatar Feb 20 '24 05:02 vzeman

Today Qdrant crashed at ~23:35; here are the Grafana charts if it helps: Screenshot from 2024-02-22 06-48-15

vzeman avatar Feb 22 '24 05:02 vzeman

Could you elaborate on what Qdrant version you're running?

timvisee avatar Feb 22 '24 09:02 timvisee

Version 1.7.4. We are going to try changing some configurations; the first try will be to decrease memmap_threshold to 10k.

We'll update you if we find any combination of settings that doesn't crash every few hours.

vzeman avatar Feb 22 '24 09:02 vzeman

If we assume that the problem is in (automated) shard transfers, then I don't think changing memory mapping configurations would help here.

In these periods of high CPU usage, are you sure the node isn't just optimizing segments? You would be able to see this from the collection status: if it's yellow, it means it's optimizing.
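For example, a quick way to check this (a sketch, assuming the default REST port 6333 and a placeholder collection name):

import json
from urllib.request import urlopen

# Hypothetical node address and collection name; adjust to your deployment.
QDRANT_URL = "http://localhost:6333"
COLLECTION = "my_collection"

with urlopen(f"{QDRANT_URL}/collections/{COLLECTION}") as resp:
    info = json.load(resp)

# "green" means the collection is idle, "yellow" means it is optimizing.
print(info["result"]["status"])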

timvisee avatar Feb 22 '24 09:02 timvisee

@timvisee, can we add a metric to know which optimizer is working at a given time? As I understand from the docs, Qdrant uses three different optimizers:

  1. Vacuum Optimizer for delete operations
  2. Merge Optimizer for update operations
  3. Indexing Optimizer for indexing vectors

According to the request logs, it seems that indexing operations were taking place at the time of the incident and afterwards (see attached screenshot).

However, that doesn't tell us which optimizer is causing the high CPU consumption. As for the status, it was green, and the pods don't even get restarted.

yasha-dev1 avatar Feb 22 '24 09:02 yasha-dev1

I've seen this being asked before, so it might be good to implement this based on user demand. We'll decide on this internally.

Information about optimizers is already exposed however. It is not in Prometheus format at this time, but it is listed at an HTTP endpoint.

Simply fetch this; note that the URL parameter is important here:

GET /telemetry?details_level=3

In the response it lists the optimization status per shard:

                "optimizations": {
                  "status": "ok",
                  "optimizations": {
                    "count": 0,
                    "fail_count": 2,
                    "last_responded": "2024-02-22T10:21:23.900Z"
                  },
                  "log": [
                    {
                      "name": "indexing",
                      "segment_ids": [
                        7510136630065649000,
                        491536206735816100
                      ],
                      "status": "done",
                      "start_at": "2024-02-22T10:21:23.856747599Z",
                      "end_at": "2024-02-22T10:21:23.904835842Z"
                    },
                    {
                      "name": "indexing",
                      "segment_ids": [
                        7510136630065649000,
                        491536206735816100
                      ],
                      "status": "optimizing",
                      "start_at": "2024-02-22T10:21:23.850787904Z",
                      "end_at": null
                    }
                  ]
                }

Note that this is per shard, so I can imagine that you'll be having a lot of them. Also, this log of optimizations is not persisted, and is lost when a node restarts.
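As an illustration, a small script could poll this endpoint and list the optimizations that are still running. This is only a sketch, assuming the default REST port 6333; it walks the payload generically rather than relying on the exact nesting:

import json
from urllib.request import urlopen

# Hypothetical node address; point this at the node you want to inspect.
TELEMETRY_URL = "http://localhost:6333/telemetry?details_level=3"

def currently_optimizing(node):
    """Recursively collect optimizer log entries whose status is 'optimizing'."""
    running = []
    if isinstance(node, dict):
        log = node.get("log")
        if isinstance(log, list):
            running += [e for e in log
                        if isinstance(e, dict) and e.get("status") == "optimizing"]
        for value in node.values():
            running += currently_optimizing(value)
    elif isinstance(node, list):
        for value in node:
            running += currently_optimizing(value)
    return running

with urlopen(TELEMETRY_URL) as resp:
    telemetry = json.load(resp)

for entry in currently_optimizing(telemetry):
    print(entry.get("name"), entry.get("segment_ids"), entry.get("start_at"))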

timvisee avatar Feb 22 '24 10:02 timvisee

Update: Qdrant hasn't crashed for more than 24 hours with the following configuration:

    "memmap_threshold": 4000,
    "indexing_threshold": 8000,

(We recently changed the vector size to 1024, and we have 2.4M vectors.)

Even though there was a problem with 2 of the 3 nodes, the cluster eventually recovered from Dead shards.
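For anyone who wants to apply the same thresholds to an existing collection, a minimal sketch (assuming the default REST port 6333 and a placeholder collection name; the collection update API takes these values under optimizers_config, while the collection info shown earlier reports them under optimizer_config):

import json
from urllib.request import Request, urlopen

# Hypothetical node address and collection name; adjust to your deployment.
QDRANT_URL = "http://localhost:6333"
COLLECTION = "my_collection"

# Same thresholds as in the update above.
payload = {
    "optimizers_config": {
        "memmap_threshold": 4000,
        "indexing_threshold": 8000,
    }
}

req = Request(
    f"{QDRANT_URL}/collections/{COLLECTION}",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PATCH",
)
with urlopen(req) as resp:
    print(json.load(resp))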

vzeman avatar Feb 26 '24 04:02 vzeman

Are you not seeing these messages anymore after the changes you've made?

qdrant 2024-02-17T08:49:33.620466Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: There is no transfer for shard 23 from 163633349341480 to 5558157984746553                               

qdrant 2024-02-17T08:49:33.688112Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: There is no transfer for shard 5 from 163633349341480 to 5558157984746553                                   

qdrant 2024-02-17T08:49:33.740899Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: There is no transfer for shard 49 from 163633349341480 to 5558157984746553                                  

qdrant 2024-02-17T08:49:33.811285Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: There is no transfer for shard 18 from 163633349341480 to 5558157984746553                                  

 qdrant 2024-02-17T08:49:33.869499Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: There is no transfer for shard 4 from 163633349341480 to 5558157984746553                                   

qdrant 2024-02-17T08:49:33.934505Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: There is no transfer for shard 30 from 163633349341480 to 5558157984746553                                  

qdrant 2024-02-17T08:49:33.986275Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: Shard 26 is already involved in transfer 163633349341480 -> 7098510233670791                                

qdrant 2024-02-17T08:49:34.038138Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: There is no transfer for shard 40 from 163633349341480 to 5558157984746553                                  

qdrant 2024-02-17T08:49:34.092200Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Bad request: There is no transfer for shard 23 from 163633349341480 to 5558157984746553                

timvisee avatar Feb 26 '24 09:02 timvisee

As we investigated further, these are our findings:

  1. The warnings happen after a spike in memory; after the spike, the Qdrant node tries to recover the shards and we see those warnings. (However, the node in Kubernetes doesn't die, because this doesn't affect the K8s health API until some point in the future. Not sure what happens exactly.) See attached screenshot.

We were able to work around this issue by experimenting with putting all the load on disk:

PATCH collections/<my_collection>
{
  "quantization_config": {
    "scalar": {
      "type": "int8",
      "always_ram": false
    }
  }
}

Now the nodes have been stable for a few days. However, our bulk search latency increased from less than 0.05 s to 5 s.

Do you have any suggestions for using memory more efficiently with memmap while also defining a memory limit? As I see in the docs, the only control a user has over memmap is memmap_threshold, which is the maximum size (in kilobytes) of vectors to store in memory per segment.

Is there any combination of configurations to prevent Qdrant from using more memory and to define a limit on the number of quantized vectors loaded into memory?

yasha-dev1 avatar Mar 04 '24 08:03 yasha-dev1

@yasha-dev1 Thanks for these further details! "always_ram": false already puts the data on disk through memory mapping.

Maybe choosing a different type of quantization with better compression would help. What embedding model are you currently using? Binary quantization might be a good option: https://qdrant.tech/documentation/guides/quantization/#binary-quantization

Is there any combination of configurations to prevent Qdrant from using more memory and to define a limit on the number of quantized vectors loaded into memory?

No. That is basically what memory mapping is used for.

Note that the type of disk you're using has a serious effect on how well memory mapping performs. We recommend against using HDDs, and highly recommend using a local NVMe drive instead.
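For reference, switching the collection to binary quantization would be another quantization_config update. A minimal sketch of the payload, based on the linked documentation and not verified in this thread:

# Hypothetical payload; apply it with the same
# "PATCH collections/<my_collection>" request shown earlier in this thread.
binary_quantization_patch = {
    "quantization_config": {
        "binary": {
            # Binary quantization stores 1 bit per dimension, so the
            # quantized vectors are small enough to keep in RAM.
            "always_ram": True,
        }
    }
}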

timvisee avatar Mar 04 '24 09:03 timvisee

We have considered quantization. We are not planning to use OpenAI or Cohere embedding models; rather, we're using open-source models like multi-lingual e5 Large. So I'm not sure if it gives the same accuracy when using binary quantization.

And we are using NVMe block storage, so yes, it's slower than a local NVMe drive. Have you experimented with the e5 model to measure how much accuracy is lost when using binary quantization?

yasha-dev1 avatar Mar 04 '24 10:03 yasha-dev1

Have you experimented with the e5 model to measure how much accuracy is lost when using binary quantization?

I don't think we have. I'd recommend benchmarking this yourself to ensure your data is compatible. Using binary quantization would compress the quantized data by another factor of 8.

timvisee avatar Mar 04 '24 10:03 timvisee

We can safely close this; feel free to reopen if needed. The underlying problem was memory pressure, which binary quantization and offloading the vectors to disk help to mitigate.

yasha-dev1 avatar Mar 12 '24 09:03 yasha-dev1