Distributed deployment cluster with a single dead shard fails to respond to queries
I am running a distributed deployment of qdrant on Kubernetes with a single collection whose replication factor is set to 2. One shard replica failed and the cluster now fails to respond to queries. I thought the loss of a single replica in this configuration shouldn't be a problem.
Current Behavior
The cluster fails to respond to queries:
2024-02-11 17:28:39 INFO Received input: session_id=62049 query='xxxx xxxxx xxxxx' team_id=1866 file_id=[3074553] sitemap_id=None env_name='xxxxxx'
INFO: 100.100.29.236:39890 - "POST /qdrant/conversation HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/searchie/.local/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 97, in send
raise UnexpectedResponse.for_response(response)
qdrant_client.http.exceptions.UnexpectedResponse: Unexpected Response: 500 (Internal Server Error)
Raw response content:
b'{"status":{"error":"Service internal error: The replica set for shard 3 on peer 6358360577973509 does not have enough active replicas"},"time":0.000686151}'
Steps to Reproduce
- Have one shard replica of a distributed deployment cluster, with replication factor = 2, fail
- Run a query (see the sketch below)
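A minimal sketch of the query step, assuming a hypothetical collection name, the default REST port 6333, and the 768-dimensional vectors from the collection config below:

from qdrant_client import QdrantClient

# Hypothetical node URL and collection name; any search against the collection triggers the error.
client = QdrantClient(url="http://qdrant-0.qdrant-headless:6333")
client.search(
    collection_name="my_collection",
    query_vector=[0.0] * 768,  # dummy query vector, 768-dim Cosine per the config below
    limit=10,
)
# With one replica of shard 3 dead, this raised UnexpectedResponse: 500
# "... does not have enough active replicas"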
Expected Behavior
Qdrant continues to index new vectors and respond to search requests.
Possible Solution
Context (Environment)
3-node cluster of qdrant 1.7.3 on Kubernetes, installed with the Helm chart qdrant-0.7.5. Each node is an AWS EC2 r6a.2xlarge with 8 vCPUs, 64 GB RAM, and a 100 GB gp3 EBS volume.
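Cluster state as reported by GET /cluster: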
{
"result": {
"status": "enabled",
"peer_id": 3141207761334255,
"peers": {
"6358360577973509": {
"uri": "http://qdrant-1.qdrant-headless:6335/"
},
"419724648802618": {
"uri": "http://qdrant-2.qdrant-headless:6335/"
},
"3141207761334255": {
"uri": "http://qdrant-0.qdrant-headless:6335/"
}
},
"raft_info": {
"term": 364,
"commit": 14441,
"pending_operations": 0,
"leader": 6358360577973509,
"role": "Follower",
"is_voter": true
},
"consensus_thread_status": {
"consensus_thread_status": "working",
"last_update": "2024-02-12T12:40:42.218687606Z"
},
"message_send_failures": {}
},
"status": "ok",
"time": 0.000006651
}
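Collection info (GET /collections/{collection_name}):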
{
"result": {
"status": "green",
"optimizer_status": "ok",
"vectors_count": 14122727,
"indexed_vectors_count": 14105677,
"points_count": 14122470,
"segments_count": 48,
"config": {
"params": {
"vectors": {
"size": 768,
"distance": "Cosine",
"on_disk": true
},
"shard_number": 6,
"replication_factor": 2,
"write_consistency_factor": 1,
"on_disk_payload": true
},
"hnsw_config": {
"m": 16,
"ef_construct": 100,
"full_scan_threshold": 10000,
"max_indexing_threads": 0,
"on_disk": true
},
"optimizer_config": {
"deleted_threshold": 0.2,
"vacuum_min_vector_number": 1000,
"default_segment_number": 0,
"max_segment_size": null,
"memmap_threshold": null,
"indexing_threshold": 20000,
"flush_interval_sec": 5,
"max_optimization_threads": 1
},
"wal_config": {
"wal_capacity_mb": 32,
"wal_segments_ahead": 0
},
"quantization_config": null
},
"payload_schema": {
"metadata.model_id": {
"data_type": "integer",
"points": 14122470
},
"metadata.team_id": {
"data_type": "integer",
"points": 14122470
}
}
},
"status": "ok",
"time": 0.001136403
}
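For reference, a collection with the parameters above could be created roughly like this. This is only a sketch against the REST API; the node URL, REST port 6333, and collection name are assumptions, and the body simply mirrors the config shown above.

import requests

# Sketch: create a collection with 6 shards, 2 replicas per shard, and write_consistency_factor 1.
requests.put(
    "http://qdrant-0.qdrant-headless:6333/collections/my_collection",
    json={
        "vectors": {"size": 768, "distance": "Cosine", "on_disk": True},
        "shard_number": 6,
        "replication_factor": 2,
        "write_consistency_factor": 1,
        "on_disk_payload": True,
    },
).raise_for_status()

Collection cluster info from the node hosting the dead replica (GET /collections/{collection_name}/cluster):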
{
"result": {
"peer_id": 6358360577973509,
"shard_count": 6,
"local_shards": [
{
"shard_id": 0,
"points_count": 2341907,
"state": "Active"
},
{
"shard_id": 2,
"points_count": 1944223,
"state": "Active"
},
{
"shard_id": 3,
"points_count": 2434372,
"state": "Dead"
},
{
"shard_id": 5,
"points_count": 2756990,
"state": "Active"
}
],
"remote_shards": [
{
"shard_id": 0,
"peer_id": 419724648802618,
"state": "Active"
},
{
"shard_id": 1,
"peer_id": 3141207761334255,
"state": "Active"
},
{
"shard_id": 1,
"peer_id": 419724648802618,
"state": "Active"
},
{
"shard_id": 2,
"peer_id": 3141207761334255,
"state": "Active"
},
{
"shard_id": 3,
"peer_id": 419724648802618,
"state": "Active"
},
{
"shard_id": 4,
"peer_id": 3141207761334255,
"state": "Active"
},
{
"shard_id": 4,
"peer_id": 419724648802618,
"state": "Active"
},
{
"shard_id": 5,
"peer_id": 3141207761334255,
"state": "Active"
}
],
"shard_transfers": []
},
"status": "ok",
"time": 0.000043171
}
The cluster writes log lines like these every 10 seconds:
qdrant-1 qdrant 2024-02-12T13:18:49.889916Z WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Wrong input: Cannot deactivate the last active replica 419724648802618 of shard 3
qdrant-2 qdrant 2024-02-12T13:18:49.894157Z WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Wrong input: Cannot deactivate the last active replica 419724648802618 of shard 3
qdrant-0 qdrant 2024-02-12T13:18:49.894543Z WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Wrong input: Cannot deactivate the last active replica 419724648802618 of shard 3
Detailed Description
Hey @azhelev, could you please check if your cluster has a consistent state? You need to make sure that
"raft_info": {
"term": 364,
"commit": 14441,
are the same on all nodes.
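A small sketch for comparing this across the nodes, assuming the default REST port 6333 and the pod names from the outputs above:

import requests

# Print raft term/commit/role for every peer; term and commit should converge to the same values.
for host in ["qdrant-0", "qdrant-1", "qdrant-2"]:
    raft = requests.get(f"http://{host}.qdrant-headless:6333/cluster").json()["result"]["raft_info"]
    print(host, raft["term"], raft["commit"], raft["role"])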
Hi @generall the state looks fine
{
"result": {
"status": "enabled",
"peer_id": 3141207761334255,
"peers": {
"419724648802618": {
"uri": "http://qdrant-2.qdrant-headless:6335/"
},
"3141207761334255": {
"uri": "http://qdrant-0.qdrant-headless:6335/"
},
"6358360577973509": {
"uri": "http://qdrant-1.qdrant-headless:6335/"
}
},
"raft_info": {
"term": 364,
"commit": 15599,
"pending_operations": 0,
"leader": 6358360577973509,
"role": "Follower",
"is_voter": true
},
"consensus_thread_status": {
"consensus_thread_status": "working",
"last_update": "2024-02-12T15:53:53.509658301Z"
},
"message_send_failures": {}
},
"status": "ok",
"time": 0.00000703
}
{
"result": {
"status": "enabled",
"peer_id": 6358360577973509,
"peers": {
"419724648802618": {
"uri": "http://qdrant-2.qdrant-headless:6335/"
},
"3141207761334255": {
"uri": "http://qdrant-0.qdrant-headless:6335/"
},
"6358360577973509": {
"uri": "http://qdrant-1.qdrant-headless:6335/"
}
},
"raft_info": {
"term": 364,
"commit": 15600,
"pending_operations": 0,
"leader": 6358360577973509,
"role": "Leader",
"is_voter": true
},
"consensus_thread_status": {
"consensus_thread_status": "working",
"last_update": "2024-02-12T15:54:08.275015449Z"
},
"message_send_failures": {}
},
"status": "ok",
"time": 0.00000626
}
{
"result": {
"status": "enabled",
"peer_id": 419724648802618,
"peers": {
"3141207761334255": {
"uri": "http://qdrant-0.qdrant-headless:6335/"
},
"419724648802618": {
"uri": "http://qdrant-2.qdrant-headless:6335/"
},
"6358360577973509": {
"uri": "http://qdrant-1.qdrant-headless:6335/"
}
},
"raft_info": {
"term": 364,
"commit": 15601,
"pending_operations": 0,
"leader": 6358360577973509,
"role": "Follower",
"is_voter": true
},
"consensus_thread_status": {
"consensus_thread_status": "working",
"last_update": "2024-02-12T15:54:16.767143549Z"
},
"message_send_failures": {}
},
"status": "ok",
"time": 0.00001385
}
which peer is down? I assume 419724648802618
From what I understand no peer is down; they are all up, and the GET /readyz endpoint responds with 200 OK on all of them. Only shard id 3 on peer 6358360577973509 is marked as Dead.
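For illustration, the difference between node liveness and replica state can be checked like this (a sketch, assuming the default REST port 6333 and a hypothetical collection name):

import requests

# A node can be ready (/readyz == 200) while one of its local replicas is Dead.
for host in ["qdrant-0", "qdrant-1", "qdrant-2"]:
    base = f"http://{host}.qdrant-headless:6333"
    ready = requests.get(f"{base}/readyz").status_code
    shards = requests.get(f"{base}/collections/my_collection/cluster").json()["result"]
    dead = [s for s in shards["local_shards"] if s["state"] != "Active"]
    print(host, "readyz:", ready, "non-active local shards:", dead)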
hm, I don't see a reason why the request would fail in this configuration. It is also interesting that shard recovery was not initiated
Same here:
GET /cluster
{
"result": {
"status": "enabled",
"peer_id": 399284531266390,
"peers": {
"3623990355959938": {
"uri": "http://urlslab-qdrant-1.urlslab-qdrant-headless:6335/"
},
"3536642733441919": {
"uri": "http://urlslab-qdrant-2.urlslab-qdrant-headless:6335/"
},
"399284531266390": {
"uri": "http://urlslab-qdrant-0.urlslab-qdrant-headless:6335/"
}
},
"raft_info": {
"term": 29,
"commit": 48609,
"pending_operations": 0,
"leader": 399284531266390,
"role": "Leader",
"is_voter": true
},
"consensus_thread_status": {
"consensus_thread_status": "working",
"last_update": "2024-02-18T07:58:01.971770346Z"
},
"message_send_failures": {}
},
"status": "ok",
"time": 0.00001001
}
Qdrant Server logs:
qdrant 2024-02-18T07:59:37.044781Z WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Wrong input: Cannot deactivate the last active replica 3536642733441919 of shard 18
At query time:
Service internal error: The replica set for shard 10 on peer 399284531266390 has no active replica
@generall, I think part of the issue lies in the fact that the shards don't get restarted correctly when they are dead. When I restarted all qdrant nodes, everything went back to normal again. But then the same issue happens again after a while... there could be a bug in shard recovery.
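For reference, when a replica stays Dead like this, it can sometimes be rebuilt without restarting the whole node by dropping the dead copy and re-replicating the shard from its healthy peer through the collection cluster API. A minimal sketch, assuming the shard/peer ids from the outputs above, a hypothetical collection name, and that the drop_replica/replicate_shard operations apply to this case:

import requests

BASE = "http://qdrant-0.qdrant-headless:6333"  # assumed: cluster operations can be sent to any node
COLLECTION = "my_collection"                   # hypothetical collection name

# Drop the dead copy of shard 3 on peer 6358360577973509 ...
requests.post(
    f"{BASE}/collections/{COLLECTION}/cluster",
    json={"drop_replica": {"shard_id": 3, "peer_id": 6358360577973509}},
).raise_for_status()

# ... then rebuild it from the healthy replica on peer 419724648802618.
requests.post(
    f"{BASE}/collections/{COLLECTION}/cluster",
    json={"replicate_shard": {"shard_id": 3,
                              "from_peer_id": 419724648802618,
                              "to_peer_id": 6358360577973509}},
).raise_for_status()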
could you please describe the scenario in which this happened?
We also experience this bug with a distributed qdrant deployment. When we upload a collection from a snapshot, the collection exists on all nodes but only contains points on the node that processed the upload.
Please make sure you are following the steps of the tutorial correctly - https://qdrant.tech/documentation/tutorials/create-snapshot/
Especially, make sure that you are using the ?priority=snapshot parameter on recovery.
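For example, uploading a snapshot with that parameter looks roughly like this; a sketch following the tutorial above, with a hypothetical node URL, collection name, and snapshot file name:

import requests

# Upload a snapshot to a node and prefer the snapshot data over existing replica data.
# Per the tutorial, this is repeated for every node of the cluster.
with open("my_collection.snapshot", "rb") as f:
    requests.post(
        "http://qdrant-0.qdrant-headless:6333/collections/my_collection/snapshots/upload"
        "?priority=snapshot",
        files={"snapshot": f},
    ).raise_for_status()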
Are there any plans to implement a way for Qdrant to take care of distributing data over the nodes itself, so that you have to upload a collection only once to the dashboard or a single API endpoint? We are deploying Qdrant in a Kubernetes cluster; the dashboard is behind a Service that routes to the endpoints of the pods, so we don't have control over the traffic routing - we can communicate with the individual pods inside the cluster, but not from outside. It would be nice if Qdrant were able to handle that, just like distributed database systems such as Stolon.