
Distributed deployment cluster with a single dead shard fails to respond to queries

Open azhelev opened this issue 1 year ago • 11 comments

Running a distributed deployment of Qdrant on Kubernetes with a single collection and a replication factor of 2. One shard failed, and now the cluster fails to respond to queries. I thought the loss of a single shard in this configuration shouldn't be a problem.

Current Behavior

The cluster fails to run a query

2024-02-11 17:28:39 INFO Received input: session_id=62049 query='xxxx xxxxx xxxxx' team_id=1866 file_id=[3074553] sitemap_id=None env_name='xxxxxx'
INFO:     100.100.29.236:39890 - "POST /qdrant/conversation HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  
  File "/home/searchie/.local/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 97, in send
    raise UnexpectedResponse.for_response(response)
qdrant_client.http.exceptions.UnexpectedResponse: Unexpected Response: 500 (Internal Server Error)
Raw response content:
b'{"status":{"error":"Service internal error: The replica set for shard 3 on peer 6358360577973509 does not have enough active replicas"},"time":0.000686151}'

Steps to Reproduce

  1. Have one shard of a distributed deployment cluster, with replication factor = 2, fail
  2. Run a query
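
For reference, a minimal sketch of the kind of search call that produces the error above, using qdrant_client (the URL and collection name are placeholders; the 768-dimensional vector matches the collection config shown under Context below):

from qdrant_client import QdrantClient

# Placeholder URL and collection name (not taken from the issue);
# the 768-dim vector size matches the collection config reported below.
client = QdrantClient(url="http://qdrant-0.qdrant-headless:6333")

# With one replica of shard 3 marked Dead, this call comes back as
# HTTP 500 "does not have enough active replicas" instead of results.
hits = client.search(
    collection_name="my_collection",
    query_vector=[0.0] * 768,
    limit=10,
)
print(hits)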

Expected Behavior

Qdrant should continue to index new vectors and respond to search requests.

Possible Solution

Context (Environment)

3-node cluster of Qdrant 1.7.3 on Kubernetes, installed with the Helm chart qdrant-0.7.5. Each node is an AWS EC2 r6a.2xlarge with 8 vCPUs and 64 GB RAM, and a 100 GB gp3 EBS volume.
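
For context, the three JSON blocks below look like the output of the standard cluster and collection info endpoints; a sketch of fetching them with requests (the node URL and collection name are assumptions):

import requests

BASE = "http://qdrant-0.qdrant-headless:6333"  # default REST port; assumption
COLLECTION = "my_collection"                   # placeholder collection name

# Cluster status: peers and raft_info (first JSON block below).
print(requests.get(f"{BASE}/cluster").json())

# Collection info: config and point counts (second JSON block below).
print(requests.get(f"{BASE}/collections/{COLLECTION}").json())

# Shard placement for the collection: local/remote shards and their
# states, including the Dead shard 3 (third JSON block below).
print(requests.get(f"{BASE}/collections/{COLLECTION}/cluster").json())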

{
  "result": {
    "status": "enabled",
    "peer_id": 3141207761334255,
    "peers": {
      "6358360577973509": {
        "uri": "http://qdrant-1.qdrant-headless:6335/"
      },
      "419724648802618": {
        "uri": "http://qdrant-2.qdrant-headless:6335/"
      },
      "3141207761334255": {
        "uri": "http://qdrant-0.qdrant-headless:6335/"
      }
    },
    "raft_info": {
      "term": 364,
      "commit": 14441,
      "pending_operations": 0,
      "leader": 6358360577973509,
      "role": "Follower",
      "is_voter": true
    },
    "consensus_thread_status": {
      "consensus_thread_status": "working",
      "last_update": "2024-02-12T12:40:42.218687606Z"
    },
    "message_send_failures": {}
  },
  "status": "ok",
  "time": 0.000006651
}
{
  "result": {
    "status": "green",
    "optimizer_status": "ok",
    "vectors_count": 14122727,
    "indexed_vectors_count": 14105677,
    "points_count": 14122470,
    "segments_count": 48,
    "config": {
      "params": {
        "vectors": {
          "size": 768,
          "distance": "Cosine",
          "on_disk": true
        },
        "shard_number": 6,
        "replication_factor": 2,
        "write_consistency_factor": 1,
        "on_disk_payload": true
      },
      "hnsw_config": {
        "m": 16,
        "ef_construct": 100,
        "full_scan_threshold": 10000,
        "max_indexing_threads": 0,
        "on_disk": true
      },
      "optimizer_config": {
        "deleted_threshold": 0.2,
        "vacuum_min_vector_number": 1000,
        "default_segment_number": 0,
        "max_segment_size": null,
        "memmap_threshold": null,
        "indexing_threshold": 20000,
        "flush_interval_sec": 5,
        "max_optimization_threads": 1
      },
      "wal_config": {
        "wal_capacity_mb": 32,
        "wal_segments_ahead": 0
      },
      "quantization_config": null
    },
    "payload_schema": {
      "metadata.model_id": {
        "data_type": "integer",
        "points": 14122470
      },
      "metadata.team_id": {
        "data_type": "integer",
        "points": 14122470
      }
    }
  },
  "status": "ok",
  "time": 0.001136403
}
{
  "result": {
    "peer_id": 6358360577973509,
    "shard_count": 6,
    "local_shards": [
      {
        "shard_id": 0,
        "points_count": 2341907,
        "state": "Active"
      },
      {
        "shard_id": 2,
        "points_count": 1944223,
        "state": "Active"
      },
      {
        "shard_id": 3,
        "points_count": 2434372,
        "state": "Dead"
      },
      {
        "shard_id": 5,
        "points_count": 2756990,
        "state": "Active"
      }
    ],
    "remote_shards": [
      {
        "shard_id": 0,
        "peer_id": 419724648802618,
        "state": "Active"
      },
      {
        "shard_id": 1,
        "peer_id": 3141207761334255,
        "state": "Active"
      },
      {
        "shard_id": 1,
        "peer_id": 419724648802618,
        "state": "Active"
      },
      {
        "shard_id": 2,
        "peer_id": 3141207761334255,
        "state": "Active"
      },
      {
        "shard_id": 3,
        "peer_id": 419724648802618,
        "state": "Active"
      },
      {
        "shard_id": 4,
        "peer_id": 3141207761334255,
        "state": "Active"
      },
      {
        "shard_id": 4,
        "peer_id": 419724648802618,
        "state": "Active"
      },
      {
        "shard_id": 5,
        "peer_id": 3141207761334255,
        "state": "Active"
      }
    ],
    "shard_transfers": []
  },
  "status": "ok",
  "time": 0.000043171
}

The cluster writes log entries like this every 10 seconds:

qdrant-1 qdrant 2024-02-12T13:18:49.889916Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Wrong input: Cannot deactivate the last active replica 419724648802618 of shard 3    
qdrant-2 qdrant 2024-02-12T13:18:49.894157Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Wrong input: Cannot deactivate the last active replica 419724648802618 of shard 3    
qdrant-0 qdrant 2024-02-12T13:18:49.894543Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Wrong input: Cannot deactivate the last active replica 419724648802618 of shard 3

Detailed Description

azhelev avatar Feb 12 '24 13:02 azhelev

Hey @azhelev, could you please check if your cluster has a consistent state? You need to make sure that

    "raft_info": {
      "term": 364,
      "commit": 14441,

are the same on all nodes.
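
For example, a quick way to compare these values across the three pods (a sketch assuming the pod names from this issue and the default REST port 6333):

import requests

# Pod DNS names taken from the issue; the REST port 6333 is an assumption.
nodes = [
    "http://qdrant-0.qdrant-headless:6333",
    "http://qdrant-1.qdrant-headless:6333",
    "http://qdrant-2.qdrant-headless:6333",
]

for node in nodes:
    raft = requests.get(f"{node}/cluster").json()["result"]["raft_info"]
    print(node, "term:", raft["term"], "commit:", raft["commit"])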

generall avatar Feb 12 '24 14:02 generall

Hi @generall, the state looks fine:

{
  "result": {
    "status": "enabled",
    "peer_id": 3141207761334255,
    "peers": {
      "419724648802618": {
        "uri": "http://qdrant-2.qdrant-headless:6335/"
      },
      "3141207761334255": {
        "uri": "http://qdrant-0.qdrant-headless:6335/"
      },
      "6358360577973509": {
        "uri": "http://qdrant-1.qdrant-headless:6335/"
      }
    },
    "raft_info": {
      "term": 364,
      "commit": 15599,
      "pending_operations": 0,
      "leader": 6358360577973509,
      "role": "Follower",
      "is_voter": true
    },
    "consensus_thread_status": {
      "consensus_thread_status": "working",
      "last_update": "2024-02-12T15:53:53.509658301Z"
    },
    "message_send_failures": {}
  },
  "status": "ok",
  "time": 0.00000703
}
{
  "result": {
    "status": "enabled",
    "peer_id": 6358360577973509,
    "peers": {
      "419724648802618": {
        "uri": "http://qdrant-2.qdrant-headless:6335/"
      },
      "3141207761334255": {
        "uri": "http://qdrant-0.qdrant-headless:6335/"
      },
      "6358360577973509": {
        "uri": "http://qdrant-1.qdrant-headless:6335/"
      }
    },
    "raft_info": {
      "term": 364,
      "commit": 15600,
      "pending_operations": 0,
      "leader": 6358360577973509,
      "role": "Leader",
      "is_voter": true
    },
    "consensus_thread_status": {
      "consensus_thread_status": "working",
      "last_update": "2024-02-12T15:54:08.275015449Z"
    },
    "message_send_failures": {}
  },
  "status": "ok",
  "time": 0.00000626
}
{
  "result": {
    "status": "enabled",
    "peer_id": 419724648802618,
    "peers": {
      "3141207761334255": {
        "uri": "http://qdrant-0.qdrant-headless:6335/"
      },
      "419724648802618": {
        "uri": "http://qdrant-2.qdrant-headless:6335/"
      },
      "6358360577973509": {
        "uri": "http://qdrant-1.qdrant-headless:6335/"
      }
    },
    "raft_info": {
      "term": 364,
      "commit": 15601,
      "pending_operations": 0,
      "leader": 6358360577973509,
      "role": "Follower",
      "is_voter": true
    },
    "consensus_thread_status": {
      "consensus_thread_status": "working",
      "last_update": "2024-02-12T15:54:16.767143549Z"
    },
    "message_send_failures": {}
  },
  "status": "ok",
  "time": 0.00001385
}

azhelev avatar Feb 12 '24 15:02 azhelev

Which peer is down? I assume 419724648802618.

generall avatar Feb 12 '24 21:02 generall

From what I understand, no peer is down; they are all up, and the GET /readyz endpoint responds with 200 OK on all of them. Only shard 3 on 6358360577973509 is marked as Dead.

azhelev avatar Feb 13 '24 05:02 azhelev

Hm, I don't see a reason why a request would fail in this configuration. It is also interesting that shard recovery has not been initiated.
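
Not a confirmed fix, but one manual workaround to try is re-replicating the dead shard from its healthy peer via the collection cluster API. A sketch using the peer IDs from this issue (the collection name is a placeholder, and whether the operation is accepted while a Dead copy of the shard already exists on the target peer may need checking):

import requests

BASE = "http://qdrant-0.qdrant-headless:6333"  # any node's REST API; assumption
COLLECTION = "my_collection"                   # placeholder collection name

# Shard 3 is Dead on peer 6358360577973509 while its copy on peer
# 419724648802618 is still Active, so re-replicate from the healthy peer.
operation = {
    "replicate_shard": {
        "shard_id": 3,
        "from_peer_id": 419724648802618,
        "to_peer_id": 6358360577973509,
    }
}
resp = requests.post(f"{BASE}/collections/{COLLECTION}/cluster", json=operation)
print(resp.json())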

generall avatar Feb 13 '24 20:02 generall

Same here. GET /cluster:

{
  "result": {
    "status": "enabled",
    "peer_id": 399284531266390,
    "peers": {
      "3623990355959938": {
        "uri": "http://urlslab-qdrant-1.urlslab-qdrant-headless:6335/"
      },
      "3536642733441919": {
        "uri": "http://urlslab-qdrant-2.urlslab-qdrant-headless:6335/"
      },
      "399284531266390": {
        "uri": "http://urlslab-qdrant-0.urlslab-qdrant-headless:6335/"
      }
    },
    "raft_info": {
      "term": 29,
      "commit": 48609,
      "pending_operations": 0,
      "leader": 399284531266390,
      "role": "Leader",
      "is_voter": true
    },
    "consensus_thread_status": {
      "consensus_thread_status": "working",
      "last_update": "2024-02-18T07:58:01.971770346Z"
    },
    "message_send_failures": {}
  },
  "status": "ok",
  "time": 0.00001001
}

Qdrant Server logs:

qdrant 2024-02-18T07:59:37.044781Z  WARN storage::content_manager::consensus_manager: Failed to apply collection meta operation entry with user error: Wrong input: Cannot deactivate the last active replica 3536642733441919 of shard 18

At query time:

Service internal error: The replica set for shard 10 on peer 399284531266390 has no active replica

yasha-dev1 avatar Feb 18 '24 08:02 yasha-dev1

@generall, I think part of the issue is that shards don't get restarted correctly when they are dead. When I restarted all Qdrant nodes, everything went back to normal again, but then the same issue happens after a while... there could be a bug in shard recovery.

yasha-dev1 avatar Feb 19 '24 06:02 yasha-dev1

Could you please describe the scenario in which this happened?

generall avatar Feb 19 '24 09:02 generall

We also experience this bug with a distributed Qdrant deployment. When we upload a collection from a snapshot, the collection exists on all nodes but only contains points on the node that processed the upload.

tZimmermann98 avatar May 10 '24 15:05 tZimmermann98

When we upload a collection from a snapshot, the collection exists on all nodes but only contains points on the node that processed the upload

Please make sure you are following the steps of the tutorial correctly - https://qdrant.tech/documentation/tutorials/create-snapshot/

In particular, make sure you are using the ?priority=snapshot parameter on recovery.
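
For illustration, a sketch of that parameter in use when uploading a snapshot file directly to a node (the node URL, collection name, and snapshot file name are placeholders):

import requests

NODE = "http://qdrant-node-0:6333"   # placeholder node URL
COLLECTION = "my_collection"         # placeholder collection name

# Upload the snapshot file to the node, telling it to prefer the
# snapshot's data over the state it currently holds for this collection.
with open("my_collection.snapshot", "rb") as snapshot_file:
    resp = requests.post(
        f"{NODE}/collections/{COLLECTION}/snapshots/upload",
        params={"priority": "snapshot"},
        files={"snapshot": snapshot_file},
    )
print(resp.json())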

generall avatar May 10 '24 18:05 generall

Are there any plans to implement a way for Qdrant to handle distributing data over the nodes itself, so that you only have to upload a collection once to a dashboard or a single API endpoint? We are deploying Qdrant in a Kubernetes cluster; the dashboard is behind a Service that routes to the endpoints of the pods, so we don't have control over the traffic routing - we can communicate with the individual pods inside the cluster, but not from outside. It would be nice if Qdrant were able to handle that, just like distributed database systems such as Stolon.

JWandscheer avatar May 17 '24 10:05 JWandscheer