guoyande
1. Vearch does not currently support data backup.
2. After restarting, you can call the health API and check whether the two clusters report the same document count.
3. Is it the `curl -XGET http://router_server/$db_name/$space_name/$id` call that fails when you query the image?
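A minimal sketch of the comparison in step 2, assuming the health API returns a JSON list of databases each carrying a `doc_num` field (as in the response shown later in this thread). The helper name and the stand-in responses are illustrative, not part of vearch:

```python
import json

def total_doc_num(health_json):
    """Sum doc_num across all databases in a parsed health-API response."""
    return sum(db.get("doc_num", 0) for db in health_json)

# Stand-in responses; in practice, fetch each cluster's health endpoint
# and parse the body with json.loads before comparing.
old_cluster = json.loads('[{"doc_num": 32785394}]')
new_cluster = json.loads('[{"doc_num": 32785394}]')
print(total_doc_num(old_cluster) == total_doc_num(new_cluster))
```

If the totals differ after the restart, some partition data was likely not copied or loaded correctly.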
Our usual procedure: create a new cluster, and create the table there with the same table parameters and the same number of partitions as the old cluster. Kill the ps processes, copy each partition's data from the old cluster to the new one, making sure the partition ids match, then restart.
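The copy step above might look like the following sketch, assuming a layout of one directory per partition id under the data path. All paths and partition ids here are hypothetical; the demo uses temp directories so it can run anywhere, and in production you would stop the ps processes first and point the variables at the real data directories:

```shell
# Stand-ins for the old and new clusters' data directories.
OLD_DATA=$(mktemp -d)
NEW_DATA=$(mktemp -d)

# Fake two partition directories on the "old" cluster for the demo.
mkdir -p "$OLD_DATA/3" "$OLD_DATA/5"
echo sample > "$OLD_DATA/3/data"

# With the ps processes stopped, copy each partition directory.
# The target partition id must match the source partition id.
for pid in 3 5; do
  cp -r "$OLD_DATA/$pid" "$NEW_DATA/$pid"
done
```

After the copy, restart the ps processes and verify the document counts with the health API.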
Could the get failure be because your partition ids don't match up?
We recommend using version 3.2.5 or 3.2.7; 3.1.0 and 3.2.0 are fairly old and no longer maintained.
You can refer to this document: https://github.com/vearch/vearch/blob/master/docs/APILowLevel.md . This one, https://vearch.readthedocs.io/zh_CN/latest/ , is fairly comprehensive; if you find any discrepancies, please point them out and we will fix them.
> ```
> [{"doc_num":32785394,"size":587879328,"spaces":[{"partitions":[{"pid":3,"doc_num":19988365,"size":299531544,"replica_num":3,"path":"/data/rrjia/vearch_datas/","unreachable":[2],"status":4,"color":"yellow","ip":"172.31.169.102","node_id":3,"raft_status":{"ID":3,"NodeID":3,"Leader":3,"Term":2,"Index":19988366,"Commit":19988366,"Applied":19988366,"Vote":3,"PendQueue":0,"RecvQueue":0,"AppQueue":0,"Stopped":false,"RestoringSnapshot":false,"State":"StateLeader","Replicas":{"3":{"Match":19988366,"Commit":19988366,"Next":19988367,"State":"ReplicaStateProbe","Snapshoting":false,"Paused":false,"Active":false,"LastActive":1632391852105841436,"Inflight":0},"1":{"Match":19988366,"Commit":19988366,"Next":19988367,"State":"ReplicaStateReplicate","Snapshoting":false,"Paused":false,"Active":true,"LastActive":1633653906105116793,"Inflight":0},"2":{"Match":19987718,"Commit":19987718,"Next":19987847,"State":"ReplicaStateReplicate","Snapshoting":false,"Paused":false,"Active":false,"LastActive":1633171570943107484,"Inflight":128}}},"index_status":2},{"pid":4,"replica_num":3,"color":"red","ip":"172.31.169.102","node_id":3,"index_status":0},{"pid":5,"doc_num":12797029,"size":288347784,"replica_num":3,"path":"/data/rrjia/vearch_datas/","status":3,"color":"green","ip":"172.31.161.77","node_id":1,"raft_status":{"ID":5,"NodeID":1,"Leader":2,"Term":5,"Index":12797033,"Commit":12797033,"Applied":12797033,"Vote":2,"PendQueue":0,"RecvQueue":0,"AppQueue":0,"Stopped":false,"RestoringSnapshot":false,"State":"StateFollower","Replicas":{}},"index_status":2}],"status":"red","name":"image_table","partition_num":3,"replica_num":9,"doc_num":32785394,"size":587879328}],"status":"red","errors":["query space:[image_table] server:[3] partition:[4] info err :[partition_is_closed]"],"db_name":"image_db","space_num":1}]
> ```
>
> For a cluster showing yellow or red like this, how should I troubleshoot and fix it?

This is the health API's response; according to it, a node appears to be down, which is causing the data writes to fail.
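Based on the response fields shown above (`spaces` → `partitions` → `color`), a small helper can list the partitions that are not green and therefore need attention, such as pid 4 reporting red here. This is an illustrative sketch, not vearch code:

```python
def unhealthy_partitions(health_json):
    """Collect (space_name, pid, color) for partitions not reporting green."""
    bad = []
    for db in health_json:
        for space in db.get("spaces", []):
            for part in space.get("partitions", []):
                if part.get("color") != "green":
                    bad.append((space["name"], part["pid"], part.get("color")))
    return bad
```

Running this over the response above would flag pid 3 (yellow, one unreachable replica) and pid 4 (red, `partition_is_closed`).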
> After shutting down the whole cluster, the restart failed. The log on one of the machines shows:
>
> ```
> 2021-10-08 15:55:10.738508 W | etcdserver: read-only range request "key:\"/lock/watch/server/\" range_end:\"/lock/watch/server0\" limit:1 sort_order:DESCEND sort_target:MOD max_mod_revision:315 " with result "range_response_count:0 size:7" took too long (244.300525ms) to...
> ```
> > Based on the health API's response, a node appears to be down, which is causing the data writes to fail.
>
> After a node goes down, how should it be recovered?

Normally a downed ps node can be recovered by simply restarting it. You can pull the latest code from GitHub and test the restart with that.
The scalar index has a thread-safety bug; a fix will be submitted soon.
We spoke with you last week. Does the issue still occur when testing with the newly released version?