[Bug]: Failed to load data after inserting it
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version:2.2.2
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.2.2
- OS(Ubuntu or CentOS): centos
- CPU/Memory:
ip: 172.16.75.171 (coord)
OS: CentOS 7
CPU model: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Cores: 24
Memory: 92 GB
-----------------------------------------------------
ip: 172.16.75.172 (node)
OS: CentOS 7
CPU model: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Cores: 96
Memory: 366 GB
-----------------------------------------------------
ip: 172.16.75.173 (etcd, minio, pulsar)
OS: CentOS 7
CPU model: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Cores: 48
Memory: 183 GB
- GPU: Tesla T4
- Others:
Current Behavior
Load fails: progress stays at 0%.
Expected Behavior
After queue congestion occurs and the backlog is cleared, data can be loaded into memory
Steps To Reproduce
1. Create a collection
2. Create an index
3. Insert data; insertion gets stuck (queue congestion)
4. Clear the queue backlog
5. Insert data again
6. Load the collection: progress stays at 0% (a progress-polling sketch follows this list)
7. Rebuild the index
8. Load again: progress still stays at 0%
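For reference, a minimal sketch (pymilvus 2.2.x assumed; host and collection name are taken from the report below) for polling the load progress from a second session while the load hangs:

```python
# A minimal sketch, assuming pymilvus 2.2.x, to watch the load progress
# from a separate session while collection.load() hangs at 0% elsewhere.
import time

from pymilvus import connections, utility

connections.connect("default", host="172.16.75.171", port="19530")

for _ in range(30):
    # On pymilvus 2.2.x this returns a dict such as {'loading_progress': '0%'}
    print(utility.loading_progress("long_128"))
    time.sleep(10)
```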
Milvus Log
Link: https://pan.baidu.com/s/1hnOQGNDMR0O1M79IAPFBjg (extraction code: 6m4x)
Anything else?
No response
Please offer detailed logs so we can take a deep look into it.
I have uploaded it
From the logs it seems the cluster works as expected. My only suggestion is to set the log level to INFO rather than WARN. Are you still in the stuck stage? Or do we have logs from when it was stuck?
new logs
https://www.aliyundrive.com/s/8baLiTM6TE7
The problem still exists. I extracted 500 minutes of logs, and I hope you can find the cause.
The problem: after inserting the data and creating the HNSW index, the data could not be loaded.
```python
index_params = {
    "metric_type": "IP",
    "index_type": "HNSW",
    "params": {"M": 32, "efConstruction": 128},
}
collection = Collection("long_128")  # Get an existing collection.
collection.drop_index()
collection.create_index(
    field_name="embeddings",
    index_params=index_params,
)
```
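If it helps with triage: a minimal sketch (pymilvus 2.2.x and the collection name above assumed) to confirm the HNSW build actually finished before attempting to load:

```python
# A minimal sketch, assuming pymilvus 2.2.x and the "long_128" collection
# above, to check whether the index build has finished before loading.
from pymilvus import connections, utility

connections.connect("default", host="172.16.75.171", port="19530")

# Returns a dict such as {'total_rows': ..., 'indexed_rows': ...}
print(utility.index_building_progress("long_128"))

# Optionally block until the build completes
utility.wait_for_index_building_complete("long_128")
```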
About 110 million rows were inserted.

The code to create the collection:
```python
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

fields = [
    FieldSchema(name="index", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="string", dtype=DataType.VARCHAR, max_length=128),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=128),
]
schema = CollectionSchema(fields, "hello_milvus is the simplest demo to introduce the APIs")
hello_milvus = Collection("long_128", schema)
```
@xiaofan-luan
@xiao12mm How can I get the new logs? In the previously attached logs, I did not find a collection named "long_128", only "long_256" and "q_128". /assign @xiao12mm Could you please double-check that the etcd and pulsar services are running well? The logs below indicate that Milvus fails to get some meta info of segments.
```
[2023/02/24 13:19:41.843 +00:00] [ERROR] [meta/coordinator_broker.go:151] ["failed to get segment info from DataCoord"]
[2023/02/24 13:19:41.855 +00:00] [WARN] [task/executor.go:411] ["failed to subscribe DmChannel, failed to fill the request with segments"] [taskID=46485] [collectionID=439629368484968883] [channel=by-dev-rootcoord-dml_20_439629368484968883v0] [node=246] [source=1] [error="context deadline exceeded"]
```
https://www.aliyundrive.com/s/8baLiTM6TE7 Please download it
@yanliang567
Now, when I insert with a single thread, I can load the data; when I insert with multiple threads, I cannot. So I think there is room for improvement here.
@xiao12mm How did you insert with multiple threads? Could you please share the code snippet so we can reproduce the issue in-house?
```python
# encoding: utf-8
import time

from pymilvus import (
    connections,
    Collection,
)

if __name__ == "__main__":
    connections.connect("default", host="172.16.75.171", port="19530")
    hello_milvus = Collection("long_128_1")
    start_time = time.time()
    new_list = ['120-129.txt', '130-139.txt', '140-149.txt']
    jid_list = []
    embedding_list = []
    total_list = []
    count_len = 10000
    for txt_name in new_list:
        num = 0
        with open('./' + txt_name, 'r', encoding='utf-8') as f:
            while True:
                num += 1
                print(txt_name, num)
                content = f.readline()
                if content == '':
                    break
                line = content.strip()
                if line == "":
                    continue
                if len(line.split('\t', 1)) != 2:
                    continue
                jid, embedding = line.split('\t', 1)
                if len(eval(embedding)) != 128:
                    continue
                jid_list.append(jid)
                embedding_list.append(eval(embedding))
                if len(jid_list) < count_len:
                    continue
                total_list.append(jid_list)
                total_list.append(embedding_list)
                hello_milvus.insert(total_list)
                hello_milvus.flush()
                jid_list = []
                embedding_list = []
                total_list = []
    if jid_list:
        # insert the remaining rows that did not fill a full batch
        total_list.append(jid_list)
        total_list.append(embedding_list)
        hello_milvus.insert(total_list)
        hello_milvus.flush()
    end_time = time.time()
    with open('./run_time.txt', 'a', encoding='utf-8') as f2:
        f2.write('Insert elapsed time: ' + str(end_time - start_time) + '\n')
```
I used five such scripts to insert; they were all the same except for new_list.
I'm using 5 threads, so the theoretical maximum number of rows being inserted at the same time is 50,000 (5 × 10,000). Maybe that's the problem here.
Okay, that makes sense. Multi-threaded inserting puts a heavy workload on pulsar, which makes it slower and slower. Based on our in-house test results, multi-threaded inserting does not increase insert throughput, and a single thread with a batch size of 50k~100k rows is the best way to insert.
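For illustration, a minimal sketch of that suggestion: one writer, ~50k rows per insert() call. The host, collection name, and random payload are assumptions, not part of the original report.

```python
# A minimal sketch of the single-threaded, large-batch insert pattern
# suggested above. The host, collection name, and random payload are
# illustrative assumptions.
import random

from pymilvus import Collection, connections

connections.connect("default", host="172.16.75.171", port="19530")
coll = Collection("long_128_1")

BATCH = 50_000  # one thread, 50k~100k rows per batch
strings = [f"id-{i}" for i in range(BATCH)]
vectors = [[random.random() for _ in range(128)] for _ in range(BATCH)]

coll.insert([strings, vectors])  # one large batch instead of many small ones
coll.flush()
```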
I don't think that really makes sense. I suspect this is actually due to OOM. You can set a larger proxy memory limit and try.
How do I set a larger proxy memory limit?
How did you deploy Milvus? Generally, you can set larger resource requests and limits in the Milvus YAML file. https://milvus.io/docs/v2.3.0-beta/install_cluster-milvusoperator.md
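For example, with Milvus Operator the proxy memory limit would be raised roughly like this (a sketch; the field layout is assumed from the linked docs and the values are illustrative, so verify against your CRD version):

```yaml
# A sketch of raising the proxy memory limit via the Milvus Operator CR.
# Field layout assumed from the linked docs; values are illustrative.
apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: my-release
spec:
  components:
    proxy:
      resources:
        requests:
          memory: 8Gi
        limits:
          memory: 16Gi   # raise the proxy memory limit here
```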
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen