
[Bug]: Failed to load data after inserting it

Open ingale726 opened this issue 2 years ago • 17 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.2.2
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2): 2.2.2
- OS (Ubuntu or CentOS): CentOS
- CPU/Memory:
ip: 172.16.75.171              coord
OS: CentOS 7
CPU model: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Cores: 24
Memory: 92G
-----------------------------------------------------
ip: 172.16.75.172            node
OS: CentOS 7
CPU model: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Cores: 96
Memory: 366G
-----------------------------------------------------
ip: 172.16.75.173          etcd minio pulsar
OS: CentOS 7
CPU model: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Cores: 48
Memory: 183G
- GPU: Tesla T4
- Others:

Current Behavior

Load ---------- fails (progress stays at 0%)

Expected Behavior

After queue congestion occurs and the backlog is cleared, data can be loaded into memory

Steps To Reproduce

Create a collection

Create an index

Insert data (the insert gets stuck due to queue congestion)

Clear the queue backlog

Insert data

Load ---------- stays at 0% (see the load-progress sketch below)

Rebuild the index

Load ---------- still at 0%
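A minimal sketch of how the stuck load can be observed with pymilvus (connection details and collection name are taken from elsewhere in this thread; the exact shape of the loading_progress return value varies across pymilvus versions):

from pymilvus import connections, utility, Collection

connections.connect("default", host="172.16.75.171", port="19530")
collection = Collection("long_128")

try:
    collection.load(timeout=60)        # blocks until loaded; raises on timeout
except Exception as e:
    print("load did not finish:", e)
    # with the bug reported here, progress never moves past 0%
    print(utility.loading_progress("long_128"))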

Milvus Log

Link: https://pan.baidu.com/s/1hnOQGNDMR0O1M79IAPFBjg  Extraction code: 6m4x

Anything else?

No response

ingale726 avatar Feb 24 '23 13:02 ingale726

Please provide detailed logs so we can take a deeper look into it.

xiaofan-luan avatar Feb 24 '23 14:02 xiaofan-luan

I have uploaded it

ingale726 avatar Feb 24 '23 14:02 ingale726

From the log it seems the cluster works as expected. My only suggestion is to turn the log level to INFO rather than WARN. Are you still in the stuck stage, or do we have logs from when it was stuck?

xiaofan-luan avatar Feb 24 '23 15:02 xiaofan-luan

New logs: https://www.aliyundrive.com/s/8baLiTM6TE7. The problem still exists. I extracted 500 minutes of logs, and I hope you can find the cause of the problem: after inserting the data, the data could not be loaded. The HNSW index was created with:

from pymilvus import Collection

index_params = {
    "metric_type": "IP",
    "index_type": "HNSW",
    "params": {"M": 32, "efConstruction": 128},
}
collection = Collection("long_128")      # Get an existing collection.
collection.drop_index()                  # Drop the old index before rebuilding.
collection.create_index(
    field_name="embeddings",
    index_params=index_params,
)
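Index building on roughly 110 million rows can take a long time, so it is worth confirming the build actually finished before attempting to load. A minimal sketch using the pymilvus utility helpers (the exact keys in the progress dict may differ between versions):

from pymilvus import utility

# Block until the index build finishes for this collection.
utility.wait_for_index_building_complete("long_128")

# Or poll explicitly; returns counts like {'total_rows': ..., 'indexed_rows': ...}.
print(utility.index_building_progress("long_128"))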

About 110 million rows were inserted. The code to create the collection:

from pymilvus import CollectionSchema, DataType, FieldSchema, Collection

fields = [
    FieldSchema(name="index", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="string", dtype=DataType.VARCHAR, max_length=128),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=128),
]
schema = CollectionSchema(fields, "hello_milvus is the simplest demo to introduce the APIs")
hello_milvus = Collection("long_128", schema)
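To double-check how many rows actually landed in the collection, a quick sketch (flush first so the count reflects all sealed segments; about 110 million is expected here):

from pymilvus import Collection

collection = Collection("long_128")
collection.flush()               # seal any growing segments first
print(collection.num_entities)   # should be roughly 110 million in this report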

@xiaofan-luan

ingale726 avatar Feb 26 '23 16:02 ingale726

@xiao12mm how do we get the new logs? In the previously attached logs, I did not find a collection named "long_128", only "long_256" and "q_128". /assign @xiao12mm could you please double-check that the etcd service and pulsar service are running well? The logs below indicate that Milvus fails to get some meta info of segments.

[2023/02/24 13:19:41.843 +00:00] [ERROR] [meta/coordinator_broker.go:151] ["failed to get segment info from DataCoord"] 
[2023/02/24 13:19:41.855 +00:00] [WARN] [task/executor.go:411] ["failed to subscribe DmChannel, failed to fill the request with segments"] [taskID=46485] [collectionID=439629368484968883] [channel=by-dev-rootcoord-dml_20_439629368484968883v0] [node=246] [source=1] [error="context deadline exceeded"]

yanliang567 avatar Feb 27 '23 01:02 yanliang567

https://www.aliyundrive.com/s/8baLiTM6TE7 Please download it

ingale726 avatar Feb 27 '23 01:02 ingale726

@yanliang567

ingale726 avatar Feb 27 '23 01:02 ingale726

Now when I insert with a single thread, I can load the data; when I insert with multiple threads, I can't load it. So I think there's room for improvement here.

ingale726 avatar Mar 01 '23 01:03 ingale726

@xiao12mm how did you insert with multiple threads? Could you please share the code snippet so we can reproduce the issue in house?

yanliang567 avatar Mar 01 '23 03:03 yanliang567

# encoding: utf-8

import ast
import time

from pymilvus import (
    connections,
    Collection,
)


if __name__ == "__main__":
    connections.connect("default", host="172.16.75.171", port="19530")
    hello_milvus = Collection("long_128_1")
    start_time = time.time()

    new_list = ['120-129.txt', '130-139.txt', '140-149.txt']
    jid_list = []
    embedding_list = []
    count_len = 10000                      # insert in batches of 10,000 rows
    for txt_name in new_list:
        num = 0
        with open('./' + txt_name, 'r', encoding='utf-8') as f:
            while True:
                num += 1
                print(txt_name, num)
                content = f.readline()
                if content == '':          # end of file
                    break
                line = content.strip()
                if line == "":
                    continue
                # each line holds "<jid>\t<128-dim vector literal>"
                parts = line.split('\t', 1)
                if len(parts) != 2:
                    continue
                jid, embedding = parts
                embedding = ast.literal_eval(embedding)   # safer than eval()
                if len(embedding) != 128:
                    continue
                jid_list.append(jid)
                embedding_list.append(embedding)
                if len(jid_list) < count_len:
                    continue
                # column-based insert: one list per non-auto-id field
                hello_milvus.insert([jid_list, embedding_list])
                hello_milvus.flush()       # flushing every batch is expensive
                jid_list = []
                embedding_list = []
    if jid_list:                           # insert the final partial batch, if any
        hello_milvus.insert([jid_list, embedding_list])
        hello_milvus.flush()
    end_time = time.time()
    with open('./run_time.txt', 'a', encoding='utf-8') as f2:
        f2.write('Insert elapsed time: ' + str(end_time - start_time) + '\n')


ingale726 avatar Mar 01 '23 03:03 ingale726

I used five such threads to insert, all identical except for new_list.

ingale726 avatar Mar 01 '23 03:03 ingale726

I'm using 5 threads, so in theory up to 50,000 rows can be in flight at the same time. Maybe that's the problem here.

ingale726 avatar Mar 01 '23 03:03 ingale726

Okay, that makes sense. Multi-threaded inserting puts a heavy workload on Pulsar, which makes it slower and slower. Referring to our in-house test results, multi-threaded inserting does not increase insert throughput; a single thread with batches of 50k~100k rows is the best insert pattern.
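A minimal sketch of that pattern, with a single thread, 50k-row batches, and one flush at the end instead of one per batch (random vectors stand in for the real embeddings, and the field layout follows the long_128 schema from this thread):

import numpy as np
from pymilvus import connections, Collection

connections.connect("default", host="172.16.75.171", port="19530")
collection = Collection("long_128")

BATCH = 50_000                   # 50k~100k rows per batch, single thread
TOTAL = 1_000_000                # however many rows need to be inserted

for start in range(0, TOTAL, BATCH):
    n = min(BATCH, TOTAL - start)
    jids = [f"jid_{start + i}" for i in range(n)]
    embeddings = np.random.random((n, 128)).tolist()
    collection.insert([jids, embeddings])   # one batch at a time, sequentially

collection.flush()               # flush once at the end, not after every batch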

yanliang567 avatar Mar 01 '23 09:03 yanliang567

I don't think that really makes sense. I suspect this is actually due to OOM. You can set a larger proxy memory limit and try.

xiaofan-luan avatar Mar 09 '23 04:03 xiaofan-luan

How do I set up more proxy memory?

ingale726 avatar Mar 09 '23 08:03 ingale726

How do I set up more proxy memory?

How did you deploy Milvus? Generally, you can set a larger resource request and limit in the Milvus YAML file. https://milvus.io/docs/v2.3.0-beta/install_cluster-milvusoperator.md
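If the cluster was deployed with Milvus Operator, a hedged sketch of what raising the proxy memory could look like (field names assume the v1beta1 Milvus CRD; adjust the values to your hardware and see the linked docs for other deployment methods):

apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: my-release
spec:
  components:
    proxy:
      resources:
        requests:
          memory: 8Gi      # request more memory for the proxy
        limits:
          memory: 16Gi     # and raise the limit accordingly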

yanliang567 avatar Mar 22 '23 01:03 yanliang567

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Apr 21 '23 06:04 stale[bot]