[Bug]: Milvus 2.1 crashed while inserting data
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: 2.1
- Deployment mode(standalone or cluster): standalone
- SDK version(e.g. pymilvus v2.0.0rc2): 2.1
- OS(Ubuntu or CentOS): ubuntu docker
- CPU/Memory: 120g
- GPU: None
- Others:
Current Behavior
Milvus crashed while inserting data. The error log is attached. Only the Milvus container exited; etcd and MinIO are fine.
The logs indicate a disconnection from etcd, but etcd is working fine.
[2022/08/04 07:22:57.289 +00:00] [ERROR] [indexnode/indexnode.go:135] ["Index Node disconnected from etcd, process will exit"] ["Server Id"=5] [stack="github.com/milvus-io/milvus/internal/indexnode.(*IndexNode).Register.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/indexnode.go:135"]
[2022/08/04 07:22:57.289 +00:00] [ERROR] [datanode/data_node.go:182] ["Data Node disconnected from etcd, process will exit"] ["Server Id"=7] [stack="github.com/milvus-io/milvus/internal/datanode.(*DataNode).Register.func1\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/data_node.go:182"]
[2022/08/04 07:22:57.289 +00:00] [ERROR] [indexcoord/index_coord.go:129] ["Index Coord disconnected from etcd, process will exit"] ["Server Id"=4] [stack="github.com/milvus-io/milvus/internal/indexcoord.(*IndexCoord).Register.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexcoord/index_coord.go:129"]
Expected Behavior
No response
Steps To Reproduce
Just insert data
We have reproduced this situation twice so far.
Our insertion method is to insert 100,000 128-dimensional vectors with custom IDs at a time.
It crashes after several successful insertions.
Milvus Log
Anything else?
No response
@owlwang thank you for the issue. One quick question: how did you deploy Milvus, and how much CPU and memory did you request for the Milvus pod (container)? /assign @owlwang /unassign
Official document default settings:
https://milvus.io/docs/v2.1.x/install_standalone-docker.md
Did you limit CPU or memory resources for the Milvus container?
No
/assign @soothing-rain /unassign
@owlwang can you provide full logs?
@soothing-rain The test environment has been deleted. I'll try to reproduce it now, and if it works I'll post the logs.
Hi @owlwang, can you describe the schema of the collection you are inserting the data into, including the collection shard number?
Hi @ThreadDao
We are using the doc's standalone config file: https://milvus.io/docs/v2.1.x/install_standalone-docker.md. There are no resource restrictions or modifications for Docker. There should be only one shard.
Here is the code for insertion
from pymilvus import FieldSchema, DataType

# Collection schema: custom INT64 primary keys plus 128-dimensional float vectors.
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=128),
]


class InsertQ:
    """Buffers (id, vector) pairs and inserts them in batches of 100,000."""

    def __init__(self):
        self.ids_buf = []
        self.vecs_buf = []
        self.buffer_size = 100000

    def add(self, m_id, m_vec):
        if m_id and m_vec:
            self.ids_buf.append(m_id)
            self.vecs_buf.append(m_vec)
        if len(self.ids_buf) >= self.buffer_size:  # flush once the buffer is full
            self.flush()

    def flush(self):
        print(f'committing to db {len(self.ids_buf)}')
        entities = [
            self.ids_buf,   # primary key field
            self.vecs_buf,  # vector field (list of 128-dim lists)
        ]
        # milvus_partition is created elsewhere (see the note further down the thread).
        result = milvus_partition.insert(entities)
        print(result)
        self.ids_buf = []
        self.vecs_buf = []
        print('finished commit')

    def cleanup(self):
        if len(self.ids_buf):
            self.flush()
        print('cleanup')
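For context, here is a minimal hedged sketch of the setup the snippet above assumes: connection, schema, collection, and the milvus_partition object it inserts into. The collection and partition names ("demo_collection", "part_a") are placeholders, not from the report; the reporter's actual create_partition call appears later in the thread.

from pymilvus import connections, CollectionSchema, Collection

# Connect to the standalone instance started from the official docker-compose file.
connections.connect("default", host="localhost", port="19530")

# Reuse the `fields` list defined above; one shard, as described earlier in the thread.
schema = CollectionSchema(fields, description="insert crash repro")
milvus_collection = Collection("demo_collection", schema=schema, shards_num=1)

# The snippet above inserts into a partition object; names here are placeholders.
milvus_partition = milvus_collection.create_partition("part_a", "part_a data")

# Drive the buffered inserts: 100,000 rows per flush, as in the report.
insert_queue = InsertQ()
for i in range(1, 500_001):
    insert_queue.add(i, [0.0] * 128)  # placeholder vectors; real embeddings in practice
insert_queue.cleanup()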
It just crashed again.
Full log: milvus_issue_18530_dockerlog.log.tar.gz
Python client message:
[has_partition] retry:4, cost: 0.27s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, failed to connect to all addresses>
[has_partition] retry:5, cost: 0.81s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, failed to connect to all addresses>
[has_partition] retry:6, cost: 2.43s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, failed to connect to all addresses>
RPC error: [has_partition], <MilvusException: (code=1, message=proxy not healthy)>, <Time:{'RPC start': '2022-08-05 15:56:16.677351', 'RPC error': '2022-08-05 15:59:17.724814'}>
Traceback (most recent call last):
File "insert_21.py", line 211, in <module>
insert_lst(sys.argv[1])
File "insert_21.py", line 173, in insert_lst
insert_single_feats_file(np_filename)
File "insert_21.py", line 164, in insert_single_feats_file
insert_queue.add(m_id, vec)
File "insert_21.py", line 76, in add
self.flush()
File "insert_21.py", line 86, in flush
result = milvus_partition.insert(entities)
File "/media/data2/fernando/milvus_21_tools/pymilvus/orm/partition.py", line 280, in insert
if conn.has_partition(self._collection.name, self._name) is False:
File "/media/data2/fernando/milvus_21_tools/pymilvus/decorators.py", line 96, in handler
raise e
File "/media/data2/fernando/milvus_21_tools/pymilvus/decorators.py", line 92, in handler
return func(*args, **kwargs)
File "/media/data2/fernando/milvus_21_tools/pymilvus/decorators.py", line 74, in handler
raise e
File "/media/data2/fernando/milvus_21_tools/pymilvus/decorators.py", line 48, in handler
return func(self, *args, **kwargs)
File "/media/data2/fernando/milvus_21_tools/pymilvus/client/grpc_handler.py", line 274, in has_partition
raise MilvusException(status.error_code, status.reason)
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=proxy not healthy)>
Notice: Here I create a partition and use milvus_partition to insert data
milvus_partition = milvus_collection.create_partition(partition_name, f"{partition_name} data")
milvus_partition.insert(entities)
It looks like you have increased the segment size to 16 GB, based on your post in this discussion: https://github.com/milvus-io/milvus/discussions/18512
So it appears that a huge compaction happened that took over 3 minutes.
{"log":"[2022/08/05 07:56:14.767 +00:00] [DEBUG] [datanode/compactor.go:350] [\"compaction start\"] [planID=435076772581343233] [\"timeout in seconds\"=180]\n","stream":"stdout","time":"2022-08-05T07:56:14.767188467Z"}
"log":"[2022/08/05 07:59:47.186 +00:00] [INFO] [datanode/compaction_executor.go:107] [\"end to execute compaction\"] [planID=435076772581343233]
As a result, the whole Milvus standalone process was stuck -> Milvus could not renew its etcd lease -> the whole Milvus process started to shut down.
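As a side note, here is a hedged sketch of how a manual compaction can be triggered and watched from the pymilvus client; the collection name is assumed from the earlier sketch, and this only observes compaction state, so it would not have prevented the stall described above.

from pymilvus import connections, Collection

connections.connect("default", host="localhost", port="19530")
collection = Collection("demo_collection")  # assumed name from the earlier sketch

# Trigger a manual compaction and watch its progress.
collection.compact()
print(collection.get_compaction_state())   # current state of the compaction
collection.wait_for_compaction_completed()
print(collection.get_compaction_plans())   # plans involved in the compaction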
I'd suggest that you try the distributed version of Milvus with multiple DataNode processes.
In the future, you can use our bulkload tool to insert data into Milvus, which can eliminate these kinds of issues.
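For illustration, here is a hedged sketch of a bulk import from the client side, assuming the utility.do_bulk_insert API from newer pymilvus releases (2.2+); the bulk-load entry point in the 2.1 line differs, so treat this as the shape of the workflow rather than an exact call. Collection, partition, and file names are placeholders.

from pymilvus import connections, utility

connections.connect("default", host="localhost", port="19530")

# The data file must already be uploaded to the MinIO/S3 bucket that Milvus uses.
task_id = utility.do_bulk_insert(
    collection_name="demo_collection",   # assumed name from the earlier sketch
    partition_name="part_a",             # assumed partition name
    files=["row_based_data.json"],       # object path inside the bucket
)

# Poll the import task until it finishes.
print(utility.get_bulk_insert_state(task_id))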
Thank you for your efforts and response. It looks like you have located the problem. By the way, this also happens with the default parameters, i.e. maxSize = 512 MB, and with maxSize = 4096. So it looks like the standalone version risks blocking and crashing when data is inserted in large batches. I think we could add this tip to the documentation. Finally, thank you again for your efforts, and have a nice weekend.
Thanks, this should not happen tho :) Do you happen to have any monitoring set up (CPU/memory usage, etc.) for your Milvus service? And may I ask how much memory you have on your machine?
No, we don't do monitoring for this; we have 128 GB of memory.
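As an aside, here is a minimal hedged sketch of a quick host-level resource logger that could run next to the standalone container while inserting; psutil is an assumption of this sketch, not something used in the thread, and Milvus also exposes Prometheus metrics for proper monitoring.

# pip install psutil
import time
import psutil

# Log overall host CPU and memory every 10 seconds while the insert job runs.
while True:
    cpu = psutil.cpu_percent(interval=None)
    mem = psutil.virtual_memory()
    print(f"cpu={cpu:.1f}% mem_used={mem.used / 2**30:.1f}GiB ({mem.percent:.1f}%)")
    time.sleep(10)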
Per offline discussion, it is mainly caused by Etcd operations being slow.
We usually suggest that users build/use their own dedicated etcd cluster (with SSD) to reach the best performance. Meanwhile, we are doing some optimizations over the next couple of versions to reduce the load on etcd.
/unassign /assign @yanliang567
@owlwang could you please retry with an etcd cluster on SSD as suggested above? /assign @owlwang /unassign
Thank you very much for your hard work in making such a good product.
But sorry, we have no more tests planned.
As your colleague said, etcd is blocking and causing the whole system to crash: ["Slow etcd operation save"] ["time spent"=47.075233364s]
For us it would be very wasteful to dedicate a separate machine to etcd; we don't want to spend that much hardware and maintenance effort on a standalone system.
We don't know why a single thread inserting data on a machine with 40 cores, 128 GB of memory, and an SSD can crash Milvus standalone and block etcd.
We've been in touch with your colleagues and are waiting for a beta version of your cloud service.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.