
[Bug]: Milvus 2.1 crashed while inserting data

Open owlwang opened this issue 3 years ago • 15 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.1
- Deployment mode(standalone or cluster): standalone
- SDK version(e.g. pymilvus v2.0.0rc2): 2.1
- OS(Ubuntu or CentOS): ubuntu docker
- CPU/Memory: 120g
- GPU: None
- Others:

Current Behavior

Milvus crashed while inserting data; the error log is attached. Only the Milvus container exited, etcd and MinIO are fine.

The logs indicate a disconnection from etcd, but etcd is working fine:

[2022/08/04 07:22:57.289 +00:00] [ERROR] [indexnode/indexnode.go:135] ["Index Node disconnected from etcd, process will exit"] ["Server Id"=5] [stack="github.com/milvus-io/milvus/internal/indexnode.(*IndexNode).Register.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexnode/indexnode.go:135"]

[2022/08/04 07:22:57.289 +00:00] [ERROR] [datanode/data_node.go:182] ["Data Node disconnected from etcd, process will exit"] ["Server Id"=7] [stack="github.com/milvus-io/milvus/internal/datanode.(*DataNode).Register.func1\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/data_node.go:182"]

[2022/08/04 07:22:57.289 +00:00] [ERROR] [indexcoord/index_coord.go:129] ["Index Coord disconnected from etcd, process will exit"] ["Server Id"=4] [stack="github.com/milvus-io/milvus/internal/indexcoord.(*IndexCoord).Register.func1\n\t/go/src/github.com/milvus-io/milvus/internal/indexcoord/index_coord.go:129"]

Expected Behavior

No response

Steps To Reproduce

Just insert data

We have reproduced this situation twice so far.

Our insertion method is to insert 100,000 128-dimensional vectors with custom IDs at a time.

It crashes after several successful insertions.
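
A minimal sketch of that insert pattern looks like this (the endpoint and collection name are placeholders rather than our real values, and the collection is assumed to already exist):

import numpy as np
from pymilvus import connections, Collection

# Insert several consecutive batches of 100,000 rows: caller-supplied INT64 ids
# plus 128-dimensional float vectors, passed column-based to insert().
connections.connect(host="localhost", port="19530")  # placeholder endpoint
collection = Collection("demo_collection")           # placeholder name, must already exist

BATCH, DIM = 100_000, 128

for batch_no in range(10):
    ids = list(range(batch_no * BATCH, (batch_no + 1) * BATCH))
    vectors = np.random.random((BATCH, DIM)).tolist()
    collection.insert([ids, vectors])
    print(f"inserted batch {batch_no}")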


Milvus Log

log.log

Anything else?

No response

owlwang avatar Aug 04 '22 07:08 owlwang

@owlwang thank you for the issue. One quick question: how did you deploy Milvus, and how much CPU and memory did you request for the Milvus pod (container)? /assign @owlwang /unassign

yanliang567 avatar Aug 04 '22 08:08 yanliang567

official document default settings

https://milvus.io/docs/v2.1.x/install_standalone-docker.md

owlwang avatar Aug 04 '22 08:08 owlwang

Did you limit CPU or memory resources for the Milvus container?

yanliang567 avatar Aug 04 '22 10:08 yanliang567

No

owlwang avatar Aug 05 '22 02:08 owlwang

/assign @soothing-rain /unassign

yanliang567 avatar Aug 05 '22 02:08 yanliang567

@owlwang can you provide full logs?

soothing-rain avatar Aug 05 '22 06:08 soothing-rain

@soothing-rain The test environment has been deleted. I'll try to reproduce it now, and if it works I'll post the logs.

owlwang avatar Aug 05 '22 07:08 owlwang

Hi @owlwang, can you describe the schema of the collection you are inserting the data into, including the collection's shard number?

ThreadDao avatar Aug 05 '22 07:08 ThreadDao

Hi @ThreadDao

We are using the doc's standalone config file https://milvus.io/docs/v2.1.x/install_standalone-docker.md. There are no resource restrictions or modifications for Docker. The collection should have only one shard.
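
For reference, a collection matching that description (the two-field schema also used below, a single shard) could be declared roughly like this; the endpoint and collection name are placeholders, not our actual values:

from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="localhost", port="19530")  # default standalone endpoint

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=128),
]
schema = CollectionSchema(fields, description="128-dim embeddings with custom ids")

# shards_num=1 reflects the single-shard setup described above.
milvus_collection = Collection(name="demo_collection", schema=schema, shards_num=1)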

Here is the code for insertion

from pymilvus import FieldSchema, DataType

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=128)
]

class InsertQ:
    """Buffers (id, vector) pairs and inserts them in batches of 100,000."""

    def __init__(self):
        self.ids_buf = []
        self.vecs_buf = []
        self.buffer_size = 100000

    def add(self, m_id, m_vec):
        if m_id and m_vec:
            self.ids_buf.append(m_id)
            self.vecs_buf.append(m_vec)
        if len(self.ids_buf) >= self.buffer_size:  # buffer full, write out the batch
            self.flush()

    def flush(self):
        print(f'committing to db {len(self.ids_buf)}')

        entities = [
            self.ids_buf,   # custom INT64 primary keys
            self.vecs_buf,  # 128-dim float vectors (column-based insert expects lists)
        ]

        # milvus_partition is created elsewhere via milvus_collection.create_partition()
        result = milvus_partition.insert(entities)
        print(result)

        self.ids_buf = []
        self.vecs_buf = []
        print('finished commit')

    def cleanup(self):
        if len(self.ids_buf):
            self.flush()
        print('cleanup')
owlwang avatar Aug 05 '22 07:08 owlwang

Just now it crashed again.

Full log: milvus_issue_18530_dockerlog.log.tar.gz

python client message:

[has_partition] retry:4, cost: 0.27s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, failed to connect to all addresses>
[has_partition] retry:5, cost: 0.81s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, failed to connect to all addresses>
[has_partition] retry:6, cost: 2.43s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, failed to connect to all addresses>
RPC error: [has_partition], <MilvusException: (code=1, message=proxy not healthy)>, <Time:{'RPC start': '2022-08-05 15:56:16.677351', 'RPC error': '2022-08-05 15:59:17.724814'}>
Traceback (most recent call last):
  File "insert_21.py", line 211, in <module>
    insert_lst(sys.argv[1])
  File "insert_21.py", line 173, in insert_lst
    insert_single_feats_file(np_filename)
  File "insert_21.py", line 164, in insert_single_feats_file
    insert_queue.add(m_id, vec)
  File "insert_21.py", line 76, in add
    self.flush()
  File "insert_21.py", line 86, in flush
    result = milvus_partition.insert(entities)
  File "/media/data2/fernando/milvus_21_tools/pymilvus/orm/partition.py", line 280, in insert
    if conn.has_partition(self._collection.name, self._name) is False:
  File "/media/data2/fernando/milvus_21_tools/pymilvus/decorators.py", line 96, in handler
    raise e
  File "/media/data2/fernando/milvus_21_tools/pymilvus/decorators.py", line 92, in handler
    return func(*args, **kwargs)
  File "/media/data2/fernando/milvus_21_tools/pymilvus/decorators.py", line 74, in handler
    raise e
  File "/media/data2/fernando/milvus_21_tools/pymilvus/decorators.py", line 48, in handler
    return func(self, *args, **kwargs)
  File "/media/data2/fernando/milvus_21_tools/pymilvus/client/grpc_handler.py", line 274, in has_partition
    raise MilvusException(status.error_code, status.reason)
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=proxy not healthy)>

owlwang avatar Aug 05 '22 08:08 owlwang

Note: here I create a partition and use milvus_partition to insert the data:

milvus_partition = milvus_collection.create_partition(partition_name, f"{partition_name} data")
milvus_partition.insert(entities)
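
A sketch of the same step with an existence check (milvus_collection, partition_name, and entities are the objects from the snippets above), so re-running the loader does not trip over an already existing partition:

from pymilvus import Partition

if not milvus_collection.has_partition(partition_name):
    milvus_partition = milvus_collection.create_partition(partition_name, f"{partition_name} data")
else:
    # Attach to the existing partition instead of re-creating it.
    milvus_partition = Partition(milvus_collection, partition_name)

milvus_partition.insert(entities)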

owlwang avatar Aug 05 '22 10:08 owlwang

It looks like you have increased the segment size to 16 GB, based on your post in this discussion: https://github.com/milvus-io/milvus/discussions/18512

So it appears that a huge compaction happened that took over 3 minutes:

{"log":"[2022/08/05 07:56:14.767 +00:00] [DEBUG] [datanode/compactor.go:350] [\"compaction start\"] [planID=435076772581343233] [\"timeout in seconds\"=180]\n","stream":"stdout","time":"2022-08-05T07:56:14.767188467Z"}
"log":"[2022/08/05 07:59:47.186 +00:00] [INFO] [datanode/compaction_executor.go:107] [\"end to execute compaction\"] [planID=435076772581343233]

As a result, the whole Milvus standalone process was stuck -> Milvus could not renew its etcd lease -> the whole Milvus process started to shut down.
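
For context on the mechanism: each Milvus component registers a session key in etcd under a lease that has to be refreshed periodically; if the process is stuck (here, on a long compaction) or etcd responds too slowly, the lease expires and the component shuts itself down. Milvus implements this in Go; the python-etcd3 sketch below only illustrates the general keep-alive pattern, and the key name and TTL are made up for illustration:

import time
import etcd3  # pip install etcd3 -- illustration only; Milvus itself uses the Go clientv3

client = etcd3.client(host="localhost", port=2379)
lease = client.lease(ttl=10)                             # session lease with a 10 s TTL
client.put("/session/datanode-7", "alive", lease=lease)  # hypothetical session key

while True:
    # Keep-alive: if this loop is starved or etcd is too slow, the lease
    # expires, the session key disappears, and the component must exit.
    lease.refresh()
    time.sleep(3)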

I'd suggest that you try the distributed version of Milvus with multiple DataNode processes.

In the future, you can use our bulk-load tool to insert data into Milvus, which can eliminate these kinds of issues.
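
A rough sketch of that path, as documented for Milvus 2.1, is below; the helper was renamed in later releases (bulk_insert / do_bulk_insert), and the collection and file names are placeholders, so check the docs for the pymilvus version you actually run:

from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")  # placeholder endpoint

# Assumes a row-based JSON file already uploaded to the object storage bucket
# that Milvus uses; both names below are illustrative only.
task_ids = utility.bulk_load(
    collection_name="demo_collection",
    is_row_based=True,
    files=["row_based_batch_1.json"],
)
print(task_ids)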

soothing-rain avatar Aug 06 '22 16:08 soothing-rain

Thank you for your efforts and response. It looks like you have located the problem. By the way, this also happens with the default parameters, i.e. maxsize = 512M, as well as with maxsize = 4096. So it looks like the standalone version has a risk of blocking and crashing when data is inserted in large batches. I think we can add this tip to the documentation. Finally, thank you again for your efforts, and have a nice weekend.

owlwang avatar Aug 06 '22 17:08 owlwang


Thanks, this should not happen tho :) Do you happen to have any monitoring set up (CPU/memory usage, etc.) for your Milvus service? And may I ask how much memory you have on your machine?

soothing-rain avatar Aug 08 '22 09:08 soothing-rain

No, we don't do monitoring for this. We have 128 GB of memory.

owlwang avatar Aug 08 '22 10:08 owlwang

Per offline discussion, it is mainly caused by Etcd operations being slow.

We usually suggest that users build/use their own dedicated etcd cluster (with SSD) to reach the best performance. Meanwhile, we are doing some optimizations in the next couple of versions to reduce the load on etcd.
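
A quick way to sanity-check etcd write latency from the Milvus host is a short burst of puts (a python-etcd3 illustration; the endpoint is a placeholder, and in this issue the logs showed a single save taking 47 s):

import time
import etcd3  # pip install etcd3

client = etcd3.client(host="localhost", port=2379)

# Time a burst of small writes; consistently slow puts point at the disk
# backing etcd rather than at Milvus itself.
start = time.time()
for i in range(100):
    client.put(f"/bench/key-{i}", "x" * 100)
elapsed = time.time() - start
print(f"100 puts took {elapsed:.3f}s ({elapsed * 10:.1f} ms per put)")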

soothing-rain avatar Aug 17 '22 09:08 soothing-rain

/unassign /assign @yanliang567

soothing-rain avatar Aug 17 '22 09:08 soothing-rain

@owlwang could you please retry with an etcd cluster on SSD as suggested above? /assign @owlwang /unassign

yanliang567 avatar Aug 17 '22 11:08 yanliang567

Thank you very much for your hard work in making such a good product.

But, sorry, there are no more tests planned.

As your colleague said, etcd is blocking and causing the whole system to crash: ["Slow etcd operation save"] ["time spent"=47.075233364s]

For us it would be very wasteful to have a separate machine dedicated to etcd; we don't want to spend that much hardware and maintenance effort on a standalone system.

We don't know why a single thread inserting data on a 40-core, 128 GB memory, SSD machine can crash Milvus standalone and block etcd.

We've been in touch with your colleagues and are waiting for a beta version of your cloud service.

owlwang avatar Aug 18 '22 12:08 owlwang

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Sep 17 '22 14:09 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Oct 31 '22 02:10 stale[bot]