
[Bug]: inserting 1000 documents, milvus crashes and becomes unavailable.

Open Gy1900 opened this issue 9 months ago • 9 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.3.3
- Deployment mode(standalone or cluster): cluster-k8s
- MQ type(rocksmq, pulsar or kafka):    pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.2.5
- OS(Ubuntu or CentOS):  kylin v10+x86
- CPU/Memory: master 8c8g*3, node 8c12g*3
- GPU: no
- Others:

Current Behavior

During the stress test, the streaming insertion of the first 1000 vectorized PDFs into collection_1 succeeded. However, while inserting the second batch of 1000 documents, Milvus crashed and became unavailable, and the data preview in Attu timed out. I can still check whether other collections exist, but inserting into collection_1 returns an error.

error info:

    Traceback (most recent call last):
      File "python3.9/site-packages/pymilvus/decorators.py", line 50, in handler
        return func(self, *args, **kwargs)
      File "python3.9/site-packages/pymilvus/client/grpc_handler.py", line 399, in batch_insert
        raise err
      File "python3.9/site-packages/pymilvus/client/grpc_handler.py", line 389, in batch_insert
        response = rf.result()
      File "python3.9/site-packages/grpc/_channel.py", line 797, in result
        raise self
    grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.DEADLINE_EXCEEDED
        details = "Deadline Exceeded"
        debug_error_string = "UNKNOWN:Deadline Exceeded {grpc_status:4}"

The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "milvus_store.py", line 259, in _add_documents
        res = self.col.insert(
      File "python3.9/site-packages/pymilvus/orm/collection.py", line 430, in insert
        res = conn.batch_insert(self._name, entities, partition_name,
      File "python3.9/site-packages/pymilvus/decorators.py", line 109, in handler
        raise e
      File "python3.9/site-packages/pymilvus/decorators.py", line 105, in handler
        return func(*args, **kwargs)
      File "python3.9/site-packages/pymilvus/decorators.py", line 136, in handler
        ret = func(self, *args, **kwargs)
      File "python3.9/site-packages/pymilvus/decorators.py", line 64, in handler
        raise MilvusException(message=f"rpc deadline exceeded: {timeout_msg}") from e
    pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=rpc deadline exceeded: Retry timeout: 30s)>

code:

    for xxxx in documents:
        self.col.insert()
        self.col.flush()
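For reference, one common way to stay under the 30s retry deadline seen in the traceback is to insert in batches and flush once at the end rather than calling flush() after every document. The snippet below is only a minimal sketch, not the reporter's actual code: the connection details, column layout, batch size, timeouts, and the helper name insert_in_batches are all assumptions to adapt.

    from pymilvus import connections, Collection

    # Placeholder connection details for this sketch.
    connections.connect(host="127.0.0.1", port="19530")
    col = Collection("collection_1")

    BATCH_SIZE = 100  # illustrative; tune to row size and vector dimension

    def insert_in_batches(entities, batch_size=BATCH_SIZE):
        # `entities` is assumed to be column-based data matching the schema,
        # e.g. [[pk, ...], [text, ...], [embedding, ...]].
        num_rows = len(entities[0])
        for start in range(0, num_rows, batch_size):
            batch = [field[start:start + batch_size] for field in entities]
            # Explicit per-call timeout, longer than the 30s retry window
            # reported in the traceback above.
            col.insert(batch, timeout=120)
        # Flush once after all batches instead of once per document.
        col.flush(timeout=120)

Calling flush() after every single insert tends to create many small segments and extra load on the message queue; Milvus also flushes automatically in the background, so one explicit flush at the end of a run is usually enough.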

Expected Behavior

How much data should my configured cluster deployment (3 masters and 3 nodes) be able to hold? If capacity is insufficient, I would like the error message to say so more directly. At the very least, it should be possible to insert 3000 documents.

k8s-pods

Steps To Reproduce

No response

Milvus Log

pod_log.zip

Anything else?

No response

Gy1900 avatar May 09 '24 07:05 Gy1900

How should this problem be solved? I see this in the log (the bookie hit its 64 MiB direct-memory limit while trying to allocate another 16 MiB):

    [bookkeeper-io-3-13] ERROR org.apache.bookkeeper.common.allocator.impl.ByteBufAllocatorImpl - Unable to allocate memory
    io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 67108864, max: 67108864)

Gy1900 avatar May 09 '24 07:05 Gy1900

After milvus crashes, restart all nodes and it can be used normally.

Gy1900 avatar May 09 '24 07:05 Gy1900

/assign @congqixia /unassign

yanliang567 avatar May 10 '24 01:05 yanliang567

Could you tell me roughly where the problem is? This is a production system and I can't wait much longer, thank you. @yanliang567

Gy1900 avatar May 15 '24 07:05 Gy1900

@Gy1900 from the log you provided, we could not find why Milvus crashed. It looks like some input streams could not receive any messages. This could be a known issue with the Pulsar client: after Pulsar goes into read-only mode because it ran out of disk space, the Milvus pods must be restarted.

congqixia avatar May 15 '24 08:05 congqixia

So what should I do about Pulsar? Switch to rocksmq, or is there a simpler way? Can Pulsar running out of disk space cause Milvus to crash?

Gy1900 avatar May 15 '24 09:05 Gy1900

@congqixia

Gy1900 avatar May 15 '24 09:05 Gy1900

@Gy1900 Could you double-check the bookie memory settings? It's recommended to set the heap to 4G and direct memory to 8G in the pulsar-bookie configmap, like this:

  PULSAR_MEM: |
    -Xms4096m -Xmx4096m -XX:MaxDirectMemorySize=8192m

You can change the configmap then restart the pulsar bookie pods one by one.

LoveEachDay avatar May 15 '24 09:05 LoveEachDay

Good idea, I will try it, thanks.

Gy1900 avatar May 15 '24 09:05 Gy1900

I'll close this issue; please feel free to file a new one.

yanliang567 avatar Jun 26 '24 07:06 yanliang567