[Bug]: Concurrent inserts trigger datanode memory protection but do not trigger sync
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: bigsheeper-2303-memory-sync-debug-2eeefc3-20230328
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
- Deploy Milvus
- Default config enables `datanode.memory.forceSyncEnable`
- Datanode resources:
  Limits:   cpu: 2, memory: 6Gi
  Requests: cpu: 2, memory: 5Gi
- Create a collection and 500 new partitions
- Concurrently insert data into random partitions (a minimal repro sketch follows below)
- Datanode memory increases and triggers memory protection; inserts are denied:
deny to write, reason: memory quota exhausted, please allocate more resources, req: /milvus.proto.milvus.MilvusService/Insert)>
(api_request.py:48)
- But sync is not triggered
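For reference, a minimal sketch of the workload, written against the Go SDK (milvus-sdk-go v2). The actual run used the fouram pymilvus framework (see the api_request.py trace above); the endpoint, collection/partition names, and batch/worker counts below are made up for illustration.

```go
// Repro sketch: create a collection with many partitions, then hammer it with
// concurrent inserts into random partitions until the datanode's memory
// watermark kicks in. Assumes datanode.memory.forceSyncEnable is left enabled
// (the default in this build) and the datanode is limited to ~6Gi as above.
package main

import (
	"context"
	"fmt"
	"math/rand"
	"sync"

	"github.com/milvus-io/milvus-sdk-go/v2/client"
	"github.com/milvus-io/milvus-sdk-go/v2/entity"
)

const (
	collName      = "mem_protect_test" // hypothetical name
	numPartitions = 500
	dim           = 128
	batchSize     = 1000
)

func main() {
	ctx := context.Background()
	c, err := client.NewGrpcClient(ctx, "localhost:19530")
	if err != nil {
		panic(err)
	}
	defer c.Close()

	// 1. Create a collection and 500 partitions.
	schema := &entity.Schema{
		CollectionName: collName,
		Fields: []*entity.Field{
			{Name: "pk", DataType: entity.FieldTypeInt64, PrimaryKey: true, AutoID: true},
			{Name: "vec", DataType: entity.FieldTypeFloatVector, TypeParams: map[string]string{"dim": fmt.Sprint(dim)}},
		},
	}
	if err := c.CreateCollection(ctx, schema, 2); err != nil {
		panic(err)
	}
	for i := 0; i < numPartitions; i++ {
		if err := c.CreatePartition(ctx, collName, fmt.Sprintf("p_%d", i)); err != nil {
			panic(err)
		}
	}

	// 2. Concurrently insert into random partitions until Insert starts
	//    returning "memory quota exhausted".
	var wg sync.WaitGroup
	for w := 0; w < 20; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				vectors := make([][]float32, batchSize)
				for j := range vectors {
					vec := make([]float32, dim)
					for k := range vec {
						vec[k] = rand.Float32()
					}
					vectors[j] = vec
				}
				part := fmt.Sprintf("p_%d", rand.Intn(numPartitions))
				if _, err := c.Insert(ctx, collName, part, entity.NewColumnFloatVector("vec", dim, vectors)); err != nil {
					fmt.Println("insert failed:", err) // expected once memory protection triggers
				}
			}
		}()
	}
	wg.Wait()
}
```

With the 6Gi limit above, the expectation is that the force-sync path flushes segments before the memory quota is exhausted; in this run only the quota-based write denial kicked in and no sync happened.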
You can check the datanode logs in the fouram cluster, qa-milvus namespace:
fouramf-yihao2-op-38-1978-etcd-0 1/1 Running 0 32h 10.104.9.146 4am-node14 <none> <none>
fouramf-yihao2-op-38-1978-etcd-1 1/1 Running 0 32h 10.104.5.245 4am-node12 <none> <none>
fouramf-yihao2-op-38-1978-etcd-2 1/1 Running 0 32h 10.104.6.55 4am-node13 <none> <none>
fouramf-yihao2-op-38-1978-milvus-datacoord-7887849cbb-hhzdc 1/1 Running 0 32h 10.104.4.125 4am-node11 <none> <none>
fouramf-yihao2-op-38-1978-milvus-datanode-b54544f89-4knlc 1/1 Running 0 3h36m 10.104.9.215 4am-node14 <none> <none>
fouramf-yihao2-op-38-1978-milvus-indexcoord-688578f67b-5f6cf 1/1 Running 0 32h 10.104.5.253 4am-node12 <none> <none>
fouramf-yihao2-op-38-1978-milvus-indexnode-79df85b75-rtgvt 1/1 Running 0 32h 10.104.9.153 4am-node14 <none> <none>
fouramf-yihao2-op-38-1978-milvus-proxy-6f64df69c-pmmxh 1/1 Running 0 32h 10.104.5.254 4am-node12 <none> <none>
fouramf-yihao2-op-38-1978-milvus-querycoord-78bb4765c6-hczvm 1/1 Running 0 32h 10.104.1.83 4am-node10 <none> <none>
fouramf-yihao2-op-38-1978-milvus-querynode-55bccdbb59-cw8gd 1/1 Running 0 32h 10.104.4.126 4am-node11 <none> <none>
fouramf-yihao2-op-38-1978-milvus-rootcoord-7c6d58879d-jbgbs 1/1 Running 0 32h 10.104.5.2 4am-node12 <none> <none>
fouramf-yihao2-op-38-1978-minio-0 1/1 Running 0 32h 10.104.9.147 4am-node14 <none> <none>
fouramf-yihao2-op-38-1978-minio-1 1/1 Running 0 32h 10.104.5.248 4am-node12 <none> <none>
fouramf-yihao2-op-38-1978-minio-2 1/1 Running 0 32h 10.104.1.82 4am-node10 <none> <none>
fouramf-yihao2-op-38-1978-minio-3 1/1 Running 0 32h 10.104.4.115 4am-node11 <none> <none>
fouramf-yihao2-op-38-1978-pulsar-bookie-0 1/1 Running 0 32h 10.104.9.150 4am-node14 <none> <none>
fouramf-yihao2-op-38-1978-pulsar-bookie-1 1/1 Running 0 32h 10.104.5.252 4am-node12 <none> <none>
fouramf-yihao2-op-38-1978-pulsar-bookie-2 1/1 Running 0 32h 10.104.4.124 4am-node11 <none> <none>
fouramf-yihao2-op-38-1978-pulsar-bookie-init-kkrlh 0/1 Completed 0 32h 10.104.9.140 4am-node14 <none> <none>
fouramf-yihao2-op-38-1978-pulsar-broker-0 1/1 Running 0 32h 10.104.5.243 4am-node12 <none> <none>
fouramf-yihao2-op-38-1978-pulsar-proxy-0 1/1 Running 0 32h 10.104.5.244 4am-node12 <none> <none>
fouramf-yihao2-op-38-1978-pulsar-pulsar-init-q7gz9 0/1 Completed 0 32h 10.104.9.141 4am-node14 <none> <none>
fouramf-yihao2-op-38-1978-pulsar-recovery-0 1/1 Running 0 32h 10.104.9.142 4am-node14 <none> <none>
fouramf-yihao2-op-38-1978-pulsar-zookeeper-0 1/1 Running 0 32h 10.104.5.249 4am-node12 <none> <none>
fouramf-yihao2-op-38-1978-pulsar-zookeeper-1 1/1 Running 0 32h 10.104.6.57 4am-node13 <none> <none>
fouramf-yihao2-op-38-1978-pulsar-zookeeper-2 1/1 Running 0 32h 10.104.9.152 4am-node14 <none> <none>
Grafana metrics:
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
No response
Anything else?
No response
/assign @bigsheeper
flowgraph & time tick blocked:
This msgstream consumes most of the memory:
/unassign
RootCoord-produced tt latency is healthy, which means the problem is not with the network or Pulsar:
Proxy sent the same channel ts all the time:
Here are the root causes:
- Proxy's time tick depends on the DML tasks in its queue. It uses the minimum timestamp of the queued DML tasks as the time tick when notifying RootCoord to update the time tick, so the Proxy must pop DML tasks when they finish or fail (a simplified model of the resulting stall follows the permalink below).
- However, the Proxy failed to describe the collection in `popPChanStats` when popping the DML task:
[2023/04/07 11:06:08.370 +08:00] [WARN] [retry/retry.go:44] ["retry func failed"] ["retry time"=0] [error="rpc error: code = Canceled desc = context canceled"]
[2023/04/07 11:06:08.370 +08:00] [WARN] [grpcclient/client.go:318] ["ClientBase ReCall grpc first call get error"] [role=rootcoord] [error="err: rpc error: code = Canceled desc = context canceled\n, /home/sheep/workspace/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/home/sheep/workspace/milvus/internal/util/grpcclient/client.go:317 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/home/sheep/workspace/milvus/internal/distributed/rootcoord/client/client.go:233 github.com/milvus-io/milvus/internal/distributed/rootcoord/client.(*Client).DescribeCollection\n/home/sheep/workspace/milvus/internal/proxy/meta_cache.go:500 github.com/milvus-io/milvus/internal/proxy.(*MetaCache).describeCollection\n/home/sheep/workspace/milvus/internal/proxy/meta_cache.go:224 github.com/milvus-io/milvus/internal/proxy.(*MetaCache).GetCollectionID\n/home/sheep/workspace/milvus/internal/proxy/task_insert.go:86 github.com/milvus-io/milvus/internal/proxy.(*insertTask).getChannels\n/home/sheep/workspace/milvus/internal/proxy/task_scheduler.go:330 github.com/milvus-io/milvus/internal/proxy.(*dmTaskQueue).popPChanStats\n/home/sheep/workspace/milvus/internal/proxy/task_scheduler.go:271 github.com/milvus-io/milvus/internal/proxy.(*dmTaskQueue).PopActiveTask\n/home/sheep/workspace/milvus/internal/proxy/task_scheduler.go:477 github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask.func1\n/home/sheep/workspace/milvus/internal/proxy/task_scheduler.go:511 github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask\n"]
https://github.com/milvus-io/milvus/blob/9c463750330801057cc515a4b129eb6f40e9a1c9/internal/proxy/task_scheduler.go#L307-L334
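To make the stall easier to follow, here is a deliberately simplified Go model of the mechanism described above. It is not the real dmTaskQueue code (that is at the permalink); the type and method names are illustrative. The point is only that the channel tick reported to RootCoord is the minimum timestamp of unfinished DML tasks, so a task whose stats are never popped pins the tick forever, and downstream consumers (the datanode flowgraph) never see time advance to a sync point.

```go
// Simplified model of the time-tick stall (illustrative names, not the
// actual implementation linked above).
package main

import "fmt"

// pChanStats keeps the min/max timestamps of in-flight DML tasks on one
// physical channel; the min is what the Proxy reports as the channel's tick.
type pChanStats struct{ minTs, maxTs uint64 }

type dmQueue struct {
	stats map[string]pChanStats // pchannel -> stats of unfinished tasks
}

// addPChanStats is called when a DML task is enqueued.
func (q *dmQueue) addPChanStats(channel string, ts uint64) {
	s, ok := q.stats[channel]
	if !ok || ts < s.minTs {
		s.minTs = ts
	}
	if ts > s.maxTs {
		s.maxTs = ts
	}
	q.stats[channel] = s
}

// popPChanStats should run when the task finishes or fails. In the buggy
// path it needs the task's channels, which requires DescribeCollection;
// if that RPC fails, the stats entry is never removed.
func (q *dmQueue) popPChanStats(channel string, describeOK bool) error {
	if !describeOK {
		return fmt.Errorf("describe collection failed: context canceled")
	}
	delete(q.stats, channel)
	return nil
}

// channelTick is what gets reported to RootCoord: the min ts still in queue,
// or the current ts if the queue is empty for that channel.
func (q *dmQueue) channelTick(channel string, now uint64) uint64 {
	if s, ok := q.stats[channel]; ok {
		return s.minTs
	}
	return now
}

func main() {
	q := &dmQueue{stats: map[string]pChanStats{}}
	q.addPChanStats("by-dev-dml_0", 100)

	// DescribeCollection fails while popping -> stats never cleaned up.
	_ = q.popPChanStats("by-dev-dml_0", false)

	// The reported tick is stuck at 100 no matter how far "now" advances,
	// so the datanode flowgraph never sees time move and never syncs.
	for _, now := range []uint64{200, 300, 400} {
		fmt.Println("tick:", q.channelTick("by-dev-dml_0", now))
	}
}
```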
@ThreadDao has this been verified and fixed?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.