[Bug]: datanode memory usage increased to 150GB when there are 50m vectors to be flushed
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: master-20230805-241117dd
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
I scaled the datanode down to 0 and inserted 50m_768d vectors. Then I scaled the datanode back up to 1, and its memory usage increased to 150GB within 15 minutes.
Expected Behavior
the baseline on master-20230802-df26b909: datanode memory usage is about 1.3-2.3GB for the same volume of vectors.
Steps To Reproduce
1. create a collection with 20k_768d vectors and build an HNSW index
2. scale down the datanode to 0
3. insert 50m_768d vectors
4. scale up the datanode back to 1
5. wait and check the tt lag, datanode CPU, and memory usage
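For reference, a minimal pymilvus sketch of steps 1 and 3 (the collection/field names, batch size, and HNSW parameters are illustrative assumptions; steps 2 and 4 are done at the deployment level by changing the datanode replica count, not through the SDK):

```python
# Hedged sketch of repro steps 1 and 3; names and parameters here are
# illustrative, not taken from the original test setup.
import random
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

DIM = 768
connections.connect(host="127.0.0.1", port="19530")

schema = CollectionSchema(
    [
        FieldSchema("id", DataType.INT64, is_primary=True),
        FieldSchema("vec", DataType.FLOAT_VECTOR, dim=DIM),
    ],
    auto_id=True,
)
coll = Collection("ttlag_repro", schema)

def random_vectors(n):
    return [[random.random() for _ in range(DIM)] for _ in range(n)]

# step 1: seed the collection with 20k_768d vectors and build an HNSW index
coll.insert([random_vectors(20_000)])
coll.flush()
coll.create_index(
    "vec",
    {"index_type": "HNSW", "metric_type": "L2",
     "params": {"M": 8, "efConstruction": 200}},
)

# step 2: scale the datanode deployment down to 0 (outside the SDK)

# step 3: insert ~50M 768d vectors; with no datanode running, nothing is flushed
BATCH = 10_000
for _ in range(50_000_000 // BATCH):
    coll.insert([random_vectors(BATCH)])

# step 4: scale the datanode back up to 1, then watch tt lag, CPU and memory
```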
Milvus Log
pod names on devops:
yanliang-ttlag-milvus-datanode-cbf79cbdc-bx4h6 1/1 Running 0 34m 10.102.7.245 devops-node11 <none> <none>
yanliang-ttlag-milvus-indexnode-6699c566d7-9l49n 1/1 Running 2 (2m16s ago) 6h54m 10.102.7.231 devops-node11 <none> <none>
yanliang-ttlag-milvus-mixcoord-987654d85-pfzg2 1/1 Running 0 6h54m 10.102.7.238 devops-node11 <none> <none>
yanliang-ttlag-milvus-proxy-df7b5955f-5twjd 1/1 Running 0 6h54m 10.102.7.239 devops-node11 <none> <none>
yanliang-ttlag-milvus-querynode-76cf9c9b55-rcwx9 1/1 Running 0 6h54m 10.102.7.232 devops-node11 <none> <none>
Anything else?
the suspected PR: #26144
/assign @congqixia /unassign
from the pprof, there are lots of msg packs buffered in memory
there are some channels whose buffers are too large, which could cause this problem:
- MsgStream buffers (mq buffer & receive buffer): 1024*2
- Flowgraph node buffers (input node -> dd node -> insert buffer node): 1024*2
Under high read pressure, all of these channels can be full, which will lead to a memory cost on the order of 1024*48MB.
And the flush manager buffers flush tasks as well, which multiplies this memory cost.
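As a rough sanity check on these numbers, here is a back-of-the-envelope sketch; the ~48MB-per-msg-pack size follows the 1024*48MB estimate above, while the saturation assumption is purely illustrative:

```python
# Back-of-the-envelope estimate of the buffered msg-pack memory described above.
# The 48MB-per-msg-pack size follows the 1024*48MB estimate in the comment;
# how many buffers are saturated at once is an assumption for illustration only.
BUFFER_CAPACITY = 1024      # slots in one buffered channel
MSG_PACK_MB = 48            # assumed in-memory size of one buffered msg pack

per_buffer_gb = BUFFER_CAPACITY * MSG_PACK_MB / 1024
print(f"one saturated buffer : ~{per_buffer_gb:.0f} GB")          # ~48 GB

# The MsgStream buffers (mq buffer & receive buffer, 1024*2) plus the flowgraph
# node buffers (1024*2) give four such channels; with all of them full, and with
# pending flush tasks holding extra copies on top, the ~150GB observed in this
# report is plausible.
saturated_buffers = 4
print(f"all buffers saturated: ~{per_buffer_gb * saturated_buffers:.0f} GB")  # ~192 GB worst case
```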
@yanliang567 after #26179 is merged, could you please verify with this parameter enlarged?
dataNode:
  dataSync:
    maxParallelSyncTaskNum: 2 # Maximum number of sync tasks executed in parallel in each flush manager
can we simplify the mqstream logic to make it easier to understand?
@congqixia any plans for fixing this issue in v2.3.2?
@yanliang567 nope, L0 delta and other datanode refinements will be implemented after 2.3.2
moving to 2.3.3
moving to 2.4 for L0 deletion
@yanliang567 now we shall verify whether this problem persists when the L0 segment is enabled /assign @yanliang567
will do once L0 segment is enabled. /unassign @congqixia
tested on 2.4-20240407-e3b65203-amd64:
datanode memory goes up to 38GB
and tt lag catches up from 5.3h to 200ms in about 60 minutes
I think the key might be to increase flush concurrency to make sure flush can catch up with the insertion rate
/assign @congqixia
@xiaofan-luan the scenario here is to verify the datanode behavior when datanode is down for a long time
@yanliang567 the last run did not limit the memory of the datanode. Memory usage went to around 40GB, so it may still be an issue here. Let's check what the behavior is when the datanode has a memory limit.
The catch-up time is about one hour for a 5-hour tt lag with ongoing insertion. Is this good enough for our system? @xiaofan-luan @yanliang567 @tedxu @jaime0815
- How long do we stop the cluster?
- Is there anything we can improve? What is the bottleneck?
we did not stop the cluster; we just scaled the datanode replicas down to 0, inserted for 6 hours (~50M_768d data), and then brought one datanode back up. @congqixia is working on a PR
on master-20240426-bed6363f, the tt lag catches up quickly, but the datanode uses memory without any limit; OOM occurred multiple times in an 8c32g datanode pod.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen