
[Bug]: datanode memory usage increased to 150GB when there are 50m vectors to be flush

Open · yanliang567 opened this issue 1 year ago • 17 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master-20230805-241117dd
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

I scaled the datanode down to 0 and inserted 50m_768d vectors. Then I scaled the datanode back up to 1, and its memory usage increased to 150GB within 15 minutes.

Expected Behavior

The baseline on master-20230802-df26b909: datanode memory usage is about 1.3-2.3GB for the same volume of vectors.

Steps To Reproduce

1. create a collection with 20k_768d vectors, build hnsw index
2. scale down the datanode to 0
3. insert 50m_768d vectors
4. scale up the datanode back to 1
5. wait and check the tt lag, datanode cpu and memory
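
Not part of the original report, but a minimal pymilvus sketch of steps 1 and 3 above, assuming pymilvus 2.x and an illustrative collection name; the datanode scaling in steps 2 and 4 happens outside the script (e.g. kubectl scale on the datanode deployment).

import numpy as np
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="localhost", port="19530")

DIM = 768
BATCH = 10_000

schema = CollectionSchema([
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="vec", dtype=DataType.FLOAT_VECTOR, dim=DIM),
])
coll = Collection("ttlag_repro", schema)

def insert_rows(total: int) -> None:
    # insert random 768d vectors in batches to stay under the gRPC size limit
    for _ in range(total // BATCH):
        coll.insert([np.random.random((BATCH, DIM)).tolist()])

# step 1: seed the collection with 20k_768d vectors and build an HNSW index
insert_rows(20_000)
coll.create_index("vec", {"index_type": "HNSW", "metric_type": "L2",
                          "params": {"M": 16, "efConstruction": 200}})

# steps 2 and 4 (scaling the datanode to 0 and back to 1) are done outside
# this script, e.g. with kubectl scale on the datanode deployment

# step 3: insert 50m_768d vectors while the datanode is scaled to 0
insert_rows(50_000_000)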

Milvus Log

pod names on devops:

yanliang-ttlag-milvus-datanode-cbf79cbdc-bx4h6                  1/1     Running       0               34m     10.102.7.245    devops-node11   <none>           <none>
yanliang-ttlag-milvus-indexnode-6699c566d7-9l49n                1/1     Running       2 (2m16s ago)   6h54m   10.102.7.231    devops-node11   <none>           <none>
yanliang-ttlag-milvus-mixcoord-987654d85-pfzg2                  1/1     Running       0               6h54m   10.102.7.238    devops-node11   <none>           <none>
yanliang-ttlag-milvus-proxy-df7b5955f-5twjd                     1/1     Running       0               6h54m   10.102.7.239    devops-node11   <none>           <none>
yanliang-ttlag-milvus-querynode-76cf9c9b55-rcwx9                1/1     Running       0               6h54m   10.102.7.232    devops-node11   <none>           <none>

Anything else?

The suspected PR: #26144

yanliang567 avatar Aug 07 '23 08:08 yanliang567

/assign @congqixia /unassign

yanliang567 avatar Aug 07 '23 08:08 yanliang567

From the pprof, there are lots of msg packs buffered in memory.

There are some channels whose buffers are too large, which could cause this problem:

  • MsgStream buffers(mq buffer & receive buffer) 1024*2
  • Flowgraph node buffer (input node -> dd node -> insert buffer node) 1024*2

Under high read pressure, all channels will be full, which will lead to a 1024*48MB memory cost.

And the flush manager will buffer flush tasks as well, which will multiply this memory cost.
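
As a rough back-of-envelope check of the estimate above (a Python sketch; the 48MB per-msg-pack size and the number of concurrently full buffers are assumptions, not measured values):

# Back-of-envelope sketch of the buffer sizing described above.
# The 48MB per-msg-pack size and the buffer count are assumptions.
MSG_PACK_MB = 48          # assumed average size of one buffered msg pack
CHANNEL_CAPACITY = 1024   # MsgStream / flowgraph node buffer capacity

per_buffer_gb = CHANNEL_CAPACITY * MSG_PACK_MB / 1024
print(f"one full 1024-slot buffer: ~{per_buffer_gb:.0f} GB")

# mq buffer, receive buffer and the flowgraph node buffers can all fill up,
# and flush tasks queued in the flush manager add on top of that
full_buffers = 3
print(f"{full_buffers} full buffers: ~{full_buffers * per_buffer_gb:.0f} GB")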

congqixia avatar Aug 07 '23 09:08 congqixia

@yanliang567 after #26179 is merged, could you please verify with this parameter enlarged:

dataNode:
  dataSync:
    maxParallelSyncTaskNum: 2 # Maximum number of sync tasks executed in parallel in each flush manager
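
Not the actual datanode code (the flush manager is written in Go), but a minimal Python sketch of what a maxParallelSyncTaskNum-style cap does: a semaphore bounds how many sync tasks run at once, so raising it lets buffered data drain faster at the cost of more concurrent flush work. All names here are illustrative.

# Illustrative only: how a maxParallelSyncTaskNum-style limit bounds the
# number of sync (flush) tasks in flight at any moment.
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_SYNC_TASK_NUM = 2
sync_slots = threading.Semaphore(MAX_PARALLEL_SYNC_TASK_NUM)

def flush_to_object_storage(segment_id: int) -> None:
    # placeholder for the real work of writing a segment to object storage
    print(f"flushing segment {segment_id}")

def sync_segment(segment_id: int) -> None:
    with sync_slots:              # at most MAX_PARALLEL_SYNC_TASK_NUM at once
        flush_to_object_storage(segment_id)

with ThreadPoolExecutor(max_workers=8) as pool:
    for seg in range(16):
        pool.submit(sync_segment, seg)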

congqixia avatar Aug 09 '23 03:08 congqixia

can we simplify the mqstream logic to make it easier to understand?

xiaofan-luan avatar Aug 13 '23 07:08 xiaofan-luan

@congqixia any plans for fixing this issue in v2.3.2?

yanliang567 avatar Oct 16 '23 09:10 yanliang567

@yanliang567 nope, L0 delta and other datanode refinements will be implemented after 2.3.2

congqixia avatar Oct 16 '23 09:10 congqixia

moving to 2.3.3

yanliang567 avatar Oct 16 '23 10:10 yanliang567

moving to 2.4 for L0 deletion

yanliang567 avatar Dec 05 '23 08:12 yanliang567

@yanliang567 now we shall verify whether this problem persists when the L0 segment is enabled. /assign @yanliang567

congqixia avatar Mar 05 '24 08:03 congqixia

Will do once the L0 segment is enabled. /unassign @congqixia

yanliang567 avatar Mar 05 '24 11:03 yanliang567

Tested on 2.4-20240407-e3b65203-amd64: datanode memory goes up to 38GB.

And the tt lag catches up from 5.3h to 200ms in about 60 minutes.

yanliang567 avatar Apr 07 '24 10:04 yanliang567

I thought the key might be to increase flush concurrency to make sure flush can catch up with the insertion rate.
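
Not from the thread, just an illustration of this point: the lag shrinks at the margin between flush throughput and the insertion rate, so the catch-up time is roughly the backlog divided by that margin. A sketch with made-up rates:

# Rough model with made-up throughput numbers; it only illustrates why flush
# throughput has to exceed the insertion rate for the tt lag to shrink.
backlog_hours = 5.3            # tt lag accumulated while the datanode was down
insert_rate = 2_300            # rows/s still arriving (assumed)
sync_rate = 4 * insert_rate    # rows/s the datanode can flush (assumed)

backlog_rows = backlog_hours * 3600 * insert_rate
catchup_minutes = backlog_rows / (sync_rate - insert_rate) / 60
print(f"catch-up time: ~{catchup_minutes:.0f} minutes")   # ~106 minutes here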

xiaofan-luan avatar Apr 07 '24 18:04 xiaofan-luan

/assign @congqixia

xiaofan-luan avatar Apr 07 '24 18:04 xiaofan-luan

@xiaofan-luan the scenario here is to verify the datanode behavior when the datanode is down for a long time.

@yanliang567 the last run did not limit the memory of the datanode. Memory usage went to around 40GB, so maybe it's still an issue here. Let's check what the behavior is when the datanode has a memory limit.

The catch-up time is about one hour for a 5-hour tt lag with ongoing insertion. Is this good enough for our system? @xiaofan-luan @yanliang567 @tedxu @jaime0815

congqixia avatar Apr 08 '24 02:04 congqixia

  1. How long was the cluster stopped?
  2. Is there anything we can improve? What is the bottleneck?

xiaofan-luan avatar Apr 08 '24 05:04 xiaofan-luan

We did not stop the cluster; we just scaled the datanode replicas down to 0, inserted for 6 hours (~50M_768d data), and then brought one datanode back up. @congqixia is working on a PR.

yanliang567 avatar Apr 10 '24 01:04 yanliang567

On master-20240426-bed6363f, the tt lag catches up quickly, but the datanode uses memory without any limit; OOM occurred several times in an 8c32g datanode pod.

yanliang567 avatar Apr 28 '24 09:04 yanliang567

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Jun 10 '24 06:06 stale[bot]