
[Bug]: datanode memory usage increased to 150GB when there are 50m vectors to be flush

Open · yanliang567 opened this issue 1 year ago • 17 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master-20230805-241117dd
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

I scaled the datanode down to 0 and inserted 50m_768d vectors. Then I scaled the datanode back up to 1, and its memory usage increased to 150GB within 15 minutes.

Expected Behavior

The baseline on master-20230802-df26b909: datanode memory usage is about 1.3-2.3GB for the same volume of vectors.

Steps To Reproduce

1. create a collection with 20k_768d vectors, build hnsw index
2. scale down the datanode to 0
3. insert 50m_768d vectors
4. scale up the datanode back to 1
5. wait and check the tt lag, datanode cpu and memory
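
Not part of the original report, but a minimal pymilvus sketch of steps 1 and 3 above, assuming pymilvus 2.x and an illustrative collection name; the datanode scaling in steps 2 and 4 happens outside the script (e.g. kubectl scale on the datanode deployment).

import numpy as np
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="localhost", port="19530")

DIM = 768
BATCH = 10_000

schema = CollectionSchema([
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="vec", dtype=DataType.FLOAT_VECTOR, dim=DIM),
])
coll = Collection("ttlag_repro", schema)

def insert_rows(total: int) -> None:
    # insert random 768d vectors in batches to stay under the gRPC size limit
    for _ in range(total // BATCH):
        coll.insert([np.random.random((BATCH, DIM)).tolist()])

# step 1: seed the collection with 20k_768d vectors and build an HNSW index
insert_rows(20_000)
coll.create_index("vec", {"index_type": "HNSW", "metric_type": "L2",
                          "params": {"M": 16, "efConstruction": 200}})

# steps 2 and 4 (scaling the datanode to 0 and back to 1) are done outside
# this script, e.g. with kubectl scale on the datanode deployment

# step 3: insert 50m_768d vectors while the datanode is scaled to 0
insert_rows(50_000_000)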

Milvus Log

pod names on devops:

yanliang-ttlag-milvus-datanode-cbf79cbdc-bx4h6                  1/1     Running       0               34m     10.102.7.245    devops-node11   <none>           <none>
yanliang-ttlag-milvus-indexnode-6699c566d7-9l49n                1/1     Running       2 (2m16s ago)   6h54m   10.102.7.231    devops-node11   <none>           <none>
yanliang-ttlag-milvus-mixcoord-987654d85-pfzg2                  1/1     Running       0               6h54m   10.102.7.238    devops-node11   <none>           <none>
yanliang-ttlag-milvus-proxy-df7b5955f-5twjd                     1/1     Running       0               6h54m   10.102.7.239    devops-node11   <none>           <none>
yanliang-ttlag-milvus-querynode-76cf9c9b55-rcwx9                1/1     Running       0               6h54m   10.102.7.232    devops-node11   <none>           <none>

Anything else?

The suspected PR: #26144

yanliang567 avatar Aug 07 '23 08:08 yanliang567

/assign @congqixia /unassign

yanliang567 avatar Aug 07 '23 08:08 yanliang567

From the pprof, there are lots of msg packs buffered in memory.

There are some channels whose buffers are too large, which could cause this problem:

  • MsgStream buffers(mq buffer & receive buffer) 1024*2
  • Flowgraph node buffer (input node -> dd node -> insert buffer node) 1024*2

Under high read pressure, all channels will be full, which will lead to a 1024*48MB memory cost.

And the flush manager will buffer flush tasks as well, which will multiply this memory cost.
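
As a rough back-of-envelope check of the estimate above (a Python sketch; the 48MB per-msg-pack size and the number of concurrently full buffers are assumptions, not measured values):

# Back-of-envelope sketch of the buffer sizing described above.
# The 48MB per-msg-pack size and the buffer count are assumptions.
MSG_PACK_MB = 48          # assumed average size of one buffered msg pack
CHANNEL_CAPACITY = 1024   # MsgStream / flowgraph node buffer capacity

per_buffer_gb = CHANNEL_CAPACITY * MSG_PACK_MB / 1024
print(f"one full 1024-slot buffer: ~{per_buffer_gb:.0f} GB")

# mq buffer, receive buffer and the flowgraph node buffers can all fill up,
# and flush tasks queued in the flush manager add on top of that
full_buffers = 3
print(f"{full_buffers} full buffers: ~{full_buffers * per_buffer_gb:.0f} GB")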

congqixia avatar Aug 07 '23 09:08 congqixia

@yanliang567 after #26179 is merged, could you please verify with this parameter enlarged:

dataNode:
  dataSync:
    maxParallelSyncTaskNum: 2 # Maximum number of sync tasks executed in parallel in each flush manager
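
Not the actual datanode code (the flush manager is written in Go), but a minimal Python sketch of what a maxParallelSyncTaskNum-style cap does: a semaphore bounds how many sync tasks run at once, so raising it lets buffered data drain faster at the cost of more concurrent flush work. All names here are illustrative.

# Illustrative only: how a maxParallelSyncTaskNum-style limit bounds the
# number of sync (flush) tasks in flight at any moment.
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_SYNC_TASK_NUM = 2
sync_slots = threading.Semaphore(MAX_PARALLEL_SYNC_TASK_NUM)

def flush_to_object_storage(segment_id: int) -> None:
    # placeholder for the real work of writing a segment to object storage
    print(f"flushing segment {segment_id}")

def sync_segment(segment_id: int) -> None:
    with sync_slots:              # at most MAX_PARALLEL_SYNC_TASK_NUM at once
        flush_to_object_storage(segment_id)

with ThreadPoolExecutor(max_workers=8) as pool:
    for seg in range(16):
        pool.submit(sync_segment, seg)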

congqixia avatar Aug 09 '23 03:08 congqixia

can we simplify the mqstream logic to make it easier to understand?

xiaofan-luan avatar Aug 13 '23 07:08 xiaofan-luan

@congqixia any plans for fixing this issue in v2.3.2?

yanliang567 avatar Oct 16 '23 09:10 yanliang567

@yanliang567 nope, L0 delta and other datanode refinements will be implemented after 2.3.2

congqixia avatar Oct 16 '23 09:10 congqixia

moving to 2.3.3

yanliang567 avatar Oct 16 '23 10:10 yanliang567

moving to 2.4 for L0 deletion

yanliang567 avatar Dec 05 '23 08:12 yanliang567

@yanliang567 now we shall verify whether this problem persists when the L0 segment is enabled. /assign @yanliang567

congqixia avatar Mar 05 '24 08:03 congqixia

Will do once the L0 segment is enabled. /unassign @congqixia

yanliang567 avatar Mar 05 '24 11:03 yanliang567

Tested on 2.4-20240407-e3b65203-amd64: datanode memory goes up to 38GB.

And the tt lag catches up from 5.3h to 200ms in about 60 minutes.

yanliang567 avatar Apr 07 '24 10:04 yanliang567

I thought the key might be to increase flush concurrency to make sure flush can catch up with the insertion rate.
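
Not from the thread, just an illustration of this point: the lag shrinks at the margin between flush throughput and the insertion rate, so the catch-up time is roughly the backlog divided by that margin. A sketch with made-up rates:

# Rough model with made-up throughput numbers; it only illustrates why flush
# throughput has to exceed the insertion rate for the tt lag to shrink.
backlog_hours = 5.3            # tt lag accumulated while the datanode was down
insert_rate = 2_300            # rows/s still arriving (assumed)
sync_rate = 4 * insert_rate    # rows/s the datanode can flush (assumed)

backlog_rows = backlog_hours * 3600 * insert_rate
catchup_minutes = backlog_rows / (sync_rate - insert_rate) / 60
print(f"catch-up time: ~{catchup_minutes:.0f} minutes")   # ~106 minutes here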

xiaofan-luan avatar Apr 07 '24 18:04 xiaofan-luan

/assign @congqixia

xiaofan-luan avatar Apr 07 '24 18:04 xiaofan-luan

@xiaofan-luan the scenario here is to verify the datanode behavior when the datanode is down for a long time.

@yanliang567 the last run did not limit the memory of the datanode. Memory usage went to around 40GB, so maybe it's still an issue here. Let's check what the behavior is when the datanode has a memory limit.

The catch-up time is about one hour for a 5-hour tt lag with ongoing insertion. Is this good enough for our system? @xiaofan-luan @yanliang567 @tedxu @jaime0815

congqixia avatar Apr 08 '24 02:04 congqixia

  1. How long was the cluster stopped?
  2. Is there anything we can improve? What is the bottleneck?

xiaofan-luan avatar Apr 08 '24 05:04 xiaofan-luan

We did not stop the cluster; we just scaled the datanode replicas down to 0, inserted for 6 hours (~50M_768d data), and then brought one datanode back up. @congqixia is working on a PR.

yanliang567 avatar Apr 10 '24 01:04 yanliang567

On master-20240426-bed6363f, the tt lag catches up quickly, but the datanode uses memory without any limit; OOM occurred several times in an 8c32g datanode pod.

yanliang567 avatar Apr 28 '24 09:04 yanliang567

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Jun 10 '24 06:06 stale[bot]