milvus [Bug]: Investigate the Cpu Usage on small instance when the system is idle

Is there an existing issue for this?

[X] I have searched the existing issues

Environment

- Milvus version:
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

https://github.com/milvus-io/milvus/issues/22571 https://github.com/milvus-io/milvus/issues/17942

We keep seeing similar issues reporting high CPU usage after data insertion.

I have some wild guesses about the cause, but first we need to verify how to reproduce it:

- Single collection with enough data -> Create a 2c8G standalone instance, insert 8 GB of data into it, run some searches, then leave it idle for at least 1 day and watch the data growth and CPU usage (should be less than 0.5 core).
- Multiple collections -> Create a 2c8G standalone instance, create 100 collections, and see what happens.
- Multiple collections with data -> Create a 2c8G standalone instance, create 100 collections, insert 10,000 entities into each collection, and trigger an index build.

Expected Behavior

CPU usage is below 1 CPU

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

Apr 09 '23 16:04 xiaofan-luan

/assign @elstic Could you please follow up on this issue and add tests accordingly?

Apr 11 '23 03:04 yanliang567

> /assign @elstic Could you please follow up on this issue and add tests accordingly?

OK, let me look at the problem.

Apr 12 '23 02:04 elstic

😒I just deployed a standalone server with several collections...

Apr 14 '23 02:04 elonzh

After 3 h of testing, CPU usage settled below 0.5c, at about 0.05c.
Create 100 collections: CPU usage 0.15c, memory usage 482m.
Conclusion: CPU usage stays below 0.5c.

Apr 14 '23 15:04 elstic

> 😒I just deployed a standalone server with several collections...

Hi Elon, so you are seeing an increase in CPU usage? Could you give me some details about the CPU utilization? A pprof or perf profile works for me.
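
For anyone who wants to capture such a profile, here is a minimal sketch. Go services commonly expose `net/http/pprof` handlers under `/debug/pprof/`; the sketch assumes the Milvus standalone process serves them on port 9091, which is an assumption to verify for your deployment. The host name, output file name, and helper names are hypothetical.

```python
import urllib.request
from urllib.parse import urlencode


def pprof_cpu_url(host: str, port: int = 9091, seconds: int = 30) -> str:
    """Build the URL of a Go net/http/pprof CPU profile endpoint.

    The port default (9091) is an assumption, not taken from this thread.
    """
    query = urlencode({"seconds": seconds})
    return f"http://{host}:{port}/debug/pprof/profile?{query}"


def download_cpu_profile(host: str, out_path: str,
                         port: int = 9091, seconds: int = 30) -> None:
    """Fetch a CPU profile and write it to out_path.

    The request blocks for roughly `seconds` while the server samples.
    """
    url = pprof_cpu_url(host, port, seconds)
    with urllib.request.urlopen(url, timeout=seconds + 30) as resp:
        data = resp.read()
    with open(out_path, "wb") as out:
        out.write(data)


# Example call (needs a reachable server, so it is left commented out):
# download_cpu_profile("localhost", "milvus.cpu.pb.gz")
```

The downloaded file can then be opened with `go tool pprof` to see where the CPU time goes.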

Apr 14 '23 18:04 xiaofan-luan

> After 3 h of testing, CPU usage settled below 0.5c, at about 0.05c.
>
> Create 100 collections: CPU usage 0.15c, memory usage 482m.
>
> Conclusion: CPU usage stays below 0.5c.

How many collections are there in your account?

Apr 14 '23 18:04 xiaofan-luan

> After 3 h of testing, CPU usage settled below 0.5c, at about 0.05c.
>
> Create 100 collections: CPU usage 0.15c, memory usage 482m.
>
> Conclusion: CPU usage stays below 0.5c.

How many collections are there in your test? Could you also add a few data entities to each collection?

Apr 14 '23 18:04 xiaofan-luan

I have no idea what caused the increase in CPU usage. I stopped the server when I received the alert.

But I am sure there are no requests when this issue happens because it is a test server for evaluation.

Apr 15 '23 04:04 elonzh

> How many collections are there in your test? Could you also add a few data entities to each collection?

@xiaofan-luan
My three tests were set up according to your requirements, one test per scenario. The second test creates 100 collections and then does nothing. The third test creates 100 collections, adds 10,000 128-dimensional vectors to each collection, and creates an index.
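
For reference, the third scenario could be scripted roughly as follows. This is a sketch, not the script actually used in the tests: the collection name prefix, field names, index type (`IVF_FLAT` with `nlist=128`), and connection settings are all assumptions.

```python
import random

DIM = 128
NUM_COLLECTIONS = 100
ROWS_PER_COLLECTION = 10_000


def collection_names(count: int, prefix: str = "idle_cpu_test_") -> list:
    """Return hypothetical collection names: idle_cpu_test_0, idle_cpu_test_1, ..."""
    return [f"{prefix}{i}" for i in range(count)]


def random_vectors(rows: int, dim: int, seed: int = 0) -> list:
    """Generate `rows` random float vectors of length `dim`."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(rows)]


def populate(host: str = "localhost", port: str = "19530") -> None:
    """Create the collections, insert vectors, and build an index on each."""
    # pymilvus is imported lazily so the helpers above work without it.
    from pymilvus import (Collection, CollectionSchema, DataType,
                          FieldSchema, connections)

    connections.connect(alias="default", host=host, port=port)
    schema = CollectionSchema([
        FieldSchema(name="id", dtype=DataType.INT64,
                    is_primary=True, auto_id=True),
        FieldSchema(name="vec", dtype=DataType.FLOAT_VECTOR, dim=DIM),
    ])
    for name in collection_names(NUM_COLLECTIONS):
        coll = Collection(name=name, schema=schema)
        # With auto_id=True only the vector column is supplied.
        coll.insert([random_vectors(ROWS_PER_COLLECTION, DIM)])
        coll.flush()
        coll.create_index(
            field_name="vec",
            index_params={"index_type": "IVF_FLAT", "metric_type": "L2",
                          "params": {"nlist": 128}},  # assumed index settings
        )
```

Calling `populate()` against a running standalone instance and then leaving it idle reproduces the setup described above; the second scenario is the same loop without the insert and index steps.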

Apr 15 '23 05:04 elstic

> I have no idea what caused the increase in CPU usage. I stopped the server when I received the alert.
>
> But I am sure there are no requests when this issue happens because it is a test server for evaluation.

Hi @elonzh, I know what's going on. It will be fixed in 2.2.6. Thanks for your feedback.

Apr 15 '23 06:04 xiaofan-luan

The container log may offer some help.

https://wormhole.app/MvAn6#RRH8iE3nz6RqOppHGKaS4g

Apr 15 '23 06:04 elonzh

@xiaofan-luan Will the fix be patched to v2.3? I am using v2.3.0-beta for Aliyun OSS provider support.

Apr 16 '23 06:04 elonzh

It will be fixed in 2.2.6.

Apr 16 '23 07:04 xiaofan-luan

Don't use 2.3 in production yet; we are still working on fixes.

Apr 16 '23 07:04 xiaofan-luan

This is insane! Upgrading to 2.2.6 does not help, even though I cleared all data.

Everything is growing linearly 🤣.

Apr 23 '23 06:04 elonzh

It seems to be a rocksmq issue. It creates so many goroutines.

pprof.milvus.samples.cpu.001.pb.gz

Apr 23 '23 18:04 elonzh

> It seems to be a rocksmq issue. It creates so many goroutines.
>
> pprof.milvus.samples.cpu.001.pb.gz

What disk are you using? Our QA team tried to reproduce this in our internal environment and found no clue. I tried with a hundred collections, but that did not help.

Apr 23 '23 19:04 xiaofan-luan

kind: StorageClass
metadata:
  name: alicloud-disk-essd
  uid: 65d085b6-64c4-4526-9ee1-5755f04589d9
  resourceVersion: '1489528'
  creationTimestamp: '2020-10-23T03:00:37Z'
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: >
      {"allowVolumeExpansion":true,"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{},"name":"alicloud-disk-essd"},"parameters":{"type":"cloud_essd"},"provisioner":"diskplugin.csi.alibabacloud.com","reclaimPolicy":"Delete"}
allowVolumeExpansion: true
allowedTopologies: []
mountOptions: []
parameters:
  type: cloud_essd
provisioner: diskplugin.csi.alibabacloud.com
reclaimPolicy: Delete
volumeBindingMode: Immediate

Apr 23 '23 20:04 elonzh

How many collections, shards, and partitions do you have? ESSD should work perfectly.

Apr 23 '23 20:04 xiaofan-luan

And you didn't change the default time tick interval, right? There seem to be many time ticks in the system.

Apr 23 '23 21:04 xiaofan-luan

It seems that most of the CPU time is spent in the cgo call, but that doesn't make much sense to me unless there are many channels.

Apr 23 '23 21:04 xiaofan-luan

I changed the Milvus config and reset the data, but it still does not help.

extraConfigFiles:
  user.yaml: |+
    rocksmq:
      # The path where the message is stored in rocksmq
      lrucacheratio: 0.06 # rocksdb cache memory ratio
      rocksmqPageSize: 16777216 # default is 256 MB, 256 * 1024 * 1024 bytes, The size of each page of messages in rocksmq
      retentionTimeInMinutes: 1440 # default is 5 days, 5 * 24 * 60 minutes, The retention time of the message in rocksmq.
      retentionSizeInMB: 1024 # default is 8 GB, 8 * 1024 MB, The retention size of the message in rocksmq.
      compactionInterval: 86400 # 1 day, trigger rocksdb compaction every day to remove deleted data
    rootCoord:
      # changing this value will make the cluster unavailable
      dmlChannelNum: 4
    dataCoord:
      segment:
        maxSize: 128 # Maximum size of a segment in MB
        diskSegmentMaxSize: 256 # Maximum size of a segment in MB for collection which has Disk index
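
To make the rocksmq overrides easier to compare with the defaults quoted in the comments, the arithmetic can be spelled out. This is only a sketch: the default values are taken from the comments in the config above, not verified against the Milvus source, and the variable names are illustrative.

```python
MIB = 1024 * 1024

# Values set in user.yaml above.
page_size_bytes = 16_777_216
retention_minutes = 1440
retention_size_mb = 1024
compaction_interval_s = 86_400

# Defaults as stated in the config comments (assumed accurate).
default_page_size_bytes = 256 * MIB
default_retention_minutes = 5 * 24 * 60
default_retention_size_mb = 8 * 1024


def summarize() -> dict:
    """Express each override in friendlier units and relative to its default."""
    return {
        "page_size_mib": page_size_bytes // MIB,
        "page_size_shrink_factor": default_page_size_bytes // page_size_bytes,
        "retention_days": retention_minutes / (24 * 60),
        "retention_time_shrink_factor": default_retention_minutes // retention_minutes,
        "retention_size_shrink_factor": default_retention_size_mb // retention_size_mb,
        "compaction_interval_days": compaction_interval_s / 86_400,
    }
```

In other words, the overrides use 16 MiB pages instead of 256 MiB, keep messages for 1 day instead of 5, and cap retained data at 1 GB instead of 8 GB.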

Apr 24 '23 12:04 elonzh

I am using Alicloud ACS Kubernetes.

System Images : Alibaba Cloud Linux 3 (Soaring Falcon)	Kernel Version : 5.10.134-12.2.al8.x86_64
Kubelet Version : v1.24.6-aliyun.1	Kube-Proxy Version : v1.24.6-aliyun.1

Apr 24 '23 12:04 elonzh

@elstic Could we try restarting the standalone server after inserting enough data, then keep the cluster running for a while to see if this is reproducible?

Jun 11 '23 15:06 xiaofan-luan

This has to happen with many collections, for instance 100 collections with 100k entities inserted into each.

Jun 11 '23 15:06 xiaofan-luan

Deployment method: Kubernetes standalone

Deployed a v2.2.6 instance, inserted 100,000 128-dimensional vectors into 85 collections and 10,000 vectors into 120 collections, then upgraded to v2.2.9. After waiting 2 days, CPU usage has gone up from 1.5c two days ago to 2.5c now.

server:

fouramf-xww99-15-5808-etcd-0                                      1/1     Running     0              2d10h
fouramf-xww99-15-5808-milvus-standalone-64df46f558-dlmq2          1/1     Running     1 (39h ago)    41h
fouramf-xww99-15-5808-minio-f7f566454-pj6sh                       1/1     Running     0              2d10h

@xiaofan-luan @aoiasd The instance is still running, please help troubleshoot the issue.

Jun 14 '23 03:06 elstic

RocksDB CPU usage increases with the number of files. When the system is idle, Milvus keeps sending time tick messages (ttmsg) to rocksmq, so CPU usage grows with the amount of data in rocksmq until rocksmq retention kicks in, which by default happens after 3 days or once the data exceeds 8 GB. Setting a smaller retention time or retention size may therefore help (rocksmq.retentionTimeInMinutes and rocksmq.retentionSizeInMB in milvus.config).

On master we no longer send ttmsg through the message queue, so this problem is already solved on master.
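
A back-of-envelope model shows why the retention window matters on an idle instance. It assumes one time tick message per active channel per tick interval; the 200 ms interval and the channel count of 4 are illustrative assumptions, not measured values, and the function name is hypothetical.

```python
def ticks_retained(channels: int, tick_interval_ms: int,
                   retention_minutes: int) -> int:
    """Estimate how many time tick messages sit in the queue at steady state.

    Assumes each channel receives exactly one tick per interval and that
    messages are dropped once they are older than the retention window.
    """
    ticks_per_channel = retention_minutes * 60 * 1000 // tick_interval_ms
    return channels * ticks_per_channel


# 3-day retention (as described above) versus a 1-day override.
three_days = ticks_retained(channels=4, tick_interval_ms=200,
                            retention_minutes=3 * 24 * 60)
one_day = ticks_retained(channels=4, tick_interval_ms=200,
                         retention_minutes=24 * 60)
```

Under these assumptions the retained message count scales linearly with both the retention window and the number of channels, which fits the observation that the problem shows up with many collections and eases once retention starts.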

Jul 20 '23 08:07 aoiasd

> RocksDB CPU usage increases with the number of files. When the system is idle, Milvus keeps sending time tick messages (ttmsg) to rocksmq, so CPU usage grows with the amount of data in rocksmq until rocksmq retention kicks in, which by default happens after 3 days or once the data exceeds 8 GB. Setting a smaller retention time or retention size may therefore help (rocksmq.retentionTimeInMinutes and rocksmq.retentionSizeInMB in milvus.config).
>
> On master we no longer send ttmsg through the message queue, so this problem is already solved on master.

There will still be tt messages in the DML channel, and this is already in 2.2.12. And yes, this will be alleviated by the new implementation.

Jul 22 '23 04:07 xiaofan-luan

Having too many collections/partitions may still cause this problem. @elonzh if you could do more testing on the newly released 2.2.12, it would be really helpful.

Jul 22 '23 04:07 xiaofan-luan

@elonzh v2.2.12 was released this week. Please try to run your tests if convenient. Thanks.

Jul 27 '23 02:07 yanliang567
