
[Bug]: Investigate the Cpu Usage on small instance when the system is idle

Open xiaofan-luan opened this issue 1 year ago • 23 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version:
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

https://github.com/milvus-io/milvus/issues/22571 https://github.com/milvus-io/milvus/issues/17942

We see many similar issues reporting that CPU usage is high after data insertion.

I have some wild guesses about the cause, but we need to verify how to reproduce it:

  1. Single collection with enough data -> Create a 2c8G standalone instance, insert 8 GB of data into it, run some searches and leave it for at least 1 day, then check the data growth and CPU usage (should be less than 0.5 core)
  2. Multi collection -> Create a 2c8G standalone instance and create 100 collections, then see what happens
  3. Multi collection -> Create a 2c8G standalone instance, create 100 collections, insert 10000 entities into each collection and trigger an index build
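
Scenario 3 could be scripted roughly as follows. This is a minimal sketch using pymilvus, not a script from this thread: the collection names, the field names `id` and `vec`, the host and port, and the IVF_FLAT index parameters are all assumptions chosen for illustration. The 128-dimensional vectors match the test data described later in the discussion.

```python
import random


def make_vectors(count, dim, seed=0):
    """Return `count` random float vectors of length `dim` (deterministic per seed)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(count)]


def run_scenario(num_collections=100, entities=10_000, dim=128,
                 host="localhost", port="19530"):
    """Create collections, insert entities into each, and build an index on each."""
    # Imported lazily so make_vectors can be used without pymilvus installed.
    from pymilvus import (Collection, CollectionSchema, DataType,
                          FieldSchema, connections)

    connections.connect(alias="default", host=host, port=port)
    schema = CollectionSchema([
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema(name="vec", dtype=DataType.FLOAT_VECTOR, dim=dim),
    ])
    for i in range(num_collections):
        # Hypothetical collection name, used only for this test.
        coll = Collection(name=f"idle_cpu_test_{i}", schema=schema)
        # Column-based insert: one list per non-auto field.
        coll.insert([make_vectors(entities, dim, seed=i)])
        coll.flush()
        # Index type and parameters are an assumption, not taken from the issue.
        coll.create_index("vec", {"index_type": "IVF_FLAT",
                                  "metric_type": "L2",
                                  "params": {"nlist": 128}})
```

Calling `run_scenario()` against a fresh standalone instance and then leaving it idle should approximate the third reproduction case; `run_scenario(entities=0)` is not equivalent to case 2, which creates the collections without inserting anything.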

Expected Behavior

CPU usage is below 1 CPU
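
One way to check this expectation on Linux is to sample the process's accumulated CPU time from `/proc`. The sketch below is an illustration under assumptions, not a tool used in this thread: the function names are hypothetical, and it measures a single process by PID rather than a whole container.

```python
import os
import time


def cores_used(ticks_before, ticks_after, interval_s, ticks_per_s):
    """Average number of CPU cores consumed over the interval."""
    return (ticks_after - ticks_before) / ticks_per_s / interval_s


def read_cpu_ticks(pid):
    """Return utime + stime (in clock ticks) from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The command name is wrapped in parentheses and may contain spaces,
    # so split only the part after the last closing parenthesis.
    fields = stat[stat.rindex(")") + 2:].split()
    # fields[0] is the state (field 3 in proc(5)); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def sample(pid, interval_s=60.0):
    """Measure the average cores used by `pid` over `interval_s` seconds."""
    before = read_cpu_ticks(pid)
    time.sleep(interval_s)
    after = read_cpu_ticks(pid)
    return cores_used(before, after, interval_s, os.sysconf("SC_CLK_TCK"))
```

Running `sample(pid)` periodically against the Milvus standalone process while no requests are being sent gives a number that can be compared directly with the 0.5-core and 1-core thresholds mentioned in this issue.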

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

xiaofan-luan avatar Apr 09 '23 16:04 xiaofan-luan

/assign @elstic could you please follow up this issue and test addition accordingly.

yanliang567 avatar Apr 11 '23 03:04 yanliang567

/assign @elstic could you please follow up this issue and test addition accordingly.

OK, let me look at the problem.

elstic avatar Apr 12 '23 02:04 elstic

😒I just deployed a standalone server with several collections...

image

image

elonzh avatar Apr 14 '23 02:04 elonzh

  1. After 3h of testing, CPU usage dropped below 0.5c, settling at about 0.05c. image

  2. Created 100 collections: CPU usage 0.15c, memory usage 482m. image image

  3. Conclusion: CPU usage is below 0.5c. image

image

elstic avatar Apr 14 '23 15:04 elstic

😒I just deployed a standalone server with several collections...

image

image

Hi Elon, so you are seeing an increase in CPU usage? Could you give me some clues about the CPU utilization details? pprof or perf output works for me.

xiaofan-luan avatar Apr 14 '23 18:04 xiaofan-luan

  1. After 3h of testing, CPU usage dropped below 0.5c, settling at about 0.05c. image
  2. Created 100 collections: CPU usage 0.15c, memory usage 482m. image image
  3. Conclusion: CPU usage is below 0.5c. image

image How many collections are there in your account

xiaofan-luan avatar Apr 14 '23 18:04 xiaofan-luan

  1. After 3h of testing, CPU usage dropped below 0.5c, settling at about 0.05c. image
  2. Created 100 collections: CPU usage 0.15c, memory usage 482m. image image
  3. Conclusion: CPU usage is below 0.5c. image

image How many collections are there in your test? Could you also add a few entities to each collection?

xiaofan-luan avatar Apr 14 '23 18:04 xiaofan-luan

I have no idea about the increase in CPU usage. I stopped the server when I received the alert.

But I am sure there are no requests when this issue happens because it is a test server for evaluation.

elonzh avatar Apr 15 '23 04:04 elonzh

How many collections are there in your test? Could you also add a few entities to each collection?

@xiaofan-luan
My three tests follow your requirements, one test per scenario. The second test creates 100 collections and then does nothing. The third test creates 100 collections, inserts 10,000 128-dimensional entities into each collection, and creates an index.

elstic avatar Apr 15 '23 05:04 elstic

I have no idea about the increase in CPU usage. I stopped the server when I received the alert.

But I am sure there are no requests when this issue happens because it is a test server for evaluation.

hi @elonzh , I know what's going on. It will be fixed in 2.2.6, thanks for your feedback.

xiaofan-luan avatar Apr 15 '23 06:04 xiaofan-luan

The container log may offer some help.

image

image

https://wormhole.app/MvAn6#RRH8iE3nz6RqOppHGKaS4g

elonzh avatar Apr 15 '23 06:04 elonzh

@xiaofan-luan Will the fix be patched to v2.3? I am using v2.3.0-beta for Aliyun OSS provider support.

elonzh avatar Apr 16 '23 06:04 elonzh

It will be fixed in 2.2.6.

xiaofan-luan avatar Apr 16 '23 07:04 xiaofan-luan

Don't use 2.3 in production yet; we are still working on fixes.

xiaofan-luan avatar Apr 16 '23 07:04 xiaofan-luan

This is insane! Upgrading to 2.2.6 did not help, even though I cleared all data.

Everything is growing linearly 🤣.

image image

image

elonzh avatar Apr 23 '23 06:04 elonzh

It seems to be a rocksmq issue. It creates so many goroutines.

image

pprof.milvus.samples.cpu.001.pb.gz

elonzh avatar Apr 23 '23 18:04 elonzh

It seems to be a rocksmq issue. It creates so many goroutines.

image

pprof.milvus.samples.cpu.001.pb.gz

What disk are you using? Our QA team tried to reproduce this in our internal environment and found no clue. I tried with a hundred collections, but that did not help.

xiaofan-luan avatar Apr 23 '23 19:04 xiaofan-luan

kind: StorageClass
metadata:
  name: alicloud-disk-essd
  uid: 65d085b6-64c4-4526-9ee1-5755f04589d9
  resourceVersion: '1489528'
  creationTimestamp: '2020-10-23T03:00:37Z'
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: >
      {"allowVolumeExpansion":true,"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{},"name":"alicloud-disk-essd"},"parameters":{"type":"cloud_essd"},"provisioner":"diskplugin.csi.alibabacloud.com","reclaimPolicy":"Delete"}
allowVolumeExpansion: true
allowedTopologies: []
mountOptions: []
parameters:
  type: cloud_essd
provisioner: diskplugin.csi.alibabacloud.com
reclaimPolicy: Delete
volumeBindingMode: Immediate

elonzh avatar Apr 23 '23 20:04 elonzh

How many collections, shards and partitions do you have? ESSD should work perfectly.

xiaofan-luan avatar Apr 23 '23 20:04 xiaofan-luan

And you didn't change the default time tick interval, right? There seem to be many time ticks in the system.

xiaofan-luan avatar Apr 23 '23 21:04 xiaofan-luan

image It seems that most of the CPU time is spent in the cgo call, but that doesn't make much sense to me unless there are many channels.

xiaofan-luan avatar Apr 23 '23 21:04 xiaofan-luan

I changed the milvus config and reset the data, still not working.

extraConfigFiles:
  user.yaml: |+
    rocksmq:
      # The path where the message is stored in rocksmq
      lrucacheratio: 0.06 # rocksdb cache memory ratio
      rocksmqPageSize: 16777216 # default is 256 MB, 256 * 1024 * 1024 bytes, The size of each page of messages in rocksmq
      retentionTimeInMinutes: 1440 # default is 5 days, 5 * 24 * 60 minutes, The retention time of the message in rocksmq.
      retentionSizeInMB: 1024 # default is 8 GB, 8 * 1024 MB, The retention size of the message in rocksmq.
      compactionInterval: 86400 # 1 day, trigger rocksdb compaction every day to remove deleted data
    rootCoord:
      # changing this value will make the cluster unavailable
      dmlChannelNum: 4
    dataCoord:
      segment:
        maxSize: 128 # Maximum size of a segment in MB
        diskSegmentMaxSize: 256 # Maximum size of a segment in MB for collection which has Disk index

image

image

image

elonzh avatar Apr 24 '23 12:04 elonzh

I am using Alicloud ACS kubernetes.

System Image: Alibaba Cloud Linux 3 (Soaring Falcon)
Kernel Version: 5.10.134-12.2.al8.x86_64
Kubelet Version: v1.24.6-aliyun.1
Kube-Proxy Version: v1.24.6-aliyun.1

elonzh avatar Apr 24 '23 12:04 elonzh

@elstic could we try restarting the standalone server after inserting enough data, then keep the cluster running for a while to see if this is reproducible?

xiaofan-luan avatar Jun 11 '23 15:06 xiaofan-luan

This probably only happens with many collections, for instance 100 collections with 100k entities inserted into each.

xiaofan-luan avatar Jun 11 '23 15:06 xiaofan-luan

Deployment method: Kubernetes standalone

Deployed a v2.2.6 instance, inserted 100,000 128-dimensional entities into 85 collections and 10,000 entities into 120 collections, then upgraded to v2.2.9. After waiting 2 days, CPU usage had gone up from 1.5c to 2.5c.

image

server:

fouramf-xww99-15-5808-etcd-0                                      1/1     Running     0              2d10h
fouramf-xww99-15-5808-milvus-standalone-64df46f558-dlmq2          1/1     Running     1 (39h ago)    41h
fouramf-xww99-15-5808-minio-f7f566454-pj6sh                       1/1     Running     0              2d10h

@xiaofan-luan @aoiasd Instance still remains, please help to troubleshoot the issue

elstic avatar Jun 14 '23 03:06 elstic

RocksDB CPU usage increases with the number of files. When the system is idle, Milvus keeps sending ttmsg to rocksmq, so CPU usage grows with the amount of data in rocksmq until rocksmq starts retention, which by default happens after 3 days or once the data exceeds 8 GB. Setting a smaller retention time or retention size may therefore help (rocksmq.retentionTimeInMinutes and rocksmq.retentionSizeInMB in the Milvus config).

We no longer send ttmsg through the message queue on master, so this problem has been solved on master.
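
To get a feel for why a shorter retention window might help, the rough arithmetic can be sketched as below. This is a simplified model under stated assumptions, not Milvus's actual accounting: it assumes one time-tick message per channel per tick interval, and the 200 ms tick interval and the function name are hypothetical. The channel count of 4 is taken from the `dmlChannelNum` value in the config posted earlier in this thread.

```python
def retained_tick_messages(channels, retention_minutes, tick_interval_ms):
    """Rough upper bound on time-tick messages kept in one retention window.

    Assumes one message per channel per tick interval (a simplification).
    """
    window_ms = retention_minutes * 60_000
    return channels * window_ms // tick_interval_ms


# 3-day retention (the default mentioned above) vs. a 1-day retention window.
three_days = retained_tick_messages(channels=4, retention_minutes=4320,
                                    tick_interval_ms=200)
one_day = retained_tick_messages(channels=4, retention_minutes=1440,
                                 tick_interval_ms=200)
print(three_days, one_day, three_days // one_day)
```

Under this model the amount of idle-time data that rocksmq retains scales linearly with both the retention window and the number of channels, which is consistent with the observation that many collections make the problem worse.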

aoiasd avatar Jul 20 '23 08:07 aoiasd

RocksDB CPU usage increases with the number of files. When the system is idle, Milvus keeps sending ttmsg to rocksmq, so CPU usage grows with the amount of data in rocksmq until rocksmq starts retention, which by default happens after 3 days or once the data exceeds 8 GB. Setting a smaller retention time or retention size may therefore help (rocksmq.retentionTimeInMinutes and rocksmq.retentionSizeInMB in the Milvus config).

We no longer send ttmsg through the message queue on master, so this problem has been solved on master.

There will still be tt messages in the dml channel, and this is already in 2.2.12. And yes, this will be alleviated with the new implementation.

xiaofan-luan avatar Jul 22 '23 04:07 xiaofan-luan

Having too many collections/partitions may still cause this problem. @elonzh, if you can do more testing on the newly released 2.2.12, it would be really helpful.

xiaofan-luan avatar Jul 22 '23 04:07 xiaofan-luan

@elonzh v2.2.12 was released this week. Please try to run your tests if convenient. Thanks.

yanliang567 avatar Jul 27 '23 02:07 yanliang567