[Bug]: Investigate the CPU usage on small instances when the system is idle
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version:
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
https://github.com/milvus-io/milvus/issues/22571 https://github.com/milvus-io/milvus/issues/17942
We see many similar issues reporting that CPU usage stays high after data insertion.
I have some wild guesses about the cause, but we need to verify how to reproduce it:
- Single collection with enough data -> Create a 2c8G standalone, insert 8 GB of data into it, run some searches, then leave it idle for at least 1 day; watch the data growth and CPU usage (it should be less than 0.5 core)
- Multi collection -> Create a 2c8G standalone and create 100 collections; see what happens
- Multi collection -> Create a 2c8G standalone, create 100 collections, insert 10,000 entities into each collection, and trigger an index build (see the sketch after this list)
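A minimal sketch of that third test in pymilvus, assuming a standalone instance on localhost:19530 and a 2.2.x SDK; the collection and field names here are made up for illustration:

```python
import random

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

# Assumes a standalone Milvus reachable on localhost:19530.
connections.connect(host="localhost", port="19530")

DIM = 128
schema = CollectionSchema([
    FieldSchema("pk", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("vec", DataType.FLOAT_VECTOR, dim=DIM),
])

for i in range(100):
    # Hypothetical collection names; any 100 distinct names will do.
    coll = Collection(f"idle_cpu_test_{i}", schema)
    vectors = [[random.random() for _ in range(DIM)] for _ in range(10_000)]
    coll.insert([vectors])  # one batch of 10k entities per collection
    coll.flush()            # seal the segment so index building can kick in
    coll.create_index("vec", {
        "index_type": "IVF_FLAT",
        "metric_type": "L2",
        "params": {"nlist": 128},
    })
```

After that, leave the instance idle and watch its CPU usage over a day or more.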
Expected Behavior
CPU usage is below 1 CPU
Steps To Reproduce
No response
Milvus Log
No response
Anything else?
No response
/assign @elstic could you please follow up on this issue and add tests accordingly.
OK, let me look at the problem.
😒I just deployed a standalone server with several collections...
> 😒I just deployed a standalone server with several collections...
Hi Elon, so you are seeing an increase in CPU usage? Could you give me some clues about the CPU utilization details? pprof or perf output works for me.
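If it helps, something like the following can grab a profile from a standalone container; this assumes Milvus exposes the Go pprof handlers on its metrics port (9091 by default), so verify the port and path on your deployment:

```shell
# 30-second CPU profile via the Go runtime's pprof endpoint (port is an assumption)
go tool pprof -seconds 30 http://localhost:9091/debug/pprof/profile

# Or dump all goroutine stacks without the Go toolchain
curl -o goroutine.txt 'http://localhost:9091/debug/pprof/goroutine?debug=2'
```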
- After 3h of testing, the CPU resource usage dropped to less than 0.5c, using about 0.05c.
- Created 100 collections: CPU usage 0.15c, memory usage 482 MB.
- Conclusion: CPU usage stays below 0.5c.
How many collections are there in your test? Could you also add a few data entities to each collection?
I have no idea about the increase in CPU usage; I stopped the server when I received the alert.
But I am sure there were no requests when this issue happened, because it is a test server for evaluation.
> How many collections are there in your test? Could you also add a few data entities to each collection?
@xiaofan-luan
My three tests were set up according to your requirements and correspond to them one by one. The second test creates 100 collections and then does nothing.
The third test creates 100 collections, inserts 10,000 rows of 128-dimensional data into each collection, and creates an index.
> I have no idea about the increase in CPU usage; I stopped the server when I received the alert.
> But I am sure there were no requests when this issue happened, because it is a test server for evaluation.
Hi @elonzh, I know what's going on. It will be fixed in 2.2.6. Thanks for your feedback.
The container log may offer some help.
https://wormhole.app/MvAn6#RRH8iE3nz6RqOppHGKaS4g
@xiaofan-luan Will the fix be patched to v2.3? I am using v2.3.0-beta for Aliyun OSS provider support.
It will be fixed in 2.2.6.
Don't use 2.3 in production yet; we are still working on fixes.
This is insane! Upgrading to 2.2.6 is not working even though I cleared all data.
Everything is growing linearly 🤣.
It seems to be a rocksmq issue; it creates so many goroutines.
What disk are you using? Our QA team tried to reproduce this in an internal environment and got no clue. I tried with a hundred collections, but that didn't help.
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-disk-essd
  uid: 65d085b6-64c4-4526-9ee1-5755f04589d9
  resourceVersion: '1489528'
  creationTimestamp: '2020-10-23T03:00:37Z'
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: >
      {"allowVolumeExpansion":true,"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{},"name":"alicloud-disk-essd"},"parameters":{"type":"cloud_essd"},"provisioner":"diskplugin.csi.alibabacloud.com","reclaimPolicy":"Delete"}
allowVolumeExpansion: true
allowedTopologies: []
mountOptions: []
parameters:
  type: cloud_essd
provisioner: diskplugin.csi.alibabacloud.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
```
How many collections, shards, and partitions do you have? ESSD should work perfectly.
And you didn't change the default time tick interval, right? There seem to be many time ticks in the system.

I changed the Milvus config and reset the data; it is still not working.
```yaml
extraConfigFiles:
  user.yaml: |+
    rocksmq:
      # The path where the message is stored in rocksmq
      lrucacheratio: 0.06 # rocksdb cache memory ratio
      rocksmqPageSize: 16777216 # default is 256 MB, 256 * 1024 * 1024 bytes; the size of each page of messages in rocksmq
      retentionTimeInMinutes: 1440 # default is 5 days, 5 * 24 * 60 minutes; the retention time of messages in rocksmq
      retentionSizeInMB: 1024 # default is 8 GB, 8 * 1024 MB; the retention size of messages in rocksmq
      compactionInterval: 86400 # 1 day; trigger rocksdb compaction every day to remove deleted data
    rootCoord:
      # changing this value will make the cluster unavailable
      dmlChannelNum: 4
    dataCoord:
      segment:
        maxSize: 128 # Maximum size of a segment in MB
        diskSegmentMaxSize: 256 # Maximum size of a segment in MB for collections with a disk index
```
I am using Alicloud ACS kubernetes.
- System Image: Alibaba Cloud Linux 3 (Soaring Falcon)
- Kernel Version: 5.10.134-12.2.al8.x86_64
- Kubelet Version: v1.24.6-aliyun.1
- Kube-Proxy Version: v1.24.6-aliyun.1
@elstic could we try restarting the standalone server after inserting enough data, then keep the cluster running for a while and see if this is reproducible?
This has to happen with many collections; for instance, 100 collections with 100k rows inserted into each.
Deployment method: Kubernetes standalone.
Deployed a v2.2.6 instance, inserted 100,000 rows of 128-dimensional data into each of 85 collections and 10,000 rows into each of 120 collections, then upgraded to v2.2.9. After waiting 2 days, CPU usage went up from 1.5c two days ago to 2.5c now.
server:

```
fouramf-xww99-15-5808-etcd-0                               1/1   Running   0             2d10h
fouramf-xww99-15-5808-milvus-standalone-64df46f558-dlmq2   1/1   Running   1 (39h ago)   41h
fouramf-xww99-15-5808-minio-f7f566454-pj6sh                1/1   Running   0             2d10h
```
@xiaofan-luan @aoiasd The instance is still up; please help troubleshoot the issue.
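To capture evidence from the running pod, a port-forward plus the pprof endpoint mentioned earlier may help (again assuming pprof is served on the 9091 metrics port):

```shell
# Forward the standalone pod's metrics port to localhost (pod name from the listing above)
kubectl port-forward fouramf-xww99-15-5808-milvus-standalone-64df46f558-dlmq2 9091:9091

# In another shell: dump goroutine stacks to check for a goroutine leak
curl -o goroutine.txt 'http://localhost:9091/debug/pprof/goroutine?debug=2'
```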
RocksDB CPU usage increases with the number of files. When the system is idle, Milvus keeps sending ttmsg (time tick messages) to rocksmq, so CPU usage grows with the amount of data in rocksmq until rocksmq retention kicks in, after 3 days or more than 8 GB by default. So setting a smaller retention time or retention size may help (rocksmq.retentionTimeInMinutes and rocksmq.retentionSizeInMB in the Milvus config).
We won't send ttmsg through the message queue on master, so this problem has been solved on master.
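For example, using the extraConfigFiles approach shown earlier in this thread, a tighter retention setting might look like this (the values are illustrative, not tested recommendations):

```yaml
extraConfigFiles:
  user.yaml: |+
    rocksmq:
      retentionTimeInMinutes: 720 # keep messages for 12 hours instead of the multi-day default
      retentionSizeInMB: 512      # cap retained data well below the 8 GB default
```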
There will still be tt messages in the DML channel, and this is already in 2.2.12. And yes, this will be alleviated by the new implementation.
Having too many collections/partitions may still cause this problem. @elonzh, if you can run more tests on the newly released 2.2.12, it would be really helpful.
@elonzh v2.2.12 was released this week. Please try to run your tests if convenient. Thanks.