milvus
[Bug]: CPU as high as 10% even with no search/insert/query actions
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: Milvus 2.2.2
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):
- SDK version(e.g. pymilvus v2.0.0rc2): 2.2.0
- OS(Ubuntu or CentOS): Debian 11.6
- CPU/Memory: Intel(R) Core(TM) i3-2350M CPU @ 2.30GHz / 16G
- GPU: None
- Others:
Current Behavior
After I used docker-compose to start the standalone version of Milvus, the CPU usage stays as high as 10% even when no action is taken.
Is this normal?
I just inserted 10,000 entities into the database. Is it building an index?
Expected Behavior
When not in use, the CPU usage should be low.
Steps To Reproduce
No response
Milvus Log
It keeps printing these logs:
milvus-standalone | [2023/03/03 23:03:56.944 +00:00] [DEBUG] [datacoord/services.go:877] [DataCoord.GetMetrics] [node_id=133] [req="{"metric_type":"system_info"}"] [metric_type=system_info] [metrics="status:<> response:"{\"cluster\":{\"self\":{\"has_error\":false,\"error_reason\":\"\",\"name\":\"datacoord133\",\"hardware_infos\":{\"ip\":\"172.20.0.4:13333\",\"cpu_core_count\":4,\"cpu_core_usage\":0,\"memory\":16663109632,\"memory_usage\":3798999040,\"disk\":104857600,\"disk_usage\":2097152},\"system_info\":{\"system_version\":\"0fdc1a04\",\"deploy_mode\":\"STANDALONE\",\"build_version\":\"v2.2.2\",\"build_time\":\"Thu Dec 22 09:08:37 UTC 2022\",\"used_go_version\":\"go version go1.18.3 linux/amd64\"},\"created_time\":\"2023-03-03 22:58:42.215947162 +0000 UTC m=+4.471734750\",\"updated_time\":\"2023-03-03 22:58:42.215947492 +0000 UTC m=+4.471735031\",\"type\":\"datacoord\",\"id\":133,\"system_configurations\":{\"segment_max_size\":512},\"quota_metrics\":{\"TotalBinlogSize\":503362742}},\"connected_nodes\":[{\"has_error\":false,\"error_reason\":\"\",\"name\":\"datanode134\",\"hardware_infos\":{\"ip\":\"172.20.0.4:21124\",\"cpu_core_count\":4,\"cpu_core_usage\":100,\"memory\":16663109632,\"memory_usage\":3798999040,\"disk\":104857600,\"disk_usage\":2097152},\"system_info\":{\"system_version\":\"0fdc1a04\",\"deploy_mode\":\"STANDALONE\",\"build_version\":\"v2.2.2\",\"build_time\":\"Thu Dec 22 09:08:37 UTC 2022\",\"used_go_version\":\"go version go1.18.3 linux/amd64\"},\"created_time\":\"2023-03-03 22:58:42.401055749 +0000 UTC m=+4.656843300\",\"updated_time\":\"2023-03-03 22:58:42.401056126 +0000 UTC m=+4.656843681\",\"type\":\"datanode\",\"id\":134,\"system_configurations\":{\"flush_insert_buffer_size\":16777216},\"quota_metrics\":{\"Hms\":{\"ip\":\"172.20.0.4:21124\",\"cpu_core_count\":4,\"cpu_core_usage\":100,\"memory\":16663109632,\"memory_usage\":3798999040,\"disk\":104857600,\"disk_usage\":2097152},\"Rms\":[{\"Label\":\"InsertConsumeThroughput\",\"Rate\":0},{\"Label\":\"DeleteConsumeThroughput\",\"Rate\":0}],\"Fgm\":{\"MinFlowGraphChannel\":\"by-dev-rootcoord-dml_1_439715925445053623v0\",\"MinFlowGraphTt\":439847390212259841,\"NumFlowGraph\":2}}}]},\"connections\":{\"name\":\"datacoord133\",\"connected_components\":[]}}" component_name:"datacoord133" "] [] milvus-standalone | [2023/03/03 23:03:56.946 +00:00] [INFO] [proxy/multi_rate_limiter.go:96] ["RateLimiter setRates"] [rates="[{"rt":5,"r":1.7976931348623157e+308},{"rt":6,"r":1.7976931348623157e+308},{"rt":9,"r":1.7976931348623157e+308},{"rt":7,"r":1.7976931348623157e+308},{"rt":8,"r":1.7976931348623157e+308}]"] milvus-standalone | [2023/03/03 23:03:57.936 +00:00] [DEBUG] [querynode/flow_graph_service_time_node.go:70] ["update tSafe:"] [collectionID=439715925445053623] [tSafe=439847390474141697] [tSafe_p=2023/03/03 23:03:57.734 +00:00] [tsLag=202.635927ms] [channel=by-dev-rootcoord-dml_1_439715925445053623v0]
and
milvus-standalone | [2023/03/03 23:04:42.203 +00:00] [DEBUG] [indexcoord/index_coord.go:772] ["IndexCoord GetIndexInfos"] [collectionID=439715925445053623] [indexName=] [segIDs="[439715925445288290]"]
milvus-standalone | [2023/03/03 23:04:42.203 +00:00] [DEBUG] [indexcoord/index_coord.go:772] ["IndexCoord GetIndexInfos"] [collectionID=439715925445053623] [indexName=] [segIDs="[439833472321716287]"]
milvus-standalone | [2023/03/03 23:04:42.205 +00:00] [INFO] [indexcoord/index_coord.go:829] ["IndexCoord DescribeIndex"] [collectionID=439715925445053623] [indexName=]
milvus-standalone | [2023/03/03 23:04:42.206 +00:00] [INFO] [datacoord/services.go:730] ["received get flushed segments request"] [collectionID=439715925445053623] [partitionID=-1]
milvus-standalone | [2023/03/03 23:04:42.208 +00:00] [INFO] [indexcoord/index_coord.go:588] ["IndexCoord completeIndexInfo"] [collID=439715925445053623] [indexName=_default_idx_101]
milvus-standalone | [2023/03/03 23:04:42.208 +00:00] [DEBUG] [indexcoord/meta_table.go:651] ["IndexCoord get index states success"] [indexID=439715925445053665] [total=1] [None=0] [Unissued=0] [InProgress=0] [Finished=1] [Failed=0]
milvus-standalone | [2023/03/03 23:04:42.208 +00:00] [DEBUG] [indexcoord/meta_table.go:728] ["IndexCoord get index states success"] [indexID=439715925445053665] [indexRows=56903]
milvus-standalone | [2023/03/03 23:04:42.208 +00:00] [INFO] [indexcoord/index_coord.go:621] ["IndexCoord completeIndexInfo success"] [collID=439715925445053623] [totalRows=56903] [indexRows=56903] [state=Finished] [failReason=]
milvus-standalone | [2023/03/03 23:04:42.210 +00:00] [DEBUG] [indexcoord/index_coord.go:772] ["IndexCoord GetIndexInfos"] [collectionID=439807313437199999] [indexName=] [segIDs="[439833472321917563]"]
milvus-standalone | [2023/03/03 23:04:42.212 +00:00] [INFO] [indexcoord/index_coord.go:829] ["IndexCoord DescribeIndex"] [collectionID=439807313437199999] [indexName=]
milvus-standalone | [2023/03/03 23:04:42.213 +00:00] [INFO] [datacoord/services.go:730] ["received get flushed segments request"] [collectionID=439807313437199999] [partitionID=-1]
milvus-standalone | [2023/03/03 23:04:42.215 +00:00] [INFO] [indexcoord/index_coord.go:588] ["IndexCoord completeIndexInfo"] [collID=439807313437199999] [indexName=_default_idx_101]
milvus-standalone | [2023/03/03 23:04:42.215 +00:00] [DEBUG] [indexcoord/meta_table.go:651] ["IndexCoord get index states success"] [indexID=439807313437200027] [total=6] [None=0] [Unissued=0] [InProgress=0] [Finished=6] [Failed=0]
milvus-standalone | [2023/03/03 23:04:42.215 +00:00] [DEBUG] [indexcoord/meta_table.go:728] ["IndexCoord get index states success"] [indexID=439807313437200027] [indexRows=97659]
milvus-standalone | [2023/03/03 23:04:42.216 +00:00] [INFO] [indexcoord/index_coord.go:621] ["IndexCoord completeIndexInfo success"] [collID=439807313437199999] [totalRows=97659] [indexRows=97659] [state=Finished] [failReason=]
Anything else?
No response
@WeibingChen17 did you create an index before or after inserting? If you did, it could be building the index.
/assign @WeibingChen17 /unassign
I set a job to create index at 4am, insert 1000 entities every 15 min.
So there was an index building last night, and there will be another one tonight.
If that is the case, why? Does the index get out of sync with the entities? Should I create the index only after all entities are inserted? Where is this mentioned in the documentation?
/assign @yanliang567 /unassign
You don't need to create the index every time after an insert. Once you have created an index for a collection, Milvus automatically builds the index for newly inserted data. Besides, Milvus will compact small segments into bigger ones until the segment size is close to maxSize (512 MB by default), and the newly compacted segments get their indices built automatically as well. Check https://milvus.io/docs/product_faq.md#Can-indexes-be-created-after-inserting-vectors
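For example, a minimal sketch with pymilvus (the collection name, dimension, and random data below are made-up placeholders) that creates the index exactly once and then just keeps inserting:

```python
import random

from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    connections,
    utility,
)

connections.connect(host="localhost", port="19530")

if not utility.has_collection("demo"):
    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128),
    ]
    collection = Collection(name="demo", schema=CollectionSchema(fields=fields))
    # Build the index once, right after creating the collection.
    collection.create_index(
        field_name="embedding",
        index_params={
            "metric_type": "L2",
            "index_type": "IVF_FLAT",
            "params": {"nlist": 1024},
        },
    )
else:
    collection = Collection("demo")

collection.load()

# Later inserts do NOT need another create_index() call: segments flushed or
# compacted from this data pick up the same index definition automatically.
vectors = [[random.random() for _ in range(128)] for _ in range(1000)]
collection.insert([vectors])
```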
So how can I reduce the cpu usage? Thanks
Maybe you can use the pprof and perf tools to check which part is consuming the CPU.
We've already enabled the pprof server on Milvus. Open http://localhost:9876/debug/pprof in your browser and you will get some flame graphs on the Go side. If it's not Go-side consumption, then you will have to install perf tools inside your local Docker container.
OK. Thanks.
As I am using the standalone Docker version, I updated docker-compose.yml to map 9876 to host port 9876, but no luck accessing it from another machine: "Unable to connect".
I tried to use wget to fetch something, but no luck:
--2023-03-03 23:57:23-- (try: 5) http://localhost:9876/debug/pprof Connecting to localhost (localhost)|::1|:9876... connected. HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers. Retrying.
Any suggestions on the "have to install perf tools on your local docker" part? Thanks!
Sorry about the mistake, seems the port is 9091
OK. The http port
However, still no luck. I changed the mapping to 9092 because 9091 is occupied by another service.
$ wget localhost:9092
--2023-03-04 10:41:27-- http://localhost:9092/
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:9092... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.
--2023-03-04 10:41:29-- (try: 2) http://localhost:9092/
Connecting to localhost (localhost)|::1|:9092... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.
--2023-03-04 10:41:31-- (try: 3) http://localhost:9092/
Connecting to localhost (localhost)|::1|:9092... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.
^C
$ lsof -i :9092
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
docker-pr 980708 root 4u IPv4 1716218 0t0 TCP *:9092 (LISTEN)
docker-pr 980716 root 4u IPv6 1717096 0t0 TCP *:9092 (LISTEN)
Sorry my bad. I did 9092:9092, which should be 9091:9092. After I updated it, I got this error:
$ curl localhost:9092/api/v1/health
curl: (56) Recv failure: Connection reset by peer
OK. Got it fixed. It should be 9092:9091. Thanks for your help.
But unfortunately there is no flame graph on the Go side:
/debug/pprof/
Types of profiles available:
Count | Profile
-- | --
368 | allocs
0 | block
0 | cmdline
573 | goroutine
368 | heap
0 | mutex
0 | profile
17 | threadcreate
0 | trace
OK.
$ go tool pprof http://localhost:9092/debug/pprof/profile?seconds=30
and upload the file here.
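If Go tooling is not installed on that machine, a minimal sketch like this (Python standard library only; it assumes the same localhost:9092 -> container 9091 mapping as above) just downloads the raw CPU profile so it can be attached here or analyzed elsewhere:

```python
import urllib.request

# Pull a 30-second CPU profile from the pprof endpoint exposed on the mapped port.
URL = "http://localhost:9092/debug/pprof/profile?seconds=30"

# The server only responds after the sampling window ends, so allow well over 30 s.
with urllib.request.urlopen(URL, timeout=90) as resp, open("cpu.pb.gz", "wb") as out:
    out.write(resp.read())

print("Saved cpu.pb.gz; inspect it with: go tool pprof cpu.pb.gz")
```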
More details:
The heap profile:
File: milvus
Build ID: a43638bb2a61d7c7130ca994643ae9fd205a64ac
Type: inuse_space
Time: Mar 4, 2023 at 11:57am (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 66.81MB, 88.68% of 75.34MB total
Showing top 20 nodes out of 348
flat flat% sum% cum cum%
18.01MB 23.90% 23.90% 18.01MB 23.90% go.uber.org/zap/zapcore.newCounters
13.70MB 18.18% 42.08% 13.70MB 18.18% google.golang.org/protobuf/internal/impl.consumeStringSliceValidateUTF8
4.64MB 6.16% 48.24% 5.64MB 7.49% google.golang.org/protobuf/proto.MarshalOptions.marshal
4.02MB 5.34% 53.58% 4.02MB 5.34% github.com/milvus-io/milvus/internal/mq/msgstream.NewMqMsgStream
3.36MB 4.47% 58.05% 3.36MB 4.47% github.com/milvus-io/milvus/internal/util/typeutil.AppendFieldData
2.66MB 3.53% 61.58% 2.66MB 3.53% google.golang.org/grpc/internal/transport.newBufWriter
2.43MB 3.23% 64.81% 2.43MB 3.23% github.com/goccy/go-json/internal/decoder.init.0
2.41MB 3.20% 68.01% 7.28MB 9.66% github.com/milvus-io/milvus/internal/proxy.reduceRetrieveResults
2.06MB 2.74% 70.74% 2.06MB 2.74% bufio.NewReaderSize
1.70MB 2.26% 73.00% 1.70MB 2.26% github.com/klauspost/compress/zstd.(*blockDec).reset
1.60MB 2.12% 75.12% 1.60MB 2.12% github.com/milvus-io/milvus/internal/mq/msgstream/mqwrapper/rmq.(*rmqClient).Subscribe
1.56MB 2.07% 77.19% 1.56MB 2.07% bytes.makeSlice
1.56MB 2.07% 79.26% 1.56MB 2.07% github.com/milvus-io/milvus/internal/config.(*FileSource).loadFromFile
1.50MB 1.99% 81.25% 1.50MB 1.99% github.com/milvus-io/milvus/internal/util/typeutil.GetPK
1.33MB 1.76% 83.01% 1.33MB 1.76% github.com/goccy/go-json/internal/encoder.init.0
1.16MB 1.55% 84.56% 1.16MB 1.55% github.com/milvus-io/milvus/internal/allocator.(*CachedAllocator).Init
1.07MB 1.42% 85.98% 1.07MB 1.42% google.golang.org/protobuf/internal/strs.(*Builder).AppendFullName
1MB 1.33% 87.30% 1MB 1.33% regexp.onePassCopy
0.52MB 0.69% 87.99% 1.02MB 1.35% github.com/milvus-io/milvus/internal/config.(*Manager).pullSourceConfigs
0.51MB 0.68% 88.68% 0.51MB 0.68% io.ReadAll
Could it be due to my schema using a file path as the ID?
```python
def build_collection_if_not_exists(self) -> None:
    if not utility.has_collection(self.collection_name):
        dim = self.embedder.get_dim()
        filename = FieldSchema(
            name="filename",
            dtype=DataType.VARCHAR,
            max_length=1024,
            is_primary=True,
        )
        embedding = FieldSchema(
            name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim
        )
        schema = CollectionSchema(fields=[filename, embedding])
        Collection(
            name=self.collection_name, schema=schema, using="default", shards_num=1
        )
    self.build_index()

def build_index(self) -> None:
    self.collection.release()
    index_params = {
        "metric_type": "L2",
        "index_type": "IVF_FLAT",
        "params": {"nlist": 1024},
    }
    log_with_time(f"Start building index with {index_params}")
    self.collection.create_index(field_name="embedding", index_params=index_params)
    self.collection.load()
    log_with_time("Finish building index")
```
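For comparison, an INT64 auto-id primary-key variant of the same schema (keeping the file path as an ordinary, non-primary VARCHAR field; this is just an illustrative sketch, with `dim` and `self.collection_name` as above) would look roughly like this:

```python
# Hypothetical variant: let Milvus generate an INT64 primary key automatically
# and store the file path as a plain VARCHAR field instead of using it as the PK.
pk = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True)
filename = FieldSchema(name="filename", dtype=DataType.VARCHAR, max_length=1024)
embedding = FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim)
schema = CollectionSchema(fields=[pk, filename, embedding])
Collection(name=self.collection_name, schema=schema, using="default", shards_num=1)
```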
From the stack you provided, most of the CPU is spent in RocksDB. Milvus keeps sending time ticks even without any writes, and a time tick is sent every 200 ms per physical channel. You can either:
- change the MQ implementation to Pulsar or Kafka, or
- reduce rootCoord.dmlChannelNum to 32 or 64 if you don't have that many collections.
Thanks, Xiaofan!
I also saw the rocksdb consume/deliver calls, but I still do not know why these time ticks cause such high CPU usage.
I am new to this tool, so I have no idea how to change these configurations in the Docker container. If it is not related to my schema, others may hit the same issue using the standalone Docker version. Do you have any plan to change this?
Thanks again!
By default the channel number is 256, since on large clusters with more nodes and collections it helps improve performance.
For a smaller instance, changing the default channel number makes sense to me.
@czs007 @congqixia @jiaoew1991 Do you think it's a good idea to change the default channel number to 32 or 64? With the help of the new msgstream dispatcher design, this might be a serious problem for most users.
It makes sense to reduce the default channel number. There is another improvement we could discuss: whether idle channels should still need high CPU usage.
One more thing: I spent some time migrating my schema to an INT64-based primary key instead of the VARCHAR(1024) primary key. The CPU usage is now 6-10% most of the time instead of 12-16%. I attached the new pprof profile, and rocksmq is no longer the heaviest part now: pprof.milvus.samples.cpu.003.pb.gz
Just FYI. It could be a red herring.
Question: did you write while you were seeing that CPU usage? I have no idea why the CPU usage would be related to the PK type.
By the way, all the stacks look as expected for now. There does not seem to be much we can do except reducing the channel number and sync frequency.
No, I did not do any writing while measuring that CPU usage. Does the message include information on the PK?
Did it help if you reduced the shard number?
As you could see above, I used shards_num=1 for this collection.
/assign @yanliang567
@yanliang567 is there any way we can reproduce this in house?
Could not reproduce it in house; memory and CPU keep stable.
I guess this might only be reproducible on 2c8g machines or smaller. @WeibingChen17 what is your pod CPU limitation?
Or did you change any default settings?
No, I did not make any changes to the settings. I used docker-compose to launch the Milvus standalone instance and did not change any configuration inside. The only change was the mapping for port 9091, which was occupied by another service.
My CPU and memory are:
CPU/Memory: Intel(R) Core(TM) i3-2350M CPU @ 2.30GHz / 16G
Did you capture a pprof profile for that CPU usage?