
[Bug]: CPU as high as 10% even no search/insert/query actions

Open WeibingChen17 opened this issue 1 year ago • 41 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: Milvus 2.2.2
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2): 2.2.0
- OS(Ubuntu or CentOS): Debian 11.6
- CPU/Memory: Intel(R) Core(TM) i3-2350M CPU @ 2.30GHz / 16G
- GPU: None
- Others:

Current Behavior

After I used docker-compose to start the standalone version of Milvus, the CPU usage stays as high as 10% even when no action is taken.

Is this normal?

I just inserted 10,000 entities into the database. Is it building an index?

Expected Behavior

When not in use, the CPU usage should be low.

Steps To Reproduce

No response

Milvus Log

It keeps printing logs like these:

milvus-standalone | [2023/03/03 23:03:56.944 +00:00] [DEBUG] [datacoord/services.go:877] [DataCoord.GetMetrics] [node_id=133] [req="{"metric_type":"system_info"}"] [metric_type=system_info] [metrics="status:<> response:"{\"cluster\":{\"self\":{\"has_error\":false,\"error_reason\":\"\",\"name\":\"datacoord133\",\"hardware_infos\":{\"ip\":\"172.20.0.4:13333\",\"cpu_core_count\":4,\"cpu_core_usage\":0,\"memory\":16663109632,\"memory_usage\":3798999040,\"disk\":104857600,\"disk_usage\":2097152},\"system_info\":{\"system_version\":\"0fdc1a04\",\"deploy_mode\":\"STANDALONE\",\"build_version\":\"v2.2.2\",\"build_time\":\"Thu Dec 22 09:08:37 UTC 2022\",\"used_go_version\":\"go version go1.18.3 linux/amd64\"},\"created_time\":\"2023-03-03 22:58:42.215947162 +0000 UTC m=+4.471734750\",\"updated_time\":\"2023-03-03 22:58:42.215947492 +0000 UTC m=+4.471735031\",\"type\":\"datacoord\",\"id\":133,\"system_configurations\":{\"segment_max_size\":512},\"quota_metrics\":{\"TotalBinlogSize\":503362742}},\"connected_nodes\":[{\"has_error\":false,\"error_reason\":\"\",\"name\":\"datanode134\",\"hardware_infos\":{\"ip\":\"172.20.0.4:21124\",\"cpu_core_count\":4,\"cpu_core_usage\":100,\"memory\":16663109632,\"memory_usage\":3798999040,\"disk\":104857600,\"disk_usage\":2097152},\"system_info\":{\"system_version\":\"0fdc1a04\",\"deploy_mode\":\"STANDALONE\",\"build_version\":\"v2.2.2\",\"build_time\":\"Thu Dec 22 09:08:37 UTC 2022\",\"used_go_version\":\"go version go1.18.3 linux/amd64\"},\"created_time\":\"2023-03-03 22:58:42.401055749 +0000 UTC m=+4.656843300\",\"updated_time\":\"2023-03-03 22:58:42.401056126 +0000 UTC m=+4.656843681\",\"type\":\"datanode\",\"id\":134,\"system_configurations\":{\"flush_insert_buffer_size\":16777216},\"quota_metrics\":{\"Hms\":{\"ip\":\"172.20.0.4:21124\",\"cpu_core_count\":4,\"cpu_core_usage\":100,\"memory\":16663109632,\"memory_usage\":3798999040,\"disk\":104857600,\"disk_usage\":2097152},\"Rms\":[{\"Label\":\"InsertConsumeThroughput\",\"Rate\":0},{\"Label\":\"DeleteConsumeThroughput\",\"Rate\":0}],\"Fgm\":{\"MinFlowGraphChannel\":\"by-dev-rootcoord-dml_1_439715925445053623v0\",\"MinFlowGraphTt\":439847390212259841,\"NumFlowGraph\":2}}}]},\"connections\":{\"name\":\"datacoord133\",\"connected_components\":[]}}" component_name:"datacoord133" "] []
milvus-standalone | [2023/03/03 23:03:56.946 +00:00] [INFO] [proxy/multi_rate_limiter.go:96] ["RateLimiter setRates"] [rates="[{"rt":5,"r":1.7976931348623157e+308},{"rt":6,"r":1.7976931348623157e+308},{"rt":9,"r":1.7976931348623157e+308},{"rt":7,"r":1.7976931348623157e+308},{"rt":8,"r":1.7976931348623157e+308}]"]
milvus-standalone | [2023/03/03 23:03:57.936 +00:00] [DEBUG] [querynode/flow_graph_service_time_node.go:70] ["update tSafe:"] [collectionID=439715925445053623] [tSafe=439847390474141697] [tSafe_p=2023/03/03 23:03:57.734 +00:00] [tsLag=202.635927ms] [channel=by-dev-rootcoord-dml_1_439715925445053623v0]

and

milvus-standalone | [2023/03/03 23:04:42.203 +00:00] [DEBUG] [indexcoord/index_coord.go:772] ["IndexCoord GetIndexInfos"] [collectionID=439715925445053623] [indexName=] [segIDs="[439715925445288290]"]
milvus-standalone | [2023/03/03 23:04:42.203 +00:00] [DEBUG] [indexcoord/index_coord.go:772] ["IndexCoord GetIndexInfos"] [collectionID=439715925445053623] [indexName=] [segIDs="[439833472321716287]"]
milvus-standalone | [2023/03/03 23:04:42.205 +00:00] [INFO] [indexcoord/index_coord.go:829] ["IndexCoord DescribeIndex"] [collectionID=439715925445053623] [indexName=]
milvus-standalone | [2023/03/03 23:04:42.206 +00:00] [INFO] [datacoord/services.go:730] ["received get flushed segments request"] [collectionID=439715925445053623] [partitionID=-1]
milvus-standalone | [2023/03/03 23:04:42.208 +00:00] [INFO] [indexcoord/index_coord.go:588] ["IndexCoord completeIndexInfo"] [collID=439715925445053623] [indexName=_default_idx_101]
milvus-standalone | [2023/03/03 23:04:42.208 +00:00] [DEBUG] [indexcoord/meta_table.go:651] ["IndexCoord get index states success"] [indexID=439715925445053665] [total=1] [None=0] [Unissued=0] [InProgress=0] [Finished=1] [Failed=0]
milvus-standalone | [2023/03/03 23:04:42.208 +00:00] [DEBUG] [indexcoord/meta_table.go:728] ["IndexCoord get index states success"] [indexID=439715925445053665] [indexRows=56903]
milvus-standalone | [2023/03/03 23:04:42.208 +00:00] [INFO] [indexcoord/index_coord.go:621] ["IndexCoord completeIndexInfo success"] [collID=439715925445053623] [totalRows=56903] [indexRows=56903] [state=Finished] [failReason=]
milvus-standalone | [2023/03/03 23:04:42.210 +00:00] [DEBUG] [indexcoord/index_coord.go:772] ["IndexCoord GetIndexInfos"] [collectionID=439807313437199999] [indexName=] [segIDs="[439833472321917563]"]
milvus-standalone | [2023/03/03 23:04:42.212 +00:00] [INFO] [indexcoord/index_coord.go:829] ["IndexCoord DescribeIndex"] [collectionID=439807313437199999] [indexName=]
milvus-standalone | [2023/03/03 23:04:42.213 +00:00] [INFO] [datacoord/services.go:730] ["received get flushed segments request"] [collectionID=439807313437199999] [partitionID=-1]
milvus-standalone | [2023/03/03 23:04:42.215 +00:00] [INFO] [indexcoord/index_coord.go:588] ["IndexCoord completeIndexInfo"] [collID=439807313437199999] [indexName=_default_idx_101]
milvus-standalone | [2023/03/03 23:04:42.215 +00:00] [DEBUG] [indexcoord/meta_table.go:651] ["IndexCoord get index states success"] [indexID=439807313437200027] [total=6] [None=0] [Unissued=0] [InProgress=0] [Finished=6] [Failed=0]
milvus-standalone | [2023/03/03 23:04:42.215 +00:00] [DEBUG] [indexcoord/meta_table.go:728] ["IndexCoord get index states success"] [indexID=439807313437200027] [indexRows=97659]
milvus-standalone | [2023/03/03 23:04:42.216 +00:00] [INFO] [indexcoord/index_coord.go:621] ["IndexCoord completeIndexInfo success"] [collID=439807313437199999] [totalRows=97659] [indexRows=97659] [state=Finished] [failReason=]

Anything else?

No response

WeibingChen17 avatar Mar 03 '23 23:03 WeibingChen17

@WeibingChen17 did you create the index before or after inserting? If so, it could be building the index.

/assign @WeibingChen17 /unassign

yanliang567 avatar Mar 04 '23 02:03 yanliang567

I set up a job to create the index at 4am and to insert 1,000 entities every 15 minutes.

So there was an index building last night, and there will be another one tonight.

If that is the case, why? Is the index out of sync with the entities? Should I create indices only after all entities are inserted? Where does the documentation mention this?

/assign @yanliang567 /unassign

WeibingChen17 avatar Mar 04 '23 02:03 WeibingChen17

You don't need to create an index every time after an insert. Once you have created an index for a collection, Milvus automatically builds the index for newly inserted data. Besides, Milvus will try to compact small segments into bigger ones until the segment size is close to maxSize (512 MB by default), and indices are built automatically for the new compacted segments as well. Check https://milvus.io/docs/product_faq.md#Can-indexes-be-created-after-inserting-vectors
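
Roughly, the pattern looks like this (a minimal pymilvus sketch, not your exact code; the collection name, field names, and dimension are just examples):

from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    connections,
)

connections.connect(host="localhost", port="19530")

# Example collection: INT64 auto-id primary key plus a 128-dim float vector field.
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128),
]
collection = Collection("demo_collection", CollectionSchema(fields))

# Create the index once, right after creating the collection.
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 1024}},
)

# Later inserts do not need another create_index() call: Milvus builds the index
# for newly inserted data, and for compacted segments, automatically.
vectors = [[0.0] * 128 for _ in range(1000)]
collection.insert([vectors])
collection.flush()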

yanliang567 avatar Mar 04 '23 02:03 yanliang567

So how can I reduce the cpu usage? Thanks 

WeibingChen17 avatar Mar 04 '23 04:03 WeibingChen17

So how can I reduce the cpu usage? Thanks

Maybe you can use pprof and the perf tool to check which part is consuming the CPU.

xiaofan-luan avatar Mar 04 '23 07:03 xiaofan-luan

We've already enabled the pprof server on Milvus. Open http://localhost:9876/debug/pprof in your browser and you will get some flame graphs for the Go side. If it's not Go-side consumption, then you will have to install perf tools in your local Docker container.

xiaofan-luan avatar Mar 04 '23 07:03 xiaofan-luan

OK. Thanks.

As I am using the standalone Docker version, I updated the docker-compose.yml to map 9876 to host port 9876. But no luck accessing it from another machine: "Unable to connect".

I tried to use wget to fetch something, but no luck:

--2023-03-03 23:57:23-- (try: 5) http://localhost:9876/debug/pprof
Connecting to localhost (localhost)|::1|:9876... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

Any suggestion on "have to install perf tools on your local docker"? Thanks!

WeibingChen17 avatar Mar 04 '23 07:03 WeibingChen17

Sorry about the mistake; it seems the port is 9091.

xiaofan-luan avatar Mar 04 '23 14:03 xiaofan-luan

OK, the HTTP port.

However, still no luck. I changed the mapping to 9092 because 9091 is occupied by another service on the host.

$ wget localhost:9092
--2023-03-04 10:41:27--  http://localhost:9092/
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:9092... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

--2023-03-04 10:41:29--  (try: 2)  http://localhost:9092/
Connecting to localhost (localhost)|::1|:9092... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

--2023-03-04 10:41:31--  (try: 3)  http://localhost:9092/
Connecting to localhost (localhost)|::1|:9092... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

^C

$ lsof -i :9092
COMMAND      PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
docker-pr 980708 root    4u  IPv4 1716218      0t0  TCP *:9092 (LISTEN)
docker-pr 980716 root    4u  IPv6 1717096      0t0  TCP *:9092 (LISTEN)

WeibingChen17 avatar Mar 04 '23 18:03 WeibingChen17

Sorry my bad. I did 9092:9092, which should be 9091:9092. After I updated it, I got this error:

$ curl localhost:9092/api/v1/health
curl: (56) Recv failure: Connection reset by peer

WeibingChen17 avatar Mar 04 '23 18:03 WeibingChen17

OK. Got it fixed. It should be 9092:9091. Thanks for your help.
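
For reference, the relevant part of my docker-compose.yml now looks roughly like this (service name and the other ports are simplified from the stock standalone compose file):

services:
  standalone:
    image: milvusdb/milvus:v2.2.2
    ports:
      - "19530:19530"   # gRPC port used by the SDK
      - "9092:9091"     # host:container; Milvus HTTP/metrics/pprof port, remapped to 9092 on the host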

But unfortunately there is no flame graph on that page:

/debug/pprof/

Types of profiles available:


Count | Profile
-- | --
368 | allocs
0 | block
0 | cmdline
573 | goroutine
368 | heap
0 | mutex
0 | profile
17 | threadcreate
0 | trace

WeibingChen17 avatar Mar 04 '23 19:03 WeibingChen17

OK.

$ go tool pprof http://localhost:9092/debug/pprof/profile?seconds=30

and uploaded the file here:

pprof.milvus.samples.cpu.001.pb.gz

WeibingChen17 avatar Mar 04 '23 19:03 WeibingChen17

More details:

The heap profile:

File: milvus
Build ID: a43638bb2a61d7c7130ca994643ae9fd205a64ac
Type: inuse_space
Time: Mar 4, 2023 at 11:57am (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 66.81MB, 88.68% of 75.34MB total
Showing top 20 nodes out of 348
      flat  flat%   sum%        cum   cum%
   18.01MB 23.90% 23.90%    18.01MB 23.90%  go.uber.org/zap/zapcore.newCounters
   13.70MB 18.18% 42.08%    13.70MB 18.18%  google.golang.org/protobuf/internal/impl.consumeStringSliceValidateUTF8
    4.64MB  6.16% 48.24%     5.64MB  7.49%  google.golang.org/protobuf/proto.MarshalOptions.marshal
    4.02MB  5.34% 53.58%     4.02MB  5.34%  github.com/milvus-io/milvus/internal/mq/msgstream.NewMqMsgStream
    3.36MB  4.47% 58.05%     3.36MB  4.47%  github.com/milvus-io/milvus/internal/util/typeutil.AppendFieldData
    2.66MB  3.53% 61.58%     2.66MB  3.53%  google.golang.org/grpc/internal/transport.newBufWriter
    2.43MB  3.23% 64.81%     2.43MB  3.23%  github.com/goccy/go-json/internal/decoder.init.0
    2.41MB  3.20% 68.01%     7.28MB  9.66%  github.com/milvus-io/milvus/internal/proxy.reduceRetrieveResults
    2.06MB  2.74% 70.74%     2.06MB  2.74%  bufio.NewReaderSize
    1.70MB  2.26% 73.00%     1.70MB  2.26%  github.com/klauspost/compress/zstd.(*blockDec).reset
    1.60MB  2.12% 75.12%     1.60MB  2.12%  github.com/milvus-io/milvus/internal/mq/msgstream/mqwrapper/rmq.(*rmqClient).Subscribe
    1.56MB  2.07% 77.19%     1.56MB  2.07%  bytes.makeSlice
    1.56MB  2.07% 79.26%     1.56MB  2.07%  github.com/milvus-io/milvus/internal/config.(*FileSource).loadFromFile
    1.50MB  1.99% 81.25%     1.50MB  1.99%  github.com/milvus-io/milvus/internal/util/typeutil.GetPK
    1.33MB  1.76% 83.01%     1.33MB  1.76%  github.com/goccy/go-json/internal/encoder.init.0
    1.16MB  1.55% 84.56%     1.16MB  1.55%  github.com/milvus-io/milvus/internal/allocator.(*CachedAllocator).Init
    1.07MB  1.42% 85.98%     1.07MB  1.42%  google.golang.org/protobuf/internal/strs.(*Builder).AppendFullName
       1MB  1.33% 87.30%        1MB  1.33%  regexp.onePassCopy
    0.52MB  0.69% 87.99%     1.02MB  1.35%  github.com/milvus-io/milvus/internal/config.(*Manager).pullSourceConfigs
    0.51MB  0.68% 88.68%     0.51MB  0.68%  io.ReadAll

Could it be due to my schema using a file path as the ID?

    # Methods from my wrapper class; self.embedder, self.collection, and
    # log_with_time are defined elsewhere, and the pymilvus imports are:
    # from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, utility
    def build_collection_if_not_exists(self) -> None:
        if not utility.has_collection(self.collection_name):
            dim = self.embedder.get_dim()
            filename = FieldSchema(
                name="filename",
                dtype=DataType.VARCHAR,
                max_length=1024,
                is_primary=True,
            )
            embedding = FieldSchema(
                name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim
            )
            schema = CollectionSchema(fields=[filename, embedding])
            Collection(
                name=self.collection_name, schema=schema, using="default", shards_num=1
            )
            self.build_index()

    def build_index(self) -> None:
        self.collection.release()
        index_params = {
            "metric_type": "L2",
            "index_type": "IVF_FLAT",
            "params": {"nlist": 1024},
        }
        log_with_time(f"Start building index with {index_params}")
        self.collection.create_index(field_name="embedding", index_params=index_params)
        self.collection.load()
        log_with_time("Finish building index")
WeibingChen17 avatar Mar 04 '23 20:03 WeibingChen17

From the stack you offered, most of the CPU is spent on RocksDB. Milvus keeps sending time ticks even without any writes, and a time tick happens every 200 ms per physical channel. You can either:

  1. change the MQ implementation to Pulsar or Kafka, or
  2. reduce rootCoord.dmlChannelNum to 32 or 64 if you don't have that many collections (see the sketch below).
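
For example, with the standalone docker-compose deployment you could override the setting by mounting a custom milvus.yaml into the container (a sketch; the mount path and value are what I would try, adjust as needed):

# In a custom milvus.yaml mounted into the container, e.g. via docker-compose.yml:
#   volumes:
#     - ./milvus.yaml:/milvus/configs/milvus.yaml
rootCoord:
  dmlChannelNum: 64   # default is 256; 32 or 64 is enough if you only have a few collections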

xiaofan-luan avatar Mar 05 '23 03:03 xiaofan-luan

Thanks, Xiaofan!

I also saw the rocksdb consume/deliver calls, but I still do not know why this tick can cause high CPU usage.

I am new to this tool, so I have no idea how to change these configurations in the Docker container. If it is not related to my schema, others may hit the same issue with the standalone Docker version. Do you have any plans to change this?

Thanks again!

WeibingChen17 avatar Mar 05 '23 03:03 WeibingChen17

By default the channel number is 256 since on large clusters with more nodes and collections it helps to improve performance.

For a smaller instance, changing the default channel number makes sense to me.

@czs007 @congqixia @jiaoew1991 Do you think it's a good idea to change the default channel number to 32 or 64? With the help of the new msgstream dispatcher design, this might not be a serious problem for most users.

xiaofan-luan avatar Mar 05 '23 07:03 xiaofan-luan

By default the channel number is 256 since on large clusters with more nodes and collections it helps to improve performance.

For a smaller instance, changing the default channel number makes sense to me.

@czs007 @congqixia @jiaoew1991 Do you think it's a good idea to change the default channel number to 32 or 64? With the help of the new msgstream dispatcher design, this might not be a serious problem for most users.

It makes sense to reduce the default channel number. There is another improvement we could discuss: whether idle channels should still require high CPU usage.

congqixia avatar Mar 05 '23 08:03 congqixia

One more thing: I spent some time migrating my schema to an INT64-based primary key instead of the VARCHAR(1024) primary key. The CPU usage is now 6-10% most of the time instead of 12-16%. I attached the new pprof profile, and I saw that rocksmq is no longer the heaviest part: pprof.milvus.samples.cpu.003.pb.gz

Just FYI. It could be a red herring.
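
Roughly, the primary key change looks like this (a simplified sketch, not my exact schema; here the file path is kept as an ordinary VARCHAR field instead of the primary key):

# INT64 auto-id primary key; the file path becomes a normal scalar field.
pk = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True)
filename = FieldSchema(name="filename", dtype=DataType.VARCHAR, max_length=1024)
embedding = FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim)
schema = CollectionSchema(fields=[pk, filename, embedding])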

WeibingChen17 avatar Mar 05 '23 19:03 WeibingChen17

Question: did you do any writes while you saw that CPU usage? I have no idea why CPU usage would be related to the PK type.

xiaofan-luan avatar Mar 06 '23 09:03 xiaofan-luan

By the way, all the stacks look as expected for now. There doesn't seem to be much we can do except reduce the channel number and the sync frequency.

xiaofan-luan avatar Mar 06 '23 09:03 xiaofan-luan

No, I did not do any writing during that CPU usage. Does the message include information about the PK?

WeibingChen17 avatar Mar 07 '23 06:03 WeibingChen17

No, I did not do any writing during that CPU usage. Does the message include information about the PK?

Did it help if you reduced the shard number?

xiaofan-luan avatar Mar 10 '23 10:03 xiaofan-luan

As you could see above, I used shards_num=1 for this collection.

WeibingChen17 avatar Mar 10 '23 21:03 WeibingChen17

As you could see above, I used shards_num=1 for this collection.

/assign @yanliang567

xiaofan-luan avatar Mar 11 '23 12:03 xiaofan-luan

@yanliang567 is there any way we can reproduce this in-house?

xiaofan-luan avatar Mar 11 '23 12:03 xiaofan-luan

Not reproduced in-house; memory and CPU stay stable (screenshot attached).

yanliang567 avatar Mar 20 '23 02:03 yanliang567

I guess this might only be reproducible on 2C/8G machines or smaller. @WeibingChen17 what is your pod CPU limit?

xiaofan-luan avatar Mar 20 '23 13:03 xiaofan-luan

or did you change any default settings?

xiaofan-luan avatar Mar 20 '23 13:03 xiaofan-luan

No, I did not make any changes to the settings. I used docker-compose to launch the Milvus standalone instance and did not change any configuration inside. The only change is the mapping for port 9091, which was occupied by other services on the host.

My cpu and memory is

CPU/Memory: Intel(R) Core(TM) i3-2350M CPU @ 2.30GHz / 16G

WeibingChen17 avatar Mar 22 '23 06:03 WeibingChen17

No, I did not make any changes to the settings. I used docker-compose to launch the Milvus standalone instance and did not change any configuration inside. The only change is the mapping for port 9091, which was occupied by other services on the host.

My cpu and memory is

CPU/Memory: Intel(R) Core(TM) i3-2350M CPU @ 2.30GHz / 16G

Do you have a pprof profile for that CPU usage?

xiaofan-luan avatar Mar 22 '23 07:03 xiaofan-luan