
[Bug]: CPU as high as 10% even no search/insert/query actions

Open WeibingChen17 opened this issue 1 year ago • 41 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: Milvus 2.2.2
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2): 2.2.0
- OS(Ubuntu or CentOS): Debian 11.6
- CPU/Memory: Intel(R) Core(TM) i3-2350M CPU @ 2.30GHz / 16G
- GPU: None
- Others:

Current Behavior

After I used docker-compose to start the standalone version of Milvus, the CPU usage stays as high as 10% even when no action is taken.

Is this normal?

I just inserted 10,000 entities into the database. Is it building an index?

Expected Behavior

When not in use, the CPU usage should be low.

Steps To Reproduce

No response

Milvus Log

It keeps printing logs like these:

milvus-standalone | [2023/03/03 23:03:56.944 +00:00] [DEBUG] [datacoord/services.go:877] [DataCoord.GetMetrics] [node_id=133] [req="{"metric_type":"system_info"}"] [metric_type=system_info] [metrics="status:<> response:"{\"cluster\":{\"self\":{\"has_error\":false,\"error_reason\":\"\",\"name\":\"datacoord133\",\"hardware_infos\":{\"ip\":\"172.20.0.4:13333\",\"cpu_core_count\":4,\"cpu_core_usage\":0,\"memory\":16663109632,\"memory_usage\":3798999040,\"disk\":104857600,\"disk_usage\":2097152},\"system_info\":{\"system_version\":\"0fdc1a04\",\"deploy_mode\":\"STANDALONE\",\"build_version\":\"v2.2.2\",\"build_time\":\"Thu Dec 22 09:08:37 UTC 2022\",\"used_go_version\":\"go version go1.18.3 linux/amd64\"},\"created_time\":\"2023-03-03 22:58:42.215947162 +0000 UTC m=+4.471734750\",\"updated_time\":\"2023-03-03 22:58:42.215947492 +0000 UTC m=+4.471735031\",\"type\":\"datacoord\",\"id\":133,\"system_configurations\":{\"segment_max_size\":512},\"quota_metrics\":{\"TotalBinlogSize\":503362742}},\"connected_nodes\":[{\"has_error\":false,\"error_reason\":\"\",\"name\":\"datanode134\",\"hardware_infos\":{\"ip\":\"172.20.0.4:21124\",\"cpu_core_count\":4,\"cpu_core_usage\":100,\"memory\":16663109632,\"memory_usage\":3798999040,\"disk\":104857600,\"disk_usage\":2097152},\"system_info\":{\"system_version\":\"0fdc1a04\",\"deploy_mode\":\"STANDALONE\",\"build_version\":\"v2.2.2\",\"build_time\":\"Thu Dec 22 09:08:37 UTC 2022\",\"used_go_version\":\"go version go1.18.3 linux/amd64\"},\"created_time\":\"2023-03-03 22:58:42.401055749 +0000 UTC m=+4.656843300\",\"updated_time\":\"2023-03-03 22:58:42.401056126 +0000 UTC m=+4.656843681\",\"type\":\"datanode\",\"id\":134,\"system_configurations\":{\"flush_insert_buffer_size\":16777216},\"quota_metrics\":{\"Hms\":{\"ip\":\"172.20.0.4:21124\",\"cpu_core_count\":4,\"cpu_core_usage\":100,\"memory\":16663109632,\"memory_usage\":3798999040,\"disk\":104857600,\"disk_usage\":2097152},\"Rms\":[{\"Label\":\"InsertConsumeThroughput\",\"Rate\":0},{\"Label\":\"DeleteConsumeThroughput\",\"Rate\":0}],\"Fgm\":{\"MinFlowGraphChannel\":\"by-dev-rootcoord-dml_1_439715925445053623v0\",\"MinFlowGraphTt\":439847390212259841,\"NumFlowGraph\":2}}}]},\"connections\":{\"name\":\"datacoord133\",\"connected_components\":[]}}" component_name:"datacoord133" "] []
milvus-standalone | [2023/03/03 23:03:56.946 +00:00] [INFO] [proxy/multi_rate_limiter.go:96] ["RateLimiter setRates"] [rates="[{"rt":5,"r":1.7976931348623157e+308},{"rt":6,"r":1.7976931348623157e+308},{"rt":9,"r":1.7976931348623157e+308},{"rt":7,"r":1.7976931348623157e+308},{"rt":8,"r":1.7976931348623157e+308}]"]
milvus-standalone | [2023/03/03 23:03:57.936 +00:00] [DEBUG] [querynode/flow_graph_service_time_node.go:70] ["update tSafe:"] [collectionID=439715925445053623] [tSafe=439847390474141697] [tSafe_p=2023/03/03 23:03:57.734 +00:00] [tsLag=202.635927ms] [channel=by-dev-rootcoord-dml_1_439715925445053623v0]

and

milvus-standalone | [2023/03/03 23:04:42.203 +00:00] [DEBUG] [indexcoord/index_coord.go:772] ["IndexCoord GetIndexInfos"] [collectionID=439715925445053623] [indexName=] [segIDs="[439715925445288290]"]
milvus-standalone | [2023/03/03 23:04:42.203 +00:00] [DEBUG] [indexcoord/index_coord.go:772] ["IndexCoord GetIndexInfos"] [collectionID=439715925445053623] [indexName=] [segIDs="[439833472321716287]"]
milvus-standalone | [2023/03/03 23:04:42.205 +00:00] [INFO] [indexcoord/index_coord.go:829] ["IndexCoord DescribeIndex"] [collectionID=439715925445053623] [indexName=]
milvus-standalone | [2023/03/03 23:04:42.206 +00:00] [INFO] [datacoord/services.go:730] ["received get flushed segments request"] [collectionID=439715925445053623] [partitionID=-1]
milvus-standalone | [2023/03/03 23:04:42.208 +00:00] [INFO] [indexcoord/index_coord.go:588] ["IndexCoord completeIndexInfo"] [collID=439715925445053623] [indexName=_default_idx_101]
milvus-standalone | [2023/03/03 23:04:42.208 +00:00] [DEBUG] [indexcoord/meta_table.go:651] ["IndexCoord get index states success"] [indexID=439715925445053665] [total=1] [None=0] [Unissued=0] [InProgress=0] [Finished=1] [Failed=0]
milvus-standalone | [2023/03/03 23:04:42.208 +00:00] [DEBUG] [indexcoord/meta_table.go:728] ["IndexCoord get index states success"] [indexID=439715925445053665] [indexRows=56903]
milvus-standalone | [2023/03/03 23:04:42.208 +00:00] [INFO] [indexcoord/index_coord.go:621] ["IndexCoord completeIndexInfo success"] [collID=439715925445053623] [totalRows=56903] [indexRows=56903] [state=Finished] [failReason=]
milvus-standalone | [2023/03/03 23:04:42.210 +00:00] [DEBUG] [indexcoord/index_coord.go:772] ["IndexCoord GetIndexInfos"] [collectionID=439807313437199999] [indexName=] [segIDs="[439833472321917563]"]
milvus-standalone | [2023/03/03 23:04:42.212 +00:00] [INFO] [indexcoord/index_coord.go:829] ["IndexCoord DescribeIndex"] [collectionID=439807313437199999] [indexName=]
milvus-standalone | [2023/03/03 23:04:42.213 +00:00] [INFO] [datacoord/services.go:730] ["received get flushed segments request"] [collectionID=439807313437199999] [partitionID=-1]
milvus-standalone | [2023/03/03 23:04:42.215 +00:00] [INFO] [indexcoord/index_coord.go:588] ["IndexCoord completeIndexInfo"] [collID=439807313437199999] [indexName=_default_idx_101]
milvus-standalone | [2023/03/03 23:04:42.215 +00:00] [DEBUG] [indexcoord/meta_table.go:651] ["IndexCoord get index states success"] [indexID=439807313437200027] [total=6] [None=0] [Unissued=0] [InProgress=0] [Finished=6] [Failed=0]
milvus-standalone | [2023/03/03 23:04:42.215 +00:00] [DEBUG] [indexcoord/meta_table.go:728] ["IndexCoord get index states success"] [indexID=439807313437200027] [indexRows=97659]
milvus-standalone | [2023/03/03 23:04:42.216 +00:00] [INFO] [indexcoord/index_coord.go:621] ["IndexCoord completeIndexInfo success"] [collID=439807313437199999] [totalRows=97659] [indexRows=97659] [state=Finished] [failReason=]

Anything else?

No response

WeibingChen17 avatar Mar 03 '23 23:03 WeibingChen17

@WeibingChen17 did you create the index before or after inserting? If so, it could be building the index.

/assign @WeibingChen17 /unassign

yanliang567 avatar Mar 04 '23 02:03 yanliang567

I set up a job to create the index at 4am and to insert 1,000 entities every 15 minutes.

So there was an index building last night, and there will be another one tonight.

If that is the case, why? Is the index out of sync with the entities? Should I create indices only after all entities are inserted? Where does the documentation mention this?

/assign @yanliang567 /unassign

WeibingChen17 avatar Mar 04 '23 02:03 WeibingChen17

You don't need to create an index every time after an insert. Once you have created an index for a collection, Milvus automatically builds the index for newly inserted data. Besides, Milvus will try to compact small segments into bigger ones until the segment size is close to maxSize (512 MB by default), and indices are built automatically for the new compacted segments as well. Check https://milvus.io/docs/product_faq.md#Can-indexes-be-created-after-inserting-vectors
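
Roughly, the pattern looks like this (a minimal pymilvus sketch, not your exact code; the collection name, field names, and dimension are just examples):

from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    connections,
)

connections.connect(host="localhost", port="19530")

# Example collection: INT64 auto-id primary key plus a 128-dim float vector field.
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128),
]
collection = Collection("demo_collection", CollectionSchema(fields))

# Create the index once, right after creating the collection.
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 1024}},
)

# Later inserts do not need another create_index() call: Milvus builds the index
# for newly inserted data, and for compacted segments, automatically.
vectors = [[0.0] * 128 for _ in range(1000)]
collection.insert([vectors])
collection.flush()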

yanliang567 avatar Mar 04 '23 02:03 yanliang567

So how can I reduce the cpu usage? Thanks 

WeibingChen17 avatar Mar 04 '23 04:03 WeibingChen17

So how can I reduce the cpu usage? Thanks

Maybe you can use pprof and the perf tool to check which part is consuming the CPU.

xiaofan-luan avatar Mar 04 '23 07:03 xiaofan-luan

We've already enabled the pprof server on Milvus. Open http://localhost:9876/debug/pprof in your browser and you will get some flame graphs for the Go side. If it's not Go-side consumption, then you will have to install perf tools in your local Docker container.

xiaofan-luan avatar Mar 04 '23 07:03 xiaofan-luan

OK. Thanks.

As I am using the standalone Docker version, I updated the docker-compose.yml to map 9876 to host port 9876. But no luck accessing it from another machine: "Unable to connect".

I tried to use wget to fetch something, but no luck:

--2023-03-03 23:57:23-- (try: 5) http://localhost:9876/debug/pprof
Connecting to localhost (localhost)|::1|:9876... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

Any suggestion on "have to install perf tools on your local docker"? Thanks!

WeibingChen17 avatar Mar 04 '23 07:03 WeibingChen17

Sorry about the mistake; it seems the port is 9091.

xiaofan-luan avatar Mar 04 '23 14:03 xiaofan-luan

OK, the HTTP port.

However, still no luck. I changed the mapping to 9092 because 9091 is occupied by another service on the host.

$ wget localhost:9092
--2023-03-04 10:41:27--  http://localhost:9092/
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:9092... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

--2023-03-04 10:41:29--  (try: 2)  http://localhost:9092/
Connecting to localhost (localhost)|::1|:9092... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

--2023-03-04 10:41:31--  (try: 3)  http://localhost:9092/
Connecting to localhost (localhost)|::1|:9092... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

^C

$ lsof -i :9092
COMMAND      PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
docker-pr 980708 root    4u  IPv4 1716218      0t0  TCP *:9092 (LISTEN)
docker-pr 980716 root    4u  IPv6 1717096      0t0  TCP *:9092 (LISTEN)

WeibingChen17 avatar Mar 04 '23 18:03 WeibingChen17

Sorry my bad. I did 9092:9092, which should be 9091:9092. After I updated it, I got this error:

$ curl localhost:9092/api/v1/health
curl: (56) Recv failure: Connection reset by peer

WeibingChen17 avatar Mar 04 '23 18:03 WeibingChen17

OK. Got it fixed. It should be 9092:9091. Thanks for your help.
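
For reference, the relevant part of my docker-compose.yml now looks roughly like this (service name and the other ports are simplified from the stock standalone compose file):

services:
  standalone:
    image: milvusdb/milvus:v2.2.2
    ports:
      - "19530:19530"   # gRPC port used by the SDK
      - "9092:9091"     # host:container; Milvus HTTP/metrics/pprof port, remapped to 9092 on the host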

But unfortunately there is no flame graph on that page:

/debug/pprof/

Types of profiles available:


Count | Profile
-- | --
368 | allocs
0 | block
0 | cmdline
573 | goroutine
368 | heap
0 | mutex
0 | profile
17 | threadcreate
0 | trace

WeibingChen17 avatar Mar 04 '23 19:03 WeibingChen17

OK.

$ go tool pprof http://localhost:9092/debug/pprof/profile?seconds=30

and uploaded the file here:

pprof.milvus.samples.cpu.001.pb.gz

WeibingChen17 avatar Mar 04 '23 19:03 WeibingChen17

More details:

The heap profile:

File: milvus
Build ID: a43638bb2a61d7c7130ca994643ae9fd205a64ac
Type: inuse_space
Time: Mar 4, 2023 at 11:57am (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 66.81MB, 88.68% of 75.34MB total
Showing top 20 nodes out of 348
      flat  flat%   sum%        cum   cum%
   18.01MB 23.90% 23.90%    18.01MB 23.90%  go.uber.org/zap/zapcore.newCounters
   13.70MB 18.18% 42.08%    13.70MB 18.18%  google.golang.org/protobuf/internal/impl.consumeStringSliceValidateUTF8
    4.64MB  6.16% 48.24%     5.64MB  7.49%  google.golang.org/protobuf/proto.MarshalOptions.marshal
    4.02MB  5.34% 53.58%     4.02MB  5.34%  github.com/milvus-io/milvus/internal/mq/msgstream.NewMqMsgStream
    3.36MB  4.47% 58.05%     3.36MB  4.47%  github.com/milvus-io/milvus/internal/util/typeutil.AppendFieldData
    2.66MB  3.53% 61.58%     2.66MB  3.53%  google.golang.org/grpc/internal/transport.newBufWriter
    2.43MB  3.23% 64.81%     2.43MB  3.23%  github.com/goccy/go-json/internal/decoder.init.0
    2.41MB  3.20% 68.01%     7.28MB  9.66%  github.com/milvus-io/milvus/internal/proxy.reduceRetrieveResults
    2.06MB  2.74% 70.74%     2.06MB  2.74%  bufio.NewReaderSize
    1.70MB  2.26% 73.00%     1.70MB  2.26%  github.com/klauspost/compress/zstd.(*blockDec).reset
    1.60MB  2.12% 75.12%     1.60MB  2.12%  github.com/milvus-io/milvus/internal/mq/msgstream/mqwrapper/rmq.(*rmqClient).Subscribe
    1.56MB  2.07% 77.19%     1.56MB  2.07%  bytes.makeSlice
    1.56MB  2.07% 79.26%     1.56MB  2.07%  github.com/milvus-io/milvus/internal/config.(*FileSource).loadFromFile
    1.50MB  1.99% 81.25%     1.50MB  1.99%  github.com/milvus-io/milvus/internal/util/typeutil.GetPK
    1.33MB  1.76% 83.01%     1.33MB  1.76%  github.com/goccy/go-json/internal/encoder.init.0
    1.16MB  1.55% 84.56%     1.16MB  1.55%  github.com/milvus-io/milvus/internal/allocator.(*CachedAllocator).Init
    1.07MB  1.42% 85.98%     1.07MB  1.42%  google.golang.org/protobuf/internal/strs.(*Builder).AppendFullName
       1MB  1.33% 87.30%        1MB  1.33%  regexp.onePassCopy
    0.52MB  0.69% 87.99%     1.02MB  1.35%  github.com/milvus-io/milvus/internal/config.(*Manager).pullSourceConfigs
    0.51MB  0.68% 88.68%     0.51MB  0.68%  io.ReadAll

Could it be due to my schema using a file path as the ID?

    # Methods from my wrapper class; self.embedder, self.collection, and
    # log_with_time are defined elsewhere, and the pymilvus imports are:
    # from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, utility
    def build_collection_if_not_exists(self) -> None:
        if not utility.has_collection(self.collection_name):
            dim = self.embedder.get_dim()
            filename = FieldSchema(
                name="filename",
                dtype=DataType.VARCHAR,
                max_length=1024,
                is_primary=True,
            )
            embedding = FieldSchema(
                name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim
            )
            schema = CollectionSchema(fields=[filename, embedding])
            Collection(
                name=self.collection_name, schema=schema, using="default", shards_num=1
            )
            self.build_index()

    def build_index(self) -> None:
        self.collection.release()
        index_params = {
            "metric_type": "L2",
            "index_type": "IVF_FLAT",
            "params": {"nlist": 1024},
        }
        log_with_time(f"Start building index with {index_params}")
        self.collection.create_index(field_name="embedding", index_params=index_params)
        self.collection.load()
        log_with_time("Finish building index")
WeibingChen17 avatar Mar 04 '23 20:03 WeibingChen17

From the stack you offered, most of the CPU is spent on RocksDB. Milvus keeps sending time ticks even without any writes, and a time tick happens every 200 ms per physical channel. You can either:

  1. change the MQ implementation to Pulsar or Kafka, or
  2. reduce rootCoord.dmlChannelNum to 32 or 64 if you don't have that many collections (see the sketch below).
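
For example, with the standalone docker-compose deployment you could override the setting by mounting a custom milvus.yaml into the container (a sketch; the mount path and value are what I would try, adjust as needed):

# In a custom milvus.yaml mounted into the container, e.g. via docker-compose.yml:
#   volumes:
#     - ./milvus.yaml:/milvus/configs/milvus.yaml
rootCoord:
  dmlChannelNum: 64   # default is 256; 32 or 64 is enough if you only have a few collections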

xiaofan-luan avatar Mar 05 '23 03:03 xiaofan-luan

Thanks, Xiaofan!

I also saw the rocksdb consume/deliver calls, but I still do not know why this tick can cause high CPU usage.

I am new to this tool, so I have no idea how to change these configurations in the Docker container. If it is not related to my schema, others may hit the same issue with the standalone Docker version. Do you have any plans to change this?

Thanks again!

WeibingChen17 avatar Mar 05 '23 03:03 WeibingChen17

By default the channel number is 256 since on large clusters with more nodes and collections it helps to improve performance.

For a smaller instance, changing the default channel number makes sense to me.

@czs007 @congqixia @jiaoew1991 Do you think it's a good idea to change the default channel number to 32 or 64? With the help of the new msgstream dispatcher design, this might not be a serious problem for most users.

xiaofan-luan avatar Mar 05 '23 07:03 xiaofan-luan

By default the channel number is 256 since on large clusters with more nodes and collections it helps to improve performance.

For a smaller instance, changing the default channel number makes sense to me.

@czs007 @congqixia @jiaoew1991 Do you think it's a good idea to change the default channel number to 32 or 64? With the help of the new msgstream dispatcher design, this might not be a serious problem for most users.

It makes sense to reduce the default channel number. There is another improvement we could discuss: whether idle channels should still require high CPU usage.

congqixia avatar Mar 05 '23 08:03 congqixia

One more thing: I spent some time migrating my schema to an INT64-based primary key instead of the VARCHAR(1024) primary key. The CPU usage is now 6-10% most of the time instead of 12-16%. I attached the new pprof profile, and I saw that rocksmq is no longer the heaviest part: pprof.milvus.samples.cpu.003.pb.gz

Just FYI. It could be a red herring.
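
Roughly, the primary key change looks like this (a simplified sketch, not my exact schema; here the file path is kept as an ordinary VARCHAR field instead of the primary key):

# INT64 auto-id primary key; the file path becomes a normal scalar field.
pk = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True)
filename = FieldSchema(name="filename", dtype=DataType.VARCHAR, max_length=1024)
embedding = FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim)
schema = CollectionSchema(fields=[pk, filename, embedding])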

WeibingChen17 avatar Mar 05 '23 19:03 WeibingChen17

Question: did you do any writes while you saw that CPU usage? I have no idea why CPU usage would be related to the PK type.

xiaofan-luan avatar Mar 06 '23 09:03 xiaofan-luan

By the way, all the stacks look as expected for now. There doesn't seem to be much we can do except reduce the channel number and the sync frequency.

xiaofan-luan avatar Mar 06 '23 09:03 xiaofan-luan

No, I did not do any writing during that CPU usage. Does the message include information about the PK?

WeibingChen17 avatar Mar 07 '23 06:03 WeibingChen17

No, I did not do any writing during that CPU usage. Does the message include information about the PK?

Did it help if you reduced the shard number?

xiaofan-luan avatar Mar 10 '23 10:03 xiaofan-luan

As you could see above, I used shards_num=1 for this collection.

WeibingChen17 avatar Mar 10 '23 21:03 WeibingChen17

As you could see above, I used shards_num=1 for this collection.

/assign @yanliang567

xiaofan-luan avatar Mar 11 '23 12:03 xiaofan-luan

@yanliang567 is there any way we can reproduce this in-house?

xiaofan-luan avatar Mar 11 '23 12:03 xiaofan-luan

Not reproduced in-house; memory and CPU stay stable (screenshot attached).

yanliang567 avatar Mar 20 '23 02:03 yanliang567

I guess this might only be reproducible on 2C/8G machines or smaller. @WeibingChen17 what is your pod CPU limit?

xiaofan-luan avatar Mar 20 '23 13:03 xiaofan-luan

or did you change any default settings?

xiaofan-luan avatar Mar 20 '23 13:03 xiaofan-luan

No, I did not make any changes to the settings. I used docker-compose to launch the Milvus standalone instance and did not change any configuration inside. The only change is the mapping for port 9091, which was occupied by other services on the host.

My cpu and memory is

CPU/Memory: Intel(R) Core(TM) i3-2350M CPU @ 2.30GHz / 16G

WeibingChen17 avatar Mar 22 '23 06:03 WeibingChen17

No, I did not make any changes to the settings. I used docker-compose to launch the Milvus standalone instance and did not change any configuration inside. The only change is the mapping for port 9091, which was occupied by other services on the host.

My cpu and memory is

CPU/Memory: Intel(R) Core(TM) i3-2350M CPU @ 2.30GHz / 16G

Do you have a pprof profile for that CPU usage?

xiaofan-luan avatar Mar 22 '23 07:03 xiaofan-luan