milvus [Bug]: complete compaction wrong: etcdserver: request is too large

Is there an existing issue for this?

[X] I have searched the existing issues

Environment

- Milvus version: milvus-dev:2.1.0-latest(image id:9f3ff7f688fc)
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 16c/64G
- GPU: None
- Others:

Current Behavior

Milvus 2.0.2 was deployed with docker-compose on a single server.
The number of querynodes and the number of datanodes were each set to 2.
Last Monday morning one datanode went down. The error log showed, among other messages: ResourceExhausted desc=trying to send message larger than max (3123942 vs. 2097152)
The same "larger than max" error kept appearing after the service was restarted.
After switching to the milvus-dev:2.1.0-latest (image id: 9f3ff7f688fc) image, the service restarted successfully.
This Monday one querynode went down and some collections could not be queried. The error log showed, among other messages: failed to complete compaction etcdserver: request is too large
After restarting the service again, no node goes down, but some collections still cannot be queried.
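
For scale, the two numbers in the first error can be read straight out of the message: 2097152 bytes is exactly 2 MiB, and the rejected message was roughly 1 MiB over it. The snippet below is only an illustrative sketch; the helper name `parse_size_error` is made up for this note and is not part of Milvus, etcd, or gRPC.

```python
import re

# Matches the size pair in a gRPC "larger than max (A vs. B)" message.
_SIZE_RE = re.compile(r"larger than max \((\d+) vs\. (\d+)\)")


def parse_size_error(message: str):
    """Return (actual_bytes, limit_bytes), or None if the text does not match."""
    match = _SIZE_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


msg = ("ResourceExhausted desc=trying to send message "
       "larger than max (3123942 vs. 2097152)")
actual, limit = parse_size_error(msg)

# Report both sizes in MiB and how far the message exceeds the limit.
print(f"message: {actual / 2**20:.2f} MiB")
print(f"limit:   {limit / 2**20:.2f} MiB")
print(f"over by: {actual - limit} bytes")
```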

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

Jul 05 '22 07:07 p363796822

Hi @p363796822, did you delete data in your use case? We just fixed an issue in compaction: if too much is deleted there will be many delta log paths, which causes the segment to be really large. If you are still testing, dropping the collection would work. Otherwise we have to think of a hack to recover.

Jul 05 '22 07:07 xiaofan-luan

No data has been deleted, but we are writing and querying at the same time. The error reported last Monday happened suddenly. After it happened, version 2.0.2 could not be started. After trying various methods, I finally thought of starting with the latest version.
The data is no longer test data but data generated in production, so simply dropping the collection is not an option.

Jul 05 '22 08:07 p363796822

Do you have deletes or frequent num_entity/flush calls?
If you can change the etcd config to relax the 2M limit to 5M, the problem should be resolved automatically later on, after compaction. Would that be a workable fix?

Jul 05 '22 08:07 xiaofan-luan

Our code contains no num_entity/flush calls, and no data has been deleted after writing.
How do I modify the etcd configuration? On version 2.0.2 I tried adding the setting to both the docker-compose file and the service configuration file milvus.yaml, but it did not take effect. Could you describe the steps for changing the etcd configuration in more detail? Thank you.

Jul 05 '22 09:07 p363796822

Could you provide your logs? We need more logs to investigate, because without frequent deletes or flushes this issue should not happen.

Jul 05 '22 12:07 xiaofan-luan

@p363796822 Could you please refer to this script to export the whole Milvus logs for investigation?

/assign @p363796822 /unassign

Jul 06 '22 01:07 yanliang567

The full logs are too big. I will export the logs from the day before and the day after the error and upload them here to see whether they help the investigation. If not, I will export more logs.

Link: https://pan.baidu.com/s/1zVz-aTtnBxqzlJatcoEnKw?pwd=xe5q Extraction code: xe5q

Jul 06 '22 01:07 p363796822

@xiaofan-luan could you please have someone take a look at this issue? I think we are working on some improvements for etcd request size.

Jul 06 '22 02:07 yanliang567

https://pan.baidu.com/s/1zVz-aTtnBxqzlJatcoEnKw?pwd=xe5q This is the network drive link where the logs are located. Please help find out what the specific problem is.

Jul 06 '22 02:07 p363796822

I have looked at the log. There are a lot of input binlogs in CompleteMergeCompaction, which leads to a huge Binlogs field in the newly compacted SegmentInfo (it fails to be stored in etcd). We plan to save binlogs separately, see: #17988. We will also explore whether it is reasonable for so many binlogs to be compacted at once.
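
To illustrate why a long list of binlog paths can push a single metadata record past etcd's request limit, here is a rough back-of-the-envelope sketch. Everything in it is an assumption made for illustration: the path layout in `fake_paths`, the 8-byte per-entry overhead, and the size model itself do not reflect how Milvus actually serializes SegmentInfo. The only external fact used is etcd's default `--max-request-bytes` value of 1572864 bytes (1.5 MiB).

```python
# etcd's default server-side request size limit (--max-request-bytes).
ETCD_DEFAULT_MAX_REQUEST_BYTES = 1_572_864


def estimated_binlog_metadata_bytes(paths, per_entry_overhead=8):
    """Rough serialized size of a list of binlog paths (hypothetical model)."""
    return sum(len(p.encode("utf-8")) + per_entry_overhead for p in paths)


def fits_in_request(paths, limit=ETCD_DEFAULT_MAX_REQUEST_BYTES):
    """True if the estimated metadata size stays within the request limit."""
    return estimated_binlog_metadata_bytes(paths) <= limit


def fake_paths(n):
    """Generate n made-up binlog paths; the layout is for illustration only."""
    return [f"files/insert_log/1/2/3/100/{i:019d}" for i in range(n)]


for count in (1_000, 20_000, 40_000):
    size = estimated_binlog_metadata_bytes(fake_paths(count))
    verdict = "fits" if fits_in_request(fake_paths(count)) else "too large"
    print(f"{count} paths: about {size} bytes, {verdict}")
```

Under this toy model the size grows linearly with the number of input binlogs, so one compaction that merges tens of thousands of them produces a record that no longer fits in one etcd request.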

Jul 11 '22 05:07 wayblink

So is this problem solved now? Is there a new version to try?

Jul 11 '22 06:07 p363796822

Not yet. For your situation, you need to change the etcd configuration as mentioned above. If you use docker-compose to start the whole Milvus cluster including etcd, you can change the configuration by editing the compose file. Refer to: https://github.com/milvus-io/milvus/pull/17357/files#diff-10e860e50bee7caf0095374d72b2afdbb81b82704ef8c062f93d7adaf0ac3b54
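
As a hedged pointer for the 5M value suggested earlier in the thread: on the etcd server side, the request size limit is set with the `--max-request-bytes` flag or, equivalently, the `ETCD_MAX_REQUEST_BYTES` environment variable. The sketch below only computes the byte value and formats both spellings. The helper name is made up for this note, and it is an assumption that this is the setting that matters for this deployment. Which settings the linked diff changes is not restated here, and the client-side gRPC send limit seen in the first error may need attention separately.

```python
MIB = 1024 * 1024


def etcd_request_limit_settings(limit_mib: int) -> dict:
    """Format etcd's request size limit as a CLI flag and as an env var."""
    limit_bytes = limit_mib * MIB
    return {
        "flag": f"--max-request-bytes={limit_bytes}",
        "env": f"ETCD_MAX_REQUEST_BYTES={limit_bytes}",
    }


settings = etcd_request_limit_settings(5)
print(settings["flag"])  # append to the etcd command line
print(settings["env"])   # or set in the etcd container's environment
```

Either form would go on the etcd service in the compose file, not in milvus.yaml, since the limit is enforced by the etcd server itself.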

Jul 11 '22 08:07 wayblink

There is a new problem, the log is as follows:

[querynode1]
milvus-querynode1 | [2022/07/12 03:17:28.832 +00:00] [WARN] [node.go:83] ["some node(s) haven't received input"] [list="[nodeCtxTtChecker-dmInputNode-query-432856225019330561-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_4_432856225110818817v0,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-dmInputNode-query-432856224953532417-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-dmInputNode-query-432856225110818817-by-dev-rootcoord-delta_5_432856225110818817v1,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-dmInputNode-query-432856224953532417-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_4_432856225110818817v0,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_5_432856225110818817v1,nodeCtxTtChecker-dmInputNode-query-432856225110818817-by-dev-rootcoord-delta_4_432856225110818817v0,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_5_432856225110818817v1,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_4_432856225110818817v0,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_5_432856225110818817v1,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-dmInputNode-query-432856225019330561-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_2_432856225019330561v0]"] ["duration "=2m0s]
milvus-querynode1 | [2022/07/12 03:17:37.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856638289084417] [tSafe=434528808388788225] [tSafe_p=2022/07/12 03:17:37.196 +00:00] [channel=by-dev-rootcoord-dml_7_432856638289084417v1]
milvus-querynode1 | [2022/07/12 03:17:47.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225019330561] [tSafe=434528811009966081] [tSafe_p=2022/07/12 03:17:47.195 +00:00] [channel=by-dev-rootcoord-dml_2_432856225019330561v0]
milvus-querynode1 | [2022/07/12 03:17:57.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225019330561] [tSafe=434528813618298881] [tSafe_p=2022/07/12 03:17:57.145 +00:00] [channel=by-dev-rootcoord-dml_3_432856225019330561v1]
milvus-querynode1 | [2022/07/12 03:18:07.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856638289084417] [tSafe=434528816252846081] [tSafe_p=2022/07/12 03:18:07.195 +00:00] [channel=by-dev-rootcoord-dml_6_432856638289084417v0]
milvus-querynode1 | [2022/07/12 03:18:17.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856638289084417] [tSafe=434528818874286081] [tSafe_p=2022/07/12 03:18:17.195 +00:00] [channel=by-dev-rootcoord-dml_6_432856638289084417v0]
milvus-querynode1 | [2022/07/12 03:18:27.298 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225019330561] [tSafe=434528821495726081] [tSafe_p=2022/07/12 03:18:27.195 +00:00] [channel=by-dev-rootcoord-dml_3_432856225019330561v1]
milvus-querynode1 | [2022/07/12 03:18:37.300 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=433802452728283137] [tSafe=434528824117428225] [tSafe_p=2022/07/12 03:18:37.196 +00:00] [channel=by-dev-rootcoord-dml_14_433802452728283137v0]
milvus-querynode1 | [2022/07/12 03:18:47.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225019330561] [tSafe=434528826738868225] [tSafe_p=2022/07/12 03:18:47.196 +00:00] [channel=by-dev-rootcoord-dml_3_432856225019330561v1]
milvus-querynode1 | [2022/07/12 03:18:57.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856224953532417] [tSafe=434528829360046081] [tSafe_p=2022/07/12 03:18:57.195 +00:00] [channel=by-dev-rootcoord-dml_1_432856224953532417v1]
milvus-querynode1 | [2022/07/12 03:19:07.300 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856224953532417] [tSafe=434528831981486081] [tSafe_p=2022/07/12 03:19:07.195 +00:00] [channel=by-dev-rootcoord-dml_0_432856224953532417v0]
milvus-querynode1 | [2022/07/12 03:19:17.301 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225110818817] [tSafe=434528834602926081] [tSafe_p=2022/07/12 03:19:17.195 +00:00] [channel=by-dev-rootcoord-dml_5_432856225110818817v1]

[querynode2]
milvus-querynode2 | [2022/07/12 03:17:06.616 +00:00] [WARN] [node.go:83] ["some node(s) haven't received input"] [list="[nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-dmInputNode-query-432856224953532417-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-dmInputNode-query-432856224953532417-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-dmInputNode-query-432856225019330561-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-dmInputNode-query-432856225019330561-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_3_432856225019330561v1]"] ["duration "=2m0s]
milvus-querynode2 | [2022/07/12 03:17:07.302 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=433802452728283137] [tSafe=434528800524206081] [tSafe_p=2022/07/12 03:17:07.195 +00:00] [channel=by-dev-rootcoord-dml_14_433802452728283137v0]
milvus-querynode2 | [2022/07/12 03:17:17.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=433802452728283137] [tSafe=434528803145646081] [tSafe_p=2022/07/12 03:17:17.195 +00:00] [channel=by-dev-rootcoord-dml_14_433802452728283137v0]
milvus-querynode2 | [2022/07/12 03:17:27.300 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=433802452728283137] [tSafe=434528805767348225] [tSafe_p=2022/07/12 03:17:27.196 +00:00] [channel=by-dev-rootcoord-dml_14_433802452728283137v0]
milvus-querynode2 | [2022/07/12 03:17:37.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=433802452728283137] [tSafe=434528808388788225] [tSafe_p=2022/07/12 03:17:37.196 +00:00] [channel=by-dev-rootcoord-dml_15_433802452728283137v1]
milvus-querynode2 | [2022/07/12 03:17:47.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=433802452728283137] [tSafe=434528811009966081] [tSafe_p=2022/07/12 03:17:47.195 +00:00] [channel=by-dev-rootcoord-dml_15_433802452728283137v1]
milvus-querynode2 | [2022/07/12 03:17:57.300 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225110818817] [tSafe=434528813618298881] [tSafe_p=2022/07/12 03:17:57.145 +00:00] [channel=by-dev-rootcoord-dml_4_432856225110818817v0]
milvus-querynode2 | [2022/07/12 03:18:07.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225110818817] [tSafe=434528816252846081] [tSafe_p=2022/07/12 03:18:07.195 +00:00] [channel=by-dev-rootcoord-dml_4_432856225110818817v0]
milvus-querynode2 | [2022/07/12 03:18:17.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225110818817] [tSafe=434528818874286081] [tSafe_p=2022/07/12 03:18:17.195 +00:00] [channel=by-dev-rootcoord-dml_4_432856225110818817v0]
milvus-querynode2 | [2022/07/12 03:18:27.298 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225019330561] [tSafe=434528821495726081] [tSafe_p=2022/07/12 03:18:27.195 +00:00] [channel=by-dev-rootcoord-dml_3_432856225019330561v1]
milvus-querynode2 | [2022/07/12 03:18:37.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225110818817] [tSafe=434528824117428225] [tSafe_p=2022/07/12 03:18:37.196 +00:00] [channel=by-dev-rootcoord-dml_4_432856225110818817v0]
milvus-querynode2 | [2022/07/12 03:18:47.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856638289084417] [tSafe=434528826738868225] [tSafe_p=2022/07/12 03:18:47.196 +00:00] [channel=by-dev-rootcoord-dml_7_432856638289084417v1]
milvus-querynode2 | [2022/07/12 03:18:57.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856224953532417] [tSafe=434528829360046081] [tSafe_p=2022/07/12 03:18:57.195 +00:00] [channel=by-dev-rootcoord-dml_1_432856224953532417v1]
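
The WARN entries above repeat the same few channels under several checker names, which makes them hard to scan. A small sketch like the following can reduce one such line to its distinct channel names. The helper `stalled_channels` and its regular expression are written for this note, are not part of Milvus, and assume the channel naming seen in this log (`by-dev-rootcoord-delta_…` / `by-dev-rootcoord-dml_…`).

```python
import re

# Channel names as they appear in this log: prefix, kind, index, collection ID, suffix.
_CHANNEL_RE = re.compile(r"by-dev-rootcoord-(?:delta|dml)_\d+_\d+v\d+")


def stalled_channels(log_line: str):
    """Return the sorted distinct channel names mentioned in one log line."""
    return sorted(set(_CHANNEL_RE.findall(log_line)))


# Shortened sample built from entries in the querynode2 WARN line.
sample = (
    '[WARN] [node.go:83] ["some node(s) haven\'t received input"] '
    '[list="[nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_3_432856225019330561v1,'
    'nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_3_432856225019330561v1,'
    'nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_2_432856225019330561v0]"]'
)

for channel in stalled_channels(sample):
    print(channel)
```

In the two WARN lines shown here, every listed entry is a delta channel, while the DEBUG lines show tSafe still advancing on the dml channels.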

Jul 12 '22 03:07 p363796822

/unassign @p363796822 /assign @wayblink

Jul 13 '22 04:07 soothing-rain

No critical error is printed. What is the problem on the client side? You can upload the complete log.

Jul 18 '22 08:07 wayblink

The log above is output during a query. The query eventually times out and the data cannot be retrieved.

Jul 18 '22 08:07 p363796822

Regarding "some node(s) haven't received input": please upload the complete log if you can reproduce the situation. What is available so far is not enough for me to find the root cause.

Jul 18 '22 13:07 wayblink

@p363796822 Can you find the milvus log? Do you need other help?

Sep 05 '22 06:09 JackLCL

/close

Sep 21 '22 08:09 wayblink

@wayblink: Closing this issue.

Sep 21 '22 08:09 sre-ci-robot
