
[Bug]: complete compaction wrong: etcdserver: request is too large

Open p363796822 opened this issue 3 years ago • 17 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: milvus-dev:2.1.0-latest(image id:9f3ff7f688fc)
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 16c/64G
- GPU: None
- Others:

Current Behavior

  1. Deployed Milvus 2.0.2 on a single server with docker-compose.
  2. Scaled querynode and datanode to 2 nodes each.
  3. Last Monday morning, one datanode went down. The error log showed, among other messages: ResourceExhausted desc=trying to send message larger than max (3123942 vs. 2097152)
  4. The same "larger than max" error kept appearing after restarting the service.
  5. Switched to the milvus-dev:2.1.0-latest image (image id: 9f3ff7f688fc), and the service restarted successfully.
  6. This Monday, one querynode went down and some collections could not be queried. The error log showed, among other messages: failed to complete compaction etcdserver: request is too large
  7. After restarting the service, no node goes down, but some collections still cannot be queried.
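
For reference, the two numbers in the first error can be read as "attempted size vs. limit" in bytes. The snippet below is a minimal sketch that parses the quoted message and prints both values in MiB; the helper name `parse_size_error` is made up for illustration and is not part of Milvus or etcd.

```python
import re

# Matches the size pair in the error text quoted in this report:
# "trying to send message larger than max (3123942 vs. 2097152)"
_SIZE_PAIR = re.compile(r"larger than max \((\d+) vs\. (\d+)\)")


def parse_size_error(message: str):
    """Return (attempted_bytes, limit_bytes), or None if the text does not match."""
    match = _SIZE_PAIR.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


error = (
    "ResourceExhausted desc=trying to send message larger than max "
    "(3123942 vs. 2097152)"
)
attempted, limit = parse_size_error(error)

print(f"limit:     {limit / 2**20:.2f} MiB")
print(f"attempted: {attempted / 2**20:.2f} MiB")
print(f"over by:   {attempted - limit} bytes")
```

The limit works out to exactly 2 MiB, and the rejected message was a little under 3 MiB.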

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

p363796822 avatar Jul 05 '22 07:07 p363796822

Hi @p363796822, do you delete data in your use case? We just fixed an issue in compaction: if there are too many deletes, there will be many delta log paths, which causes the segment to become really large. If you are still testing, dropping the collection would work. Otherwise we have to think of a hack to recover.

xiaofan-luan avatar Jul 05 '22 07:07 xiaofan-luan

  1. No data has been deleted, but we write and query at the same time. The error last Monday appeared suddenly, and afterwards version 2.0.2 could not be started. After trying various approaches, I ended up starting the service with the latest version.
  2. This is no longer test data; it is data generated in production, so we cannot simply drop the collection to solve the problem.

p363796822 avatar Jul 05 '22 08:07 p363796822

  1. Do you delete data, or call num_entity/flush frequently?
  2. If you can change the etcd config to relax the 2M limit to 5M, the problem should resolve automatically once compaction completes. Would that work as a fix for you?

xiaofan-luan avatar Jul 05 '22 08:07 xiaofan-luan

  1. Our code has no num_entity/flush calls, and no data has been deleted after being written.
  2. How do I modify the etcd configuration? On version 2.0.2 I tried adding the setting to both the docker-compose file and the service configuration file milvus.yaml, but it did not take effect. Could you describe the steps for changing the etcd configuration in more detail? Thank you.

p363796822 avatar Jul 05 '22 09:07 p363796822

Could you provide your logs? We need more logs to investigate, because without frequent deletes or flushes this issue should not happen.

xiaofan-luan avatar Jul 05 '22 12:07 xiaofan-luan

@p363796822 Could you please refer to this script to export the complete Milvus logs for investigation?

/assign @p363796822 /unassign

yanliang567 avatar Jul 06 '22 01:07 yanliang567

The complete logs are too big. I will export the logs from the day before and the day after the error and upload them here to see whether they help the investigation. If not, I will export more logs.

Link: https://pan.baidu.com/s/1zVz-aTtnBxqzlJatcoEnKw?pwd=xe5q Extraction code: xe5q

p363796822 avatar Jul 06 '22 01:07 p363796822

@xiaofan-luan could you please have someone take a look at this issue? I think we are already working on some improvements for etcd request size.

yanliang567 avatar Jul 06 '22 02:07 yanliang567

https://pan.baidu.com/s/1zVz-aTtnBxqzlJatcoEnKw?pwd=xe5q This is the network-drive link where the logs are located. Please help identify the specific problem.

p363796822 avatar Jul 06 '22 02:07 p363796822

I have looked at the logs. CompleteMergeCompaction receives a large number of input binlogs, which produces a huge Binlogs field in the newly compacted SegmentInfo, and that record fails to be stored in etcd. We plan to save binlogs separately, see: #17988. We will also look into whether it is reasonable to compact so many binlogs at once.
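
To illustrate why a long list of binlog paths can exceed a 2 MiB limit, and what "saving separately" could look like in the simplest case, here is a rough sketch that greedily groups paths so each group stays under a byte budget. Everything in it is an assumption for illustration: the path format, the counts, and the `chunk_paths` helper are invented, serialization overhead is ignored, and this is not the design proposed in #17988.

```python
from typing import List


def chunk_paths(paths: List[str], max_bytes: int) -> List[List[str]]:
    """Greedily group paths so each group's total UTF-8 size is <= max_bytes."""
    chunks: List[List[str]] = []
    current: List[str] = []
    size = 0
    for path in paths:
        path_size = len(path.encode("utf-8"))
        if path_size > max_bytes:
            raise ValueError("a single path is larger than the limit")
        # Start a new group when adding this path would exceed the budget.
        if current and size + path_size > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(path)
        size += path_size
    if current:
        chunks.append(current)
    return chunks


LIMIT = 2 * 1024 * 1024  # the 2097152-byte limit from the error message

# Hypothetical input: 50,000 made-up paths of 60 bytes each (3,000,000 bytes).
paths = [f"files/insert_log/{i:043d}" for i in range(50_000)]
total = sum(len(p.encode("utf-8")) for p in paths)
chunks = chunk_paths(paths, LIMIT)

print(f"total path bytes: {total} (limit {LIMIT})")
print(f"groups needed: {len(chunks)}")
```

Stored as one value, the paths alone would already be over the limit; split into groups, each value fits.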

wayblink avatar Jul 11 '22 05:07 wayblink

So is this problem solved now? Is there a new version to try?

p363796822 avatar Jul 11 '22 06:07 p363796822

Not yet. For your situation, you need to change the etcd configuration as mentioned above. If you use docker-compose to start the whole Milvus cluster, including etcd, you can change the configuration by editing the compose file. Refer to: https://github.com/milvus-io/milvus/pull/17357/files#diff-10e860e50bee7caf0095374d72b2afdbb81b82704ef8c062f93d7adaf0ac3b54
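
As a small aid for the "2M to 5M" suggestion, the sketch below converts a limit in MiB to bytes and formats it as an etcd server argument. `--max-request-bytes` is an etcd server flag for the maximum client request size, but this thread does not confirm that it is the exact setting changed in the linked PR, and the client-side gRPC limit seen in the first error is a separate setting. The helper name is made up for illustration.

```python
MIB = 1024 * 1024


def etcd_max_request_flag(limit_mib: int) -> str:
    """Format an etcd server argument for a request size limit given in MiB."""
    if limit_mib <= 0:
        raise ValueError("limit_mib must be positive")
    return f"--max-request-bytes={limit_mib * MIB}"


# 5 MiB, the value suggested earlier in this thread.
print(etcd_max_request_flag(5))
```

In a docker-compose deployment, an argument like this would go on the etcd service's command line; check the linked diff for the settings the maintainers actually recommend.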

wayblink avatar Jul 11 '22 08:07 wayblink

There is a new problem, the log is as follows:

[querynode1]
milvus-querynode1 | [2022/07/12 03:17:28.832 +00:00] [WARN] [node.go:83] ["some node(s) haven't received input"] [list="[nodeCtxTtChecker-dmInputNode-query-432856225019330561-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_4_432856225110818817v0,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-dmInputNode-query-432856224953532417-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-dmInputNode-query-432856225110818817-by-dev-rootcoord-delta_5_432856225110818817v1,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-dmInputNode-query-432856224953532417-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_4_432856225110818817v0,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_5_432856225110818817v1,nodeCtxTtChecker-dmInputNode-query-432856225110818817-by-dev-rootcoord-delta_4_432856225110818817v0,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_5_432856225110818817v1,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_4_432856225110818817v0,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_5_432856225110818817v1,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-dmInputNode-query-432856225019330561-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_2_432856225019330561v0]"] ["duration "=2m0s]
milvus-querynode1 | [2022/07/12 03:17:37.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856638289084417] [tSafe=434528808388788225] [tSafe_p=2022/07/12 03:17:37.196 +00:00] [channel=by-dev-rootcoord-dml_7_432856638289084417v1]
milvus-querynode1 | [2022/07/12 03:17:47.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225019330561] [tSafe=434528811009966081] [tSafe_p=2022/07/12 03:17:47.195 +00:00] [channel=by-dev-rootcoord-dml_2_432856225019330561v0]
milvus-querynode1 | [2022/07/12 03:17:57.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225019330561] [tSafe=434528813618298881] [tSafe_p=2022/07/12 03:17:57.145 +00:00] [channel=by-dev-rootcoord-dml_3_432856225019330561v1]
milvus-querynode1 | [2022/07/12 03:18:07.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856638289084417] [tSafe=434528816252846081] [tSafe_p=2022/07/12 03:18:07.195 +00:00] [channel=by-dev-rootcoord-dml_6_432856638289084417v0]
milvus-querynode1 | [2022/07/12 03:18:17.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856638289084417] [tSafe=434528818874286081] [tSafe_p=2022/07/12 03:18:17.195 +00:00] [channel=by-dev-rootcoord-dml_6_432856638289084417v0]
milvus-querynode1 | [2022/07/12 03:18:27.298 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225019330561] [tSafe=434528821495726081] [tSafe_p=2022/07/12 03:18:27.195 +00:00] [channel=by-dev-rootcoord-dml_3_432856225019330561v1]
milvus-querynode1 | [2022/07/12 03:18:37.300 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=433802452728283137] [tSafe=434528824117428225] [tSafe_p=2022/07/12 03:18:37.196 +00:00] [channel=by-dev-rootcoord-dml_14_433802452728283137v0]
milvus-querynode1 | [2022/07/12 03:18:47.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225019330561] [tSafe=434528826738868225] [tSafe_p=2022/07/12 03:18:47.196 +00:00] [channel=by-dev-rootcoord-dml_3_432856225019330561v1]
milvus-querynode1 | [2022/07/12 03:18:57.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856224953532417] [tSafe=434528829360046081] [tSafe_p=2022/07/12 03:18:57.195 +00:00] [channel=by-dev-rootcoord-dml_1_432856224953532417v1]
milvus-querynode1 | [2022/07/12 03:19:07.300 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856224953532417] [tSafe=434528831981486081] [tSafe_p=2022/07/12 03:19:07.195 +00:00] [channel=by-dev-rootcoord-dml_0_432856224953532417v0]
milvus-querynode1 | [2022/07/12 03:19:17.301 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225110818817] [tSafe=434528834602926081] [tSafe_p=2022/07/12 03:19:17.195 +00:00] [channel=by-dev-rootcoord-dml_5_432856225110818817v1]

[querynode2]
milvus-querynode2 | [2022/07/12 03:17:06.616 +00:00] [WARN] [node.go:83] ["some node(s) haven't received input"] [list="[nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-dmInputNode-query-432856224953532417-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-dmInputNode-query-432856224953532417-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-dmInputNode-query-432856225019330561-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-dmInputNode-query-432856225019330561-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_3_432856225019330561v1]"] ["duration "=2m0s]
milvus-querynode2 | [2022/07/12 03:17:07.302 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=433802452728283137] [tSafe=434528800524206081] [tSafe_p=2022/07/12 03:17:07.195 +00:00] [channel=by-dev-rootcoord-dml_14_433802452728283137v0]
milvus-querynode2 | [2022/07/12 03:17:17.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=433802452728283137] [tSafe=434528803145646081] [tSafe_p=2022/07/12 03:17:17.195 +00:00] [channel=by-dev-rootcoord-dml_14_433802452728283137v0]
milvus-querynode2 | [2022/07/12 03:17:27.300 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=433802452728283137] [tSafe=434528805767348225] [tSafe_p=2022/07/12 03:17:27.196 +00:00] [channel=by-dev-rootcoord-dml_14_433802452728283137v0]
milvus-querynode2 | [2022/07/12 03:17:37.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=433802452728283137] [tSafe=434528808388788225] [tSafe_p=2022/07/12 03:17:37.196 +00:00] [channel=by-dev-rootcoord-dml_15_433802452728283137v1]
milvus-querynode2 | [2022/07/12 03:17:47.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=433802452728283137] [tSafe=434528811009966081] [tSafe_p=2022/07/12 03:17:47.195 +00:00] [channel=by-dev-rootcoord-dml_15_433802452728283137v1]
milvus-querynode2 | [2022/07/12 03:17:57.300 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225110818817] [tSafe=434528813618298881] [tSafe_p=2022/07/12 03:17:57.145 +00:00] [channel=by-dev-rootcoord-dml_4_432856225110818817v0]
milvus-querynode2 | [2022/07/12 03:18:07.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225110818817] [tSafe=434528816252846081] [tSafe_p=2022/07/12 03:18:07.195 +00:00] [channel=by-dev-rootcoord-dml_4_432856225110818817v0]
milvus-querynode2 | [2022/07/12 03:18:17.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225110818817] [tSafe=434528818874286081] [tSafe_p=2022/07/12 03:18:17.195 +00:00] [channel=by-dev-rootcoord-dml_4_432856225110818817v0]
milvus-querynode2 | [2022/07/12 03:18:27.298 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225019330561] [tSafe=434528821495726081] [tSafe_p=2022/07/12 03:18:27.195 +00:00] [channel=by-dev-rootcoord-dml_3_432856225019330561v1]
milvus-querynode2 | [2022/07/12 03:18:37.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225110818817] [tSafe=434528824117428225] [tSafe_p=2022/07/12 03:18:37.196 +00:00] [channel=by-dev-rootcoord-dml_4_432856225110818817v0]
milvus-querynode2 | [2022/07/12 03:18:47.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856638289084417] [tSafe=434528826738868225] [tSafe_p=2022/07/12 03:18:47.196 +00:00] [channel=by-dev-rootcoord-dml_7_432856638289084417v1]
milvus-querynode2 | [2022/07/12 03:18:57.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856224953532417] [tSafe=434528829360046081] [tSafe_p=2022/07/12 03:18:57.195 +00:00] [channel=by-dev-rootcoord-dml_1_432856224953532417v1]

p363796822 avatar Jul 12 '22 03:07 p363796822

/unassign @p363796822 /assign @wayblink

soothing-rain avatar Jul 13 '22 04:07 soothing-rain

No critical error is printed. What problem do you see on the client side? Please upload the complete log.

wayblink avatar Jul 18 '22 08:07 wayblink

The log above was output during a query. The query eventually times out, and the data cannot be retrieved.

p363796822 avatar Jul 18 '22 08:07 p363796822

some node(s) haven't received input

Please upload the complete log if you can reproduce the situation. Currently it is not enough for me to find the root cause.

wayblink avatar Jul 18 '22 13:07 wayblink

@p363796822 Can you find the milvus log? Do you need other help?

JackLCL avatar Sep 05 '22 06:09 JackLCL

/close

wayblink avatar Sep 21 '22 08:09 wayblink

@wayblink: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sre-ci-robot avatar Sep 21 '22 08:09 sre-ci-robot