[Bug]: complete compaction failed: etcdserver: request is too large
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: milvus-dev:2.1.0-latest(image id:9f3ff7f688fc)
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 16c/64G
- GPU: None
- Others:
Current Behavior
- Deployed Milvus 2.0.2 on a single server with docker-compose
- The number of querynode and datanode instances was increased to 2
- A problem occurred last Monday morning: one datanode crashed, and its error log contained messages such as: ResourceExhausted desc=trying to send message larger than max (3123942 vs. 2097152)
- The same "larger than max" error kept appearing after restarting the service
- After switching to the milvus-dev:2.1.0-latest image (image id: 9f3ff7f688fc), the service restarted successfully
- Another problem occurred this Monday: one querynode crashed and some collections could not be queried. The specific error message was: failed to complete compaction: etcdserver: request is too large
- After restarting the service, no node crashes anymore, but some collections still cannot be queried
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
No response
Anything else?
No response
Hi @p363796822, do you perform deletes in your use case? We just fixed an issue about compaction: if too many entities are deleted, there will be many delta log paths, which causes the segment metadata to become really large. If you are still testing, dropping the collection would work; otherwise we will have to think of a workaround to recover.
- The data is not deleted; we write and query at the same time. The error last Monday happened suddenly, and after it occurred, version 2.0.2 could not be started. After trying various approaches, I finally got the cluster to start with your latest image.
- The data is no longer test data but data generated in production, so simply dropping the collection is not an option.
- Do you issue deletes or frequent num_entities/flush calls?
- If you can change the etcd config to relax the 2 MB request limit to 5 MB, the problem should resolve itself automatically once compaction finishes. Would that be a workable fix?
- There are no num_entities/flush calls in our code, and the data is never deleted after it is written.
- How do I modify the etcd configuration? On version 2.0.2 I tried adding the setting to the docker-compose file and to the Milvus configuration file milvus.yaml, but it didn't take effect. Could you give a few more details on how to change the etcd configuration? Thank you.
Could you share your logs? We need more logs to investigate why this happens, because without frequent deletes or flushes this issue should not occur.
@p363796822 Could you please refer to this script to export the whole set of Milvus logs for investigation?
/assign @p363796822 /unassign
The full logs are too big, so I will export the logs from the day before and the day after the error and upload them here to see if they help the investigation. If not, I will export more logs.
Link: https://pan.baidu.com/s/1zVz-aTtnBxqzlJatcoEnKw?pwd=xe5q (extraction code: xe5q)
@xiaofan-luan could you please have someone take a look at this issue? I think we are working on some improvements for the etcd request size.
https://pan.baidu.com/s/1zVz-aTtnBxqzlJatcoEnKw?pwd=xe5q This is the cloud-drive link to the logs; please help take a look at what the specific problem is.
I have looked at the log. There are a lot of input binlogs in CompleteMergeCompaction, which leads to a huge Binlogs field in the new compacted SegmentInfo (too large to store in etcd). We plan to save binlogs separately, see #17988, and we will also investigate whether it is reasonable for so many binlogs to be compacted at once.
So is this problem solved now? Is there a new version to try?
Not yet. For your situation you need to change the etcd configuration as mentioned above. If you use docker-compose to start the whole Milvus cluster including etcd, you can change the configuration by editing the compose file; refer to: https://github.com/milvus-io/milvus/pull/17357/files#diff-10e860e50bee7caf0095374d72b2afdbb81b82704ef8c062f93d7adaf0ac3b54
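For reference, below is a minimal sketch of what that change could look like in the etcd service of a docker-compose file. The image tag and the surrounding settings are placeholders based on a typical Milvus compose file, not taken from the linked PR, so keep whatever your existing file already uses and only add the size-related flags:

```yaml
# Hypothetical excerpt from docker-compose.yml; only the etcd service is shown.
etcd:
  container_name: milvus-etcd
  image: quay.io/coreos/etcd:v3.5.0   # assumption: keep the tag your compose file already uses
  volumes:
    - ./volumes/etcd:/etcd
  command:
    - etcd
    - --data-dir=/etcd
    - --listen-client-urls=http://0.0.0.0:2379
    - --advertise-client-urls=http://127.0.0.1:2379
    - --max-request-bytes=5242880        # raise the etcd request-size limit to 5 MB (default is roughly 1.5 MB)
    - --quota-backend-bytes=4294967296   # optional: enlarge the backend storage quota as well
```

Note that this limit belongs to etcd itself, so it has to be set on the etcd service (via command-line flags or the equivalent ETCD_* environment variables), not in milvus.yaml, which is why adding it to milvus.yaml did not take effect. After editing the compose file, recreate the etcd container so the new flags apply.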
There is a new problem, the log is as follows:

[querynode1]
milvus-querynode1 | [2022/07/12 03:17:28.832 +00:00] [WARN] [node.go:83] ["some node(s) haven't received input"] [list="[nodeCtxTtChecker-dmInputNode-query-432856225019330561-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_4_432856225110818817v0,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-dmInputNode-query-432856224953532417-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-dmInputNode-query-432856225110818817-by-dev-rootcoord-delta_5_432856225110818817v1,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-dmInputNode-query-432856224953532417-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_4_432856225110818817v0,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_5_432856225110818817v1,nodeCtxTtChecker-dmInputNode-query-432856225110818817-by-dev-rootcoord-delta_4_432856225110818817v0,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_5_432856225110818817v1,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_4_432856225110818817v0,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_5_432856225110818817v1,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-dmInputNode-query-432856225019330561-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_2_432856225019330561v0]"] ["duration "=2m0s]
milvus-querynode1 | [2022/07/12 03:17:37.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856638289084417] [tSafe=434528808388788225] [tSafe_p=2022/07/12 03:17:37.196 +00:00] [channel=by-dev-rootcoord-dml_7_432856638289084417v1]
milvus-querynode1 | [2022/07/12 03:17:47.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225019330561] [tSafe=434528811009966081] [tSafe_p=2022/07/12 03:17:47.195 +00:00] [channel=by-dev-rootcoord-dml_2_432856225019330561v0]
milvus-querynode1 | [2022/07/12 03:17:57.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225019330561] [tSafe=434528813618298881] [tSafe_p=2022/07/12 03:17:57.145 +00:00] [channel=by-dev-rootcoord-dml_3_432856225019330561v1]
milvus-querynode1 | [2022/07/12 03:18:07.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856638289084417] [tSafe=434528816252846081] [tSafe_p=2022/07/12 03:18:07.195 +00:00] [channel=by-dev-rootcoord-dml_6_432856638289084417v0]
milvus-querynode1 | [2022/07/12 03:18:17.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856638289084417] [tSafe=434528818874286081] [tSafe_p=2022/07/12 03:18:17.195 +00:00] [channel=by-dev-rootcoord-dml_6_432856638289084417v0]
milvus-querynode1 | [2022/07/12 03:18:27.298 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225019330561] [tSafe=434528821495726081] [tSafe_p=2022/07/12 03:18:27.195 +00:00] [channel=by-dev-rootcoord-dml_3_432856225019330561v1]
milvus-querynode1 | [2022/07/12 03:18:37.300 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=433802452728283137] [tSafe=434528824117428225] [tSafe_p=2022/07/12 03:18:37.196 +00:00] [channel=by-dev-rootcoord-dml_14_433802452728283137v0]
milvus-querynode1 | [2022/07/12 03:18:47.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225019330561] [tSafe=434528826738868225] [tSafe_p=2022/07/12 03:18:47.196 +00:00] [channel=by-dev-rootcoord-dml_3_432856225019330561v1]
milvus-querynode1 | [2022/07/12 03:18:57.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856224953532417] [tSafe=434528829360046081] [tSafe_p=2022/07/12 03:18:57.195 +00:00] [channel=by-dev-rootcoord-dml_1_432856224953532417v1]
milvus-querynode1 | [2022/07/12 03:19:07.300 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856224953532417] [tSafe=434528831981486081] [tSafe_p=2022/07/12 03:19:07.195 +00:00] [channel=by-dev-rootcoord-dml_0_432856224953532417v0]
milvus-querynode1 | [2022/07/12 03:19:17.301 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225110818817] [tSafe=434528834602926081] [tSafe_p=2022/07/12 03:19:17.195 +00:00] [channel=by-dev-rootcoord-dml_5_432856225110818817v1]

[querynode2]
milvus-querynode2 | [2022/07/12 03:17:06.616 +00:00] [WARN] [node.go:83] ["some node(s) haven't received input"] [list="[nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-dmInputNode-query-432856224953532417-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-dmInputNode-query-432856224953532417-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-dmInputNode-query-432856225019330561-by-dev-rootcoord-delta_3_432856225019330561v1,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-stNode-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-dmInputNode-query-432856225019330561-by-dev-rootcoord-delta_2_432856225019330561v0,nodeCtxTtChecker-dNode-by-dev-rootcoord-delta_0_432856224953532417v0,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_1_432856224953532417v1,nodeCtxTtChecker-fdNode-by-dev-rootcoord-delta_3_432856225019330561v1]"] ["duration "=2m0s]
milvus-querynode2 | [2022/07/12 03:17:07.302 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=433802452728283137] [tSafe=434528800524206081] [tSafe_p=2022/07/12 03:17:07.195 +00:00] [channel=by-dev-rootcoord-dml_14_433802452728283137v0]
milvus-querynode2 | [2022/07/12 03:17:17.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=433802452728283137] [tSafe=434528803145646081] [tSafe_p=2022/07/12 03:17:17.195 +00:00] [channel=by-dev-rootcoord-dml_14_433802452728283137v0]
milvus-querynode2 | [2022/07/12 03:17:27.300 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=433802452728283137] [tSafe=434528805767348225] [tSafe_p=2022/07/12 03:17:27.196 +00:00] [channel=by-dev-rootcoord-dml_14_433802452728283137v0]
milvus-querynode2 | [2022/07/12 03:17:37.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=433802452728283137] [tSafe=434528808388788225] [tSafe_p=2022/07/12 03:17:37.196 +00:00] [channel=by-dev-rootcoord-dml_15_433802452728283137v1]
milvus-querynode2 | [2022/07/12 03:17:47.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=433802452728283137] [tSafe=434528811009966081] [tSafe_p=2022/07/12 03:17:47.195 +00:00] [channel=by-dev-rootcoord-dml_15_433802452728283137v1]
milvus-querynode2 | [2022/07/12 03:17:57.300 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225110818817] [tSafe=434528813618298881] [tSafe_p=2022/07/12 03:17:57.145 +00:00] [channel=by-dev-rootcoord-dml_4_432856225110818817v0]
milvus-querynode2 | [2022/07/12 03:18:07.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225110818817] [tSafe=434528816252846081] [tSafe_p=2022/07/12 03:18:07.195 +00:00] [channel=by-dev-rootcoord-dml_4_432856225110818817v0]
milvus-querynode2 | [2022/07/12 03:18:17.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225110818817] [tSafe=434528818874286081] [tSafe_p=2022/07/12 03:18:17.195 +00:00] [channel=by-dev-rootcoord-dml_4_432856225110818817v0]
milvus-querynode2 | [2022/07/12 03:18:27.298 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225019330561] [tSafe=434528821495726081] [tSafe_p=2022/07/12 03:18:27.195 +00:00] [channel=by-dev-rootcoord-dml_3_432856225019330561v1]
milvus-querynode2 | [2022/07/12 03:18:37.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856225110818817] [tSafe=434528824117428225] [tSafe_p=2022/07/12 03:18:37.196 +00:00] [channel=by-dev-rootcoord-dml_4_432856225110818817v0]
milvus-querynode2 | [2022/07/12 03:18:47.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856638289084417] [tSafe=434528826738868225] [tSafe_p=2022/07/12 03:18:47.196 +00:00] [channel=by-dev-rootcoord-dml_7_432856638289084417v1]
milvus-querynode2 | [2022/07/12 03:18:57.299 +00:00] [DEBUG] [flow_graph_service_time_node.go:71] ["update tSafe:"] [collectionID=432856224953532417] [tSafe=434528829360046081] [tSafe_p=2022/07/12 03:18:57.195 +00:00] [channel=by-dev-rootcoord-dml_1_432856224953532417v1]
/unassign @p363796822 /assign @wayblink
No critical error is printed. What is the problem on the client side? Could you upload the complete log?
The above log is output during the query; the query ultimately times out and the data cannot be retrieved normally.
some node(s) haven't received input
Please upload the complete log if you can reproduce the situation; what we have so far is not enough for me to find the root cause.
@p363796822 Were you able to find the Milvus log? Do you need any other help?
/close
@wayblink: Closing this issue.