[Bug]: [kafka] "wait_for_compaction_completed" fails with "Get compaction state timeout"
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version:latest
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
wait_for_compaction_completed fails with "Get compaction state timeout":
[2022-06-22T15:55:45.885Z] [2022-06-22 15:07:10 - ERROR - pymilvus.decorators]: RPC error: [wait_for_compaction_completed], <MilvusException: (code=1, message=Get compaction state timeout)>, <Time:{'RPC start': '2022-06-22 15:06:09.762767', 'RPC error': '2022-06-22 15:07:10.048962'}> (decorators.py:78)
Expected Behavior
No error
Steps To Reproduce
Running nightly:
https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/609/pipeline/145
Milvus Log
- milvus log: artifacts-milvus-distributed-kafka-master-609-pymilvus-e2e-logs.tar.gz
- collection name: compact_6jKXYIzM
- FAILED time: [2022-06-22T15:07:10.701Z] [gw4] [ 16%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_during_index
Anything else?
No response
https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-ha-ci/detail/PR-17742/4/pipeline
It occurs in the following nightly again: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/616/pipeline/141
logs: artifacts-milvus-distributed-kafka-master-616-pymilvus-e2e-logs.tar.gz
collection: compact_n6v8Gg6j
[2022/06/25 19:06:26.307 +00:00] [INFO] [compaction_trigger.go:364] ["generate a plan for priority candidates"] [plan="segmentBinlogs:<segmentID:434158693092950017 fieldBinlogs:<fieldID:100 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693092950017/100/434158693800738817" log_size:10265 > > fieldBinlogs:<fieldID:101 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693092950017/101/434158693800738818" log_size:12507 > > fieldBinlogs:<fieldID:102 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693092950017/102/434158693800738819" log_size:6905 > > fieldBinlogs:<fieldID:103 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693092950017/103/434158693800738820" log_size:1384546 > > fieldBinlogs:<binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693092950017/0/434158693800738821" log_size:8132 > > fieldBinlogs:<fieldID:1 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693092950017/1/434158693800738822" log_size:543 > > field2StatslogPaths:<fieldID:100 binlogs:<log_path:"file/stats_log/434158693027676161/434158693027676162/434158693092950017/100/434158693800738817" log_size:183905 > > deltalogs:<> > segmentBinlogs:<segmentID:434158693958025217 fieldBinlogs:<fieldID:100 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693958025217/100/434158694626492417" log_size:8129 > > fieldBinlogs:<fieldID:101 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693958025217/101/434158694626492418" log_size:12296 > > fieldBinlogs:<fieldID:102 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693958025217/102/434158694626492419" log_size:6676 > > fieldBinlogs:<fieldID:103 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693958025217/103/434158694626492420" log_size:1384511 > > fieldBinlogs:<binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693958025217/0/434158694626492421" log_size:8132 > > fieldBinlogs:<fieldID:1 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693958025217/1/434158694626492422" log_size:543 > > field2StatslogPaths:<fieldID:100 binlogs:<log_path:"file/stats_log/434158693027676161/434158693027676162/434158693958025217/100/434158694626492417" log_size:183908 > > deltalogs:<> > type:MixCompaction timetravel:434045448693743616 channel:"by-dev-rootcoord-dml_227_434158693027676161v0" "] ["target segment row"=6000] ["target segment size"=3210998]
[2022/06/25 19:06:26.308 +00:00] [INFO] [compaction_trigger.go:256] ["time cost of generating global compaction"] [planID=434158694901743618] ["time cost"=0] [collectionID=434158693027676161] [channel=by-dev-rootcoord-dml_227_434158693027676161v0] [partitionID=434158693027676162]
This case fails in the following scenario (a rough pymilvus sketch follows the log excerpts below):
- the segment starts building its index on IndexCoord
- IndexCoord acquires a reference lock on the segment from DataCoord
- DataCoord starts compacting the same segment
- the compaction needs the lock to be released, but the wait times out
./mdk-616-n-milvus-datacoord-664d85db5b-ltz5h_milvus-ci_datacoord-6d5dc77ae0ad3e773bacafcbd95fc48731226fbc1d5f5b0459a971cd4780828d.log:44889:[2022/06/25 19:06:26.598 +00:00] [INFO] [segment_reference_manager.go:87] ["add reference lock on segments"] [segIDs="[434158693092950017]"] [nodeID=5]
./mdk-616-n-milvus-datacoord-664d85db5b-ltz5h_milvus-ci_datacoord-6d5dc77ae0ad3e773bacafcbd95fc48731226fbc1d5f5b0459a971cd4780828d.log:44891:[2022/06/25 19:06:26.601 +00:00] [INFO] [segment_reference_manager.go:112] ["add reference lock on segments successfully"] [segIDs="[434158693092950017]"] [nodeID=5]
./mdk-616-n-milvus-datacoord-664d85db5b-ltz5h_milvus-ci_datacoord-6d5dc77ae0ad3e773bacafcbd95fc48731226fbc1d5f5b0459a971cd4780828d.log:44933:[2022/06/25 19:06:26.798 +00:00] [ERROR] [services.go:807] ["failed to complete compaction"] [planID=434158694901743618] [error="can not be compacted, segment with ID 434158693092950017 has reference lock"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*Server).CompleteCompaction\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:807\ngithub.com/milvus-io/milvus/internal/distributed/datacoord.(*Server).CompleteCompaction\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/service.go:296\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_CompleteCompaction_Handler.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5133\ngithub.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/tracing/opentracing/server_interceptors.go:38\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_CompleteCompaction_Handler\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5135\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1282\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1616\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:921"]
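The pattern above boils down to roughly the following pymilvus calls. This is only an illustrative sketch, not the actual test code; the connection details, collection name, field name, and index parameters are assumptions.

```python
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")   # assumed local deployment
collection = Collection("compact_demo")                # hypothetical collection with flushed data

# 1. Ask IndexCoord to build an index; while building, IndexCoord holds a
#    reference lock on the segments (acquired from DataCoord).
collection.create_index(
    field_name="float_vector",                         # assumed vector field name
    index_params={"index_type": "IVF_FLAT", "metric_type": "IP", "params": {"nlist": 64}},
)

# 2. Trigger compaction on the same segments while the reference lock may still
#    be held; CompleteCompaction is rejected with "has reference lock" and the
#    client-side wait eventually times out.
collection.compact()
collection.wait_for_compaction_completed(timeout=60)   # raises "Get compaction state timeout"
```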
cc @xiaocai2333
The fix requires modifying the test case (see the sketch after this list):
- run compact only after create index has finished, or
- run create index only after compact has finished
- increase the timeout to 90s
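For reference, a minimal sketch of the proposed ordering, written against the public pymilvus API rather than the test wrapper; the collection name, field name, and index parameters below are assumptions.

```python
from pymilvus import connections, Collection, utility

connections.connect(host="localhost", port="19530")
collection = Collection("compact_demo")   # assumed existing collection with flushed data
index_params = {"index_type": "IVF_FLAT", "metric_type": "IP", "params": {"nlist": 64}}

# Option 1: compact only after the index has been built, so the segment
# reference lock held by IndexCoord is already released.
collection.create_index(field_name="float_vector", index_params=index_params)
utility.wait_for_index_building_complete("compact_demo")   # ensure the index build is done
collection.compact()
collection.wait_for_compaction_completed(timeout=90)        # timeout raised to 90s

# Option 2 is the reverse order: compact (and wait for completion) first,
# then create the index.
```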
@jaime0815 After increasing the timeout to 90s (collection_w.wait_for_compaction_completed(timeout=90)), the pulsar nightly passes now.
But the kafka nightly still reports this error.
https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-release-nightly/detail/milvus-release-nightly/125/pipeline
The same issue; try increasing the timeout to 180s.
./mdk-release-125-n-milvus-datanode-7595d5dcc8-8qg75_milvus-ci_datanode-a780b8de5fa45799980bfd5420413e5098fb49659ce7f6743df368149b4ed07e.log:56179:[2022/06/30 20:51:50.979 +00:00] [ERROR] [compactor.go:524] ["complete compaction wrong"] [planID=434273599027150850] [reason="can not be compacted, segment with ID 434273597310369793 has reference lock"] [stack="github.com/milvus-io/milvus/internal/datanode.(*compactionTask).compact\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/compactor.go:524\ngithub.com/milvus-io/milvus/internal/datanode.(*compactionExecutor).executeTask\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/compaction_executor.go:99"]
./mdk-release-125-n-milvus-datacoord-5fb48b684f-mxjgr_milvus-ci_datacoord-68cca3318d52f646551087ac544e6f78110c39051a72504a34d2f0187af6a841.log:65951:[2022/06/30 20:54:25.946 +00:00] [INFO] [segment_reference_manager.go:112] ["add reference lock on segments successfully"] [segIDs="[434273638886146050]"] [nodeID=1]
./mdk-release-125-n-milvus-indexcoord-85b4bf789b-x4b8g_milvus-ci_indexcoord-d24f1b58d529b851e33a970099ec3bb1d5a14f7eb3e4b7fe20d19ad2a6c6f9e9.log:9017:[2022/06/30 20:54:25.713 +00:00] [DEBUG] [index_coord.go:407] ["IndexCoord building index ..."] [segmentID=434273638886146050] [IndexName=_default_idx] [IndexID=434273639030587393] [DataPath="[file/insert_log/434273638034178049/434273638034178050/434273638886146050/103/434273639633256452]"] [TypeParams="[{\"key\":\"dim\",\"value\":\"128\"}]"] [IndexParams="[{\"key\":\"index_type\",\"value\":\"IVF_FLAT\"},{\"key\":\"metric_type\",\"value\":\"IP\"},{\"key\":\"params\",\"value\":\"{\\\"nlist\\\": 64}\"}]"] [numRows=2994] ["field type"=FloatVector]
./mdk-release-125-n-milvus-datacoord-5fb48b684f-mxjgr_milvus-ci_datacoord-68cca3318d52f646551087ac544e6f78110c39051a72504a34d2f0187af6a841.log:65971:[2022/06/30 20:54:26.107 +00:00] [ERROR] [services.go:807] ["failed to complete compaction"] [planID=434273639646363650] [error="can not be compacted, segment with ID 434273638886146050 has reference lock"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*Server).CompleteCompaction\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:807\ngithub.com/milvus-io/milvus/internal/distributed/datacoord.(*Server).CompleteCompaction\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/service.go:296\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_CompleteCompaction_Handler.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5133\ngithub.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/tracing/opentracing/server_interceptors.go:38\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_CompleteCompaction_Handler\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5135\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1282\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1616\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:921"]
/assign @binbinlv
@jaime0815 After changing the timeout to 180s, the issue still exists.
https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/633/pipeline/141
- milvus log: artifacts-milvus-distributed-kafka-master-633-pymilvus-e2e-logs.tar.gz
- collection name: compact_SxLvr12w
- Failed time: [2022-07-04T15:08:13.113Z] [gw0] [ 26%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_during_index
/assign
Compact will fail directly if the segment reference lock is held.
To fix this, either catch the error on the client side and retry when compact fails, or have DataCoord run the compaction after the lock is released, with a timeout. A client-side version could look like the sketch below.
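A rough sketch of the client-side option, using a hypothetical compact_with_retry helper; the retry count, sleep interval, connection details, and collection name are arbitrary assumptions.

```python
import time

from pymilvus import connections, Collection
from pymilvus.exceptions import MilvusException


def compact_with_retry(collection: Collection, retries: int = 3, wait_timeout: int = 90):
    """Retry compaction while the segment reference lock may still be held."""
    for attempt in range(retries):
        try:
            collection.compact()
            collection.wait_for_compaction_completed(timeout=wait_timeout)
            return
        except MilvusException as e:
            # e.g. "Get compaction state timeout" while IndexCoord holds the lock
            print(f"compaction attempt {attempt + 1} failed: {e}")
            time.sleep(10)   # give IndexCoord time to release the reference lock
    raise RuntimeError("compaction did not complete after retries")


connections.connect(host="localhost", port="19530")   # assumed local deployment
compact_with_retry(Collection("compact_demo"))         # hypothetical collection name
```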
After a brief discussion with @xiaofan-luan: if there is a reference lock during compaction, the plan is marked as completed directly, so the compaction task is wasted.
It is now reproduced in nearly every nightly on both 2.1.0 and master, so setting it as urgent.
Same issue in latest nightly: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/646/pipeline/124
- milvus log: artifacts-milvus-standalone-master-646-pymilvus-e2e-logs.tar.gz
- collection name: compact_xLMWwHx1
- FAILED time: [2022-07-10T19:03:09.639Z] [gw3] [ 25%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_during_index
Same issue in latest nightly for master: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/648/pipeline
- milvus log: artifacts-milvus-standalone-master-648-pymilvus-e2e-logs.tar.gz
- collection name: compact_9YoL0i9A
- Failed time: [2022-07-11T19:03:29.695Z] [gw3] [ 25%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_during_index
Same issue in latest nightly for 2.1.0 branch https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-release-nightly/detail/milvus-release-nightly/148/pipeline
- milvus log: milvus-standalone-148-pymilvus-e2e-logs.tar.gz
- collection name: compact_Qyo75tD3
- Failed time: [2022-07-11T23:47:44.033Z] [gw2] [ 32%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_during_index
Refer to issue #18190.
This issue appears continuously nearly every nightly, and the latest is: https://ci.milvus.io:18080/jenkins/blue/rest/organizations/jenkins/pipelines/milvus-nightly-ci/branches/master/runs/676/nodes/124/steps/162/log/?start=0
- milvus log: artifacts-milvus-standalone-master-676-pymilvus-e2e-logs.tar.gz
- collection name: compact_rIji3Y2O
- Failed time: [2022-07-25T19:04:35.740Z] [gw4] [ 24%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_during_index
This issue appears continuously nearly every nightly, and the latest is: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/25/pipeline/123
- milvus log: artifacts-milvus-standalone-nightly-25-pymilvus-e2e-logs.tar.gz
- collection name: compact_gFARsRGw
- Failed time: [2022-08-08T19:02:16.097Z] [gw0] [ 29%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_during_index
This issue appears again: https://jenkins.milvus.io:18080/blue/rest/organizations/jenkins/pipelines/qa/pipelines/milvus-release-nightly/runs/57/nodes/142/steps/165/log/?start=0
Taking one failed case (test_compact_manual_and_auto) as an example:
- milvus log: milvus-standalone-57-pymilvus-e2e-logs.tar.gz
- collection name: compact_Jzy7BynZ
- Failed timeline: [2022-08-23T23:22:13.408Z] [gw5] [ 14%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_manual_and_auto
- Failed case:
@pytest.mark.tags(CaseLabel.L1)
def test_compact_manual_and_auto(self):
    """
    target: test compact manual and auto
    method: 1.create with shard_num=1
            2.insert one and flush (11 times)
            3.compact
            4.load and search
    expected: Verify segments info
    """
    # greater than auto-merge threshold 10
    num_of_segment = ct.compact_segment_num_threshold + 1

    # create collection shard_num=1, insert 11 segments, each with one entity
    collection_w = self.collection_insert_multi_segments_one_shard(prefix, num_of_segment=num_of_segment)

    # waiting for auto compaction finished
    collection_w.collection.compaction_id = 0
>   collection_w.wait_for_compaction_completed()
This is not only a kafka issue, so removing the kafka-only label.
@jiaoew1991
Hello Enwei, I see you changed this case yesterday. Could you help investigate this issue? Thanks.
The test case I changed yesterday is not right; after discussing with @ThreadDao, the test should replace waiting for compaction completed with sleeping for one minute, roughly as in the sketch below.
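A minimal sketch of that change inside the test, where collection_w is the wrapper used in testcases/test_compaction.py; this is illustrative only, not the actual patch.

```python
import time

collection_w.compact()
# collection_w.wait_for_compaction_completed()   # previous wait, hits "Get compaction state timeout"
time.sleep(60)                                    # instead, sleep one minute and let compaction finish
```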
The issue continuously appears in both master and 2.1.*
Taking one failed case (test_compact_only_growing_segment) as an example:
- Collection name: compact_92BffaT3
- milvus log: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/57/artifacts
- Failed timeline: [2022-08-24T19:03:19.003Z] [gw2] [ 13%] FAILED testcases/test_compaction.py::TestCompactionParams::test_compact_only_growing_segment
The issue continuously appears in both master and 2.1.*
Taking one failed case (test_compact_manual_and_auto) as an example:
- milvus: 2.1.* (a9eb72e5), pymilvus: 2.1.2.dev2
- collection name: compact_pGNmV856
[2022-08-29T00:04:47.692Z] [2022-08-28 23:21:11 - ERROR - pymilvus.decorators]: RPC error: [wait_for_compaction_completed], <MilvusException: (code=1, message=Get compaction state timeout)>, <Time:{'RPC start': '2022-08-28 23:20:11.410363', 'RPC error': '2022-08-28 23:21:11.624625'}> (decorators.py:95)
The issue steadily appears every day. Taking one failed case (test_compact_manual_and_auto) as an example:
- milvus: 2.1.* pymilvus:2.1.2.dev2
- collection name: compact_QlHsE95l
- log: milvus-standalone-81-pymilvus-e2e-logs.tar.gz
[2022-09-05T00:14:23.791Z] [2022-09-04 23:22:33 - ERROR - pymilvus.decorators]: RPC error: [wait_for_compaction_completed], <MilvusException: (code=1, message=Get compaction state timeout)>, <Time:{'RPC start': '2022-09-04 23:21:33.329105', 'RPC error': '2022-09-04 23:22:33.545751'}> (decorators.py:95)
This issue no longer occurs in master (c58a8d8).