
[Bug]: [kafka] "wait_for_compaction_completed" fails with "Get compaction state timeout"

binbinlv opened this issue · 21 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: latest
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

wait_for_compaction_completed fails with "Get compaction state timeout":


[2022-06-22T15:55:45.885Z] [2022-06-22 15:07:10 - ERROR - pymilvus.decorators]: RPC error: [wait_for_compaction_completed], <MilvusException: (code=1, message=Get compaction state timeout)>, <Time:{'RPC start': '2022-06-22 15:06:09.762767', 'RPC error': '2022-06-22 15:07:10.048962'}> (decorators.py:78)
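For context, wait_for_compaction_completed is essentially a polling loop over the compaction state, and the error above means the state never reached Completed before the client-side deadline. Below is a minimal sketch of an equivalent loop; it is an approximation, not the pymilvus implementation, and names such as State.Completed and its import path are assumptions that may differ across pymilvus versions:

import time
from pymilvus import Collection
from pymilvus.client.types import State  # assumed import path

def wait_for_compaction(coll: Collection, timeout: float = 60.0, poll: float = 1.0):
    # Rough equivalent of Collection.wait_for_compaction_completed(); the real
    # method raises MilvusException(code=1, message="Get compaction state timeout")
    # when the deadline passes.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if coll.get_compaction_state().state == State.Completed:
            return
        time.sleep(poll)
    raise TimeoutError("Get compaction state timeout")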

Expected Behavior

No error

Steps To Reproduce

Running nightly:
https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/609/pipeline/145

Milvus Log

  1. milvus log: artifacts-milvus-distributed-kafka-master-609-pymilvus-e2e-logs.tar.gz

  2. collection name: compact_6jKXYIzM

  3. FAILED time: [2022-06-22T15:07:10.701Z] [gw4] [ 16%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_during_index

Anything else?

No response

binbinlv · Jun 23 '22 02:06

https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-ha-ci/detail/PR-17742/4/pipeline

bigsheeper · Jun 24 '22 06:06

It occurs again in the following nightly: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/616/pipeline/141

logs: artifacts-milvus-distributed-kafka-master-616-pymilvus-e2e-logs.tar.gz

collection: compact_n6v8Gg6j

binbinlv · Jun 27 '22 04:06

[2022/06/25 19:06:26.307 +00:00] [INFO] [compaction_trigger.go:364] ["generate a plan for priority candidates"] [plan="segmentBinlogs:<segmentID:434158693092950017 fieldBinlogs:<fieldID:100 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693092950017/100/434158693800738817" log_size:10265 > > fieldBinlogs:<fieldID:101 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693092950017/101/434158693800738818" log_size:12507 > > fieldBinlogs:<fieldID:102 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693092950017/102/434158693800738819" log_size:6905 > > fieldBinlogs:<fieldID:103 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693092950017/103/434158693800738820" log_size:1384546 > > fieldBinlogs:<binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693092950017/0/434158693800738821" log_size:8132 > > fieldBinlogs:<fieldID:1 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693092950017/1/434158693800738822" log_size:543 > > field2StatslogPaths:<fieldID:100 binlogs:<log_path:"file/stats_log/434158693027676161/434158693027676162/434158693092950017/100/434158693800738817" log_size:183905 > > deltalogs:<> > segmentBinlogs:<segmentID:434158693958025217 fieldBinlogs:<fieldID:100 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693958025217/100/434158694626492417" log_size:8129 > > fieldBinlogs:<fieldID:101 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693958025217/101/434158694626492418" log_size:12296 > > fieldBinlogs:<fieldID:102 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693958025217/102/434158694626492419" log_size:6676 > > fieldBinlogs:<fieldID:103 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693958025217/103/434158694626492420" log_size:1384511 > > fieldBinlogs:<binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693958025217/0/434158694626492421" log_size:8132 > > fieldBinlogs:<fieldID:1 binlogs:<entries_num:3000 log_path:"file/insert_log/434158693027676161/434158693027676162/434158693958025217/1/434158694626492422" log_size:543 > > field2StatslogPaths:<fieldID:100 binlogs:<log_path:"file/stats_log/434158693027676161/434158693027676162/434158693958025217/100/434158694626492417" log_size:183908 > > deltalogs:<> > type:MixCompaction timetravel:434045448693743616 channel:"by-dev-rootcoord-dml_227_434158693027676161v0" "] ["target segment row"=6000] ["target segment size"=3210998]

[2022/06/25 19:06:26.308 +00:00] [INFO] [compaction_trigger.go:256] ["time cost of generating global compaction"] [planID=434158694901743618] ["time cost"=0] [collectionID=434158693027676161] [channel=by-dev-rootcoord-dml_227_434158693027676161v0] [partitionID=434158693027676162]

jaime0815 · Jun 28 '22 03:06

This case fails in the following scenario (a repro sketch follows the log excerpts below):

  1. The segment starts building an index on IndexCoord.
  2. IndexCoord acquires a reference lock on the segment from DataCoord.
  3. DataCoord starts compacting the same segment.
  4. Compaction needs the lock to be released, but times out waiting.

./mdk-616-n-milvus-datacoord-664d85db5b-ltz5h_milvus-ci_datacoord-6d5dc77ae0ad3e773bacafcbd95fc48731226fbc1d5f5b0459a971cd4780828d.log:44889:[2022/06/25 19:06:26.598 +00:00] [INFO] [segment_reference_manager.go:87] ["add reference lock on segments"] [segIDs="[434158693092950017]"] [nodeID=5]
./mdk-616-n-milvus-datacoord-664d85db5b-ltz5h_milvus-ci_datacoord-6d5dc77ae0ad3e773bacafcbd95fc48731226fbc1d5f5b0459a971cd4780828d.log:44891:[2022/06/25 19:06:26.601 +00:00] [INFO] [segment_reference_manager.go:112] ["add reference lock on segments successfully"] [segIDs="[434158693092950017]"] [nodeID=5]
./mdk-616-n-milvus-datacoord-664d85db5b-ltz5h_milvus-ci_datacoord-6d5dc77ae0ad3e773bacafcbd95fc48731226fbc1d5f5b0459a971cd4780828d.log:44933:[2022/06/25 19:06:26.798 +00:00] [ERROR] [services.go:807] ["failed to complete compaction"] [planID=434158694901743618] [error="can not be compacted, segment with ID 434158693092950017 has reference lock"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*Server).CompleteCompaction\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:807\ngithub.com/milvus-io/milvus/internal/distributed/datacoord.(*Server).CompleteCompaction\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/service.go:296\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_CompleteCompaction_Handler.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5133\ngithub.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/tracing/opentracing/server_interceptors.go:38\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_CompleteCompaction_Handler\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5135\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1282\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1616\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:921"]
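
A minimal pymilvus sketch of the compact-during-index pattern that the failing test (test_compact_during_index) exercises. Schema, names, and connection details are illustrative, not the actual test code; the entity counts and index params mirror the logs above:

import random
from pymilvus import (Collection, CollectionSchema, DataType, FieldSchema,
                      connections)

connections.connect(host="localhost", port="19530")  # assumed local deployment

schema = CollectionSchema([
    FieldSchema("pk", DataType.INT64, is_primary=True),
    FieldSchema("vec", DataType.FLOAT_VECTOR, dim=128),
])
coll = Collection("compact_during_index_demo", schema)

# Two sealed segments of 3000 rows each, matching entries_num in the plan log.
for _ in range(2):
    coll.insert([list(range(3000)),
                 [[random.random() for _ in range(128)] for _ in range(3000)]])
    coll.flush()

# create_index returns once the task is accepted; the build continues on
# IndexCoord, which takes a reference lock on the segments being indexed.
coll.create_index("vec", {"index_type": "IVF_FLAT", "metric_type": "IP",
                          "params": {"nlist": 64}})

coll.compact()  # DataCoord plans a MixCompaction over the still-locked segments
coll.wait_for_compaction_completed(timeout=60)  # times out while the lock is held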

jaime0815 · Jun 28 '22 07:06

cc @xiaocai2333

jaime0815 · Jun 28 '22 07:06

The solution requires modifying the test case; options:

  1. Execute compact only after create index has finished.
  2. Execute create index only after compact has finished.
  3. Increase the timeout to 90s.

A sketch of option 1, combined with option 3, follows.
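
A hedged sketch of option 1 (serializing the index build before compaction), continuing the repro sketch above. utility.wait_for_index_building_complete exists in pymilvus, though its exact signature may vary by version, and "_default_idx" is the default index name seen in the IndexCoord logs:

from pymilvus import utility

coll.create_index("vec", {"index_type": "IVF_FLAT", "metric_type": "IP",
                          "params": {"nlist": 64}})
# Block until IndexCoord finishes the build (and releases its reference lock).
utility.wait_for_index_building_complete(coll.name, index_name="_default_idx")

coll.compact()
coll.wait_for_compaction_completed(timeout=90)  # option 3: raised from the ~60s seen in the logs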

jaime0815 · Jun 28 '22 07:06

@jaime0815 After increasing the timeout to 90s (collection_w.wait_for_compaction_completed(timeout=90)), the pulsar nightly is good now.

But the kafka nightly is still reporting this error.

https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-release-nightly/detail/milvus-release-nightly/125/pipeline

binbinlv · Jul 01 '22 02:07

> @jaime0815 After increasing the timeout to 90s (collection_w.wait_for_compaction_completed(timeout=90)), the pulsar nightly is good now.
>
> But the kafka nightly is still reporting this error.
>
> https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-release-nightly/detail/milvus-release-nightly/125/pipeline

The same issue; try increasing the timeout to 180s.

./mdk-release-125-n-milvus-datanode-7595d5dcc8-8qg75_milvus-ci_datanode-a780b8de5fa45799980bfd5420413e5098fb49659ce7f6743df368149b4ed07e.log:56179:[2022/06/30 20:51:50.979 +00:00] [ERROR] [compactor.go:524] ["complete compaction wrong"] [planID=434273599027150850] [reason="can not be compacted, segment with ID 434273597310369793 has reference lock"] [stack="github.com/milvus-io/milvus/internal/datanode.(*compactionTask).compact\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/compactor.go:524\ngithub.com/milvus-io/milvus/internal/datanode.(*compactionExecutor).executeTask\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/compaction_executor.go:99"]
./mdk-release-125-n-milvus-datacoord-5fb48b684f-mxjgr_milvus-ci_datacoord-68cca3318d52f646551087ac544e6f78110c39051a72504a34d2f0187af6a841.log:65951:[2022/06/30 20:54:25.946 +00:00] [INFO] [segment_reference_manager.go:112] ["add reference lock on segments successfully"] [segIDs="[434273638886146050]"] [nodeID=1]
./mdk-release-125-n-milvus-indexcoord-85b4bf789b-x4b8g_milvus-ci_indexcoord-d24f1b58d529b851e33a970099ec3bb1d5a14f7eb3e4b7fe20d19ad2a6c6f9e9.log:9017:[2022/06/30 20:54:25.713 +00:00] [DEBUG] [index_coord.go:407] ["IndexCoord building index ..."] [segmentID=434273638886146050] [IndexName=_default_idx] [IndexID=434273639030587393] [DataPath="[file/insert_log/434273638034178049/434273638034178050/434273638886146050/103/434273639633256452]"] [TypeParams="[{\"key\":\"dim\",\"value\":\"128\"}]"] [IndexParams="[{\"key\":\"index_type\",\"value\":\"IVF_FLAT\"},{\"key\":\"metric_type\",\"value\":\"IP\"},{\"key\":\"params\",\"value\":\"{\\\"nlist\\\": 64}\"}]"] [numRows=2994] ["field type"=FloatVector]
./mdk-release-125-n-milvus-datacoord-5fb48b684f-mxjgr_milvus-ci_datacoord-68cca3318d52f646551087ac544e6f78110c39051a72504a34d2f0187af6a841.log:65971:[2022/06/30 20:54:26.107 +00:00] [ERROR] [services.go:807] ["failed to complete compaction"] [planID=434273639646363650] [error="can not be compacted, segment with ID 434273638886146050 has reference lock"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*Server).CompleteCompaction\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:807\ngithub.com/milvus-io/milvus/internal/distributed/datacoord.(*Server).CompleteCompaction\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/service.go:296\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_CompleteCompaction_Handler.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5133\ngithub.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/tracing/opentracing/server_interceptors.go:38\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_CompleteCompaction_Handler\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5135\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1282\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1616\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:921"]

jaime0815 · Jul 01 '22 03:07

/assign @binbinlv

jaime0815 · Jul 04 '22 08:07

@jaime0815 After changing the timeout to 180s, the issue still exists.

https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/633/pipeline/141

  1. milvus log: artifacts-milvus-distributed-kafka-master-633-pymilvus-e2e-logs.tar.gz
  2. collection name: compact_SxLvr12w
  3. Failed time: [2022-07-04T15:08:13.113Z] [gw0] [ 26%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_during_index

binbinlv · Jul 05 '22 02:07

/assign

xiaocai2333 · Jul 05 '22 12:07

Compaction will fail directly if the segment reference lock is held.

jaime0815 · Jul 05 '22 12:07

To fix this issue, either catch the error and retry on the client side when compaction fails, or have DataCoord wait (with a timeout) for the lock to be released before compacting. A client-side retry sketch is below.
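
A hedged sketch of the client-side workaround; the retry/backoff policy and function name are illustrative, while MilvusException is the exception type pymilvus raises on RPC errors:

import time
from pymilvus import MilvusException

def compact_with_retry(coll, retries=5, backoff=5.0, timeout=180):
    # Retry when compaction loses the race against an in-flight
    # index build's segment reference lock.
    for attempt in range(retries):
        try:
            coll.compact()
            coll.wait_for_compaction_completed(timeout=timeout)
            return
        except MilvusException:
            if attempt == retries - 1:
                raise
            time.sleep(backoff)  # give IndexCoord time to release the lock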

jaime0815 · Jul 05 '22 12:07

There was a brief discussion with @xiaofan-luan: if a segment holds a reference lock during compaction, the compaction is directly marked as completed, so the task does nothing.

xiaocai2333 · Jul 07 '22 09:07

Now it reproduces in nearly every nightly on both 2.1.0 and master, so setting it as urgent.

binbinlv · Jul 08 '22 02:07

Same issue in latest nightly: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/646/pipeline/124

  1. milvus log: artifacts-milvus-standalone-master-646-pymilvus-e2e-logs.tar.gz

  2. collection name: compact_xLMWwHx1

  3. FAILED time: [2022-07-10T19:03:09.639Z] [gw3] [ 25%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_during_index

binbinlv · Jul 11 '22 02:07

Same issue in latest nightly for master: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/648/pipeline

  1. milvus log: artifacts-milvus-standalone-master-648-pymilvus-e2e-logs.tar.gz

  2. collection name: compact_9YoL0i9A

  3. Failed time: [2022-07-11T19:03:29.695Z] [gw3] [ 25%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_during_index

binbinlv · Jul 12 '22 02:07

Same issue in latest nightly for 2.1.0 branch https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-release-nightly/detail/milvus-release-nightly/148/pipeline

  1. milvus log: milvus-standalone-148-pymilvus-e2e-logs.tar.gz
  2. collection name: compact_Qyo75tD3
  3. Failed time: [2022-07-11T23:47:44.033Z] [gw2] [ 32%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_during_index

binbinlv · Jul 12 '22 02:07

Refer to issue #18190.

xiaocai2333 · Jul 12 '22 06:07

This issue keeps appearing in nearly every nightly; the latest is: https://ci.milvus.io:18080/jenkins/blue/rest/organizations/jenkins/pipelines/milvus-nightly-ci/branches/master/runs/676/nodes/124/steps/162/log/?start=0

  1. milvus log: artifacts-milvus-standalone-master-676-pymilvus-e2e-logs.tar.gz

  2. collection name: compact_rIji3Y2O

  3. Failed time: [2022-07-25T19:04:35.740Z] [gw4] [ 24%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_during_index

binbinlv · Jul 26 '22 02:07

This issue keeps appearing in nearly every nightly; the latest is: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/25/pipeline/123

  1. milvus log: artifacts-milvus-standalone-nightly-25-pymilvus-e2e-logs.tar.gz
  2. collection name: compact_gFARsRGw
  3. Failed time: [2022-08-08T19:02:16.097Z] [gw0] [ 29%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_during_index

binbinlv · Aug 09 '22 01:08

This issue appears again: https://jenkins.milvus.io:18080/blue/rest/organizations/jenkins/pipelines/qa/pipelines/milvus-release-nightly/runs/57/nodes/142/steps/165/log/?start=0

Taking one failed case (test_compact_manual_and_auto) as an example:

  1. milvus log: milvus-standalone-57-pymilvus-e2e-logs.tar.gz
  2. collection name: compact_Jzy7BynZ
  3. Failed timeline: [2022-08-23T23:22:13.408Z] [gw5] [ 14%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_manual_and_auto
  4. Failed case:

    @pytest.mark.tags(CaseLabel.L1)
    def test_compact_manual_and_auto(self):
        """
        target: test compact manual and auto
        method: 1.create with shard_num=1
                2.insert one and flush (11 times)
                3.compact
                4.load and search
        expected: Verify segments info
        """
        # greater than auto-merge threshold 10
        num_of_segment = ct.compact_segment_num_threshold + 1

        # create collection shard_num=1, insert 11 segments, each with one entity
        collection_w = self.collection_insert_multi_segments_one_shard(prefix, num_of_segment=num_of_segment)

        # waiting for auto compaction finished
        collection_w.collection.compaction_id = 0
>       collection_w.wait_for_compaction_completed()

binbinlv · Aug 24 '22 02:08

This is not only a kafka issue, so removing the kafka-only label.

binbinlv · Aug 24 '22 02:08

@jiaoew1991

Hello enwei, I see you changed this case yesterday. Could you help investigate this issue? Thanks.

binbinlv · Aug 24 '22 02:08

The test case I changed yesterday is not right; after discussing with @ThreadDao, the test should replace waiting for compaction completion with a one-minute sleep (sketch below).
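
A hedged sketch of that test-side change, assuming the collection_w test object from the dump above:

import time

# Instead of blocking on compaction state, which can time out while the
# segment reference lock is held, give auto compaction a fixed window.
# collection_w.wait_for_compaction_completed()
time.sleep(60)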

jiaoew1991 · Aug 24 '22 05:08

The issue continues to appear in both master and 2.1.*.

Taking one failed case (test_compact_only_growing_segment) as an example:

  1. Collection name: compact_92BffaT3
  2. milvus log: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/57/artifacts
  3. Failed timeline: [2022-08-24T19:03:19.003Z] [gw2] [ 13%] FAILED testcases/test_compaction.py::TestCompactionParams::test_compact_only_growing_segment

NicoYuan1986 · Aug 25 '22 03:08

The issue continues to appear in both master and 2.1.*.

Taking one failed case (test_compact_manual_and_auto) as an example:

  1. milvus: 2.1.* (a9eb72e5), pymilvus: 2.1.2.dev2

  2. collection name: 'compact_pGNmV856'

  3. log: milvus-standalone-67-pymilvus-e2e-logs.tar.gz

[2022-08-29T00:04:47.692Z] [2022-08-28 23:21:11 - ERROR - pymilvus.decorators]: RPC error: [wait_for_compaction_completed], <MilvusException: (code=1, message=Get compaction state timeout)>, <Time:{'RPC start': '2022-08-28 23:20:11.410363', 'RPC error': '2022-08-28 23:21:11.624625'}> (decorators.py:95)

NicoYuan1986 · Aug 29 '22 08:08

The issue steadily appears every day. Taking one failed case (test_compact_manual_and_auto) as an example:

  1. milvus: 2.1.*, pymilvus: 2.1.2.dev2
  2. collection name: compact_QlHsE95l
  3. log: milvus-standalone-81-pymilvus-e2e-logs.tar.gz
[2022-09-05T00:14:23.791Z] [2022-09-04 23:22:33 - ERROR - pymilvus.decorators]: RPC error: [wait_for_compaction_completed], <MilvusException: (code=1, message=Get compaction state timeout)>, <Time:{'RPC start': '2022-09-04 23:21:33.329105', 'RPC error': '2022-09-04 23:22:33.545751'}> (decorators.py:95)

NicoYuan1986 · Sep 05 '22 06:09

This issue no longer occurs in master (c58a8d8).

NicoYuan1986 · Sep 28 '22 03:09