[Bug]: [Nightly] Auto compaction sometimes runs slowly

Open binbinlv opened this issue 3 years ago • 50 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: latest
- Deployment mode(standalone or cluster): standalone & cluster
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus rc9.22
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Auto compaction (triggered when the segment count reaches 10) could not complete within 60s.

[2021-12-09T00:38:55.908Z] >               raise BaseException(1, "Compact auto-merge more than 60s")
[2021-12-09T00:38:55.908Z] E               BaseException: (1, 'Compact auto-merge more than 60s')

Expected Behavior

Auto compaction (triggered when the segment count reaches 10) should complete within a relatively short time.

Steps To Reproduce

Run nightly: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/256/pipeline/106

Logs: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/256/artifacts

Failed case: testcases/test_compaction.py::TestCompactionOperation::test_compact_threshold_auto_merge

Failed collection name: compact_QmYxDTZS

Timeline:

[2021-12-08T22:07:42.854Z] [gw2] [ 17%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_threshold_auto_merge 

Anything else?

No response

binbinlv avatar Dec 09 '21 03:12 binbinlv

/unassign /assign @XuanYang-cn

congqixia avatar Dec 09 '21 08:12 congqixia

Seems like auto-compaction is off. Please turn the auto-compaction on and test again.

/assign @binbinlv /unassign

XuanYang-cn avatar Dec 09 '21 09:12 XuanYang-cn

/assign @ThreadDao please help to update the latest status of these tests /unassign @binbinlv

yanliang567 avatar Dec 14 '21 02:12 yanliang567

Now auto-compaction is on and the problem has not reappeared.

ThreadDao avatar Dec 14 '21 03:12 ThreadDao

This issue appeared again, so reopening it. Nightly: [standalone: artifacts-milvus-standalone-master-276-pymilvus-e2e-logs.tar.gz] https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/276/artifacts

Test logs: [artifacts-milvus-standalone-pymilvus-pytest-logs.tar.gz] https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/276/artifacts

Failed case: test_compact_threshold_auto_merge

>               raise BaseException(1, "Compact auto-merge more than 60s")
E               BaseException: (1, 'Compact auto-merge more than 60s')

Failed collection name: compact_0O3zxHRH

Timeline: [2021-12-24T15:40:40.730Z] testcases/test_compaction.py::TestCompactionOperation::test_compact_threshold_auto_merge FAILED

binbinlv avatar Dec 25 '21 02:12 binbinlv

The reason is:

The load arrived before auto-compaction finished, so the querynode loaded the segments before they were merged:

1829791:[2021/12/24 15:39:39.347 +00:00] [DEBUG] [task.go:333] ["loadCollectionTask: get collection's all partitionIDs"]
1829899:[2021/12/24 15:39:39.369 +00:00] [DEBUG] [services.go:747] ["success to complete compaction"] [planID=430010631267024898]

And the handoff failed because it happened during the load:

1829941:[2021/12/24 15:39:39.371 +00:00] [DEBUG] [query_coord.go:431] ["watchHandoffSegmentLoop: collection/partition has not been loaded or autoHandoff equal to false, remove req from etcd"]

XuanYang-cn avatar Dec 25 '21 03:12 XuanYang-cn

The case:

    def test_compact_threshold_auto_merge(self):
        """
        target: test num (segment_size < 1/2Max) reaches auto-merge threshold 10
        method: 1.create with shard_num=1
                2.insert flush 10 times (merge threshold 10)
                3.wait for compaction, load
        expected: Get query segments info to verify segments auto-merged into one
        """
        threshold = ct.compact_segment_num_threshold
    
        # create collection shard_num=1, insert 10 segments, each with one entity
        collection_w = self.collection_insert_multi_segments_one_shard(prefix, num_of_segment=threshold)
    
        # Estimated auto-merging takes 30s
        cost = 60
        collection_w.load()
        start = time()
        while True:
            sleep(5)
            segments_info = self.utility_wrap.get_query_segment_info(collection_w.name)[0]
    
            # verify segments reaches threshold, auto-merge ten segments into one
            if len(segments_info) == 1:
                break
            end = time()
            if end - start > cost:
>               raise BaseException(1, "Compact auto-merge more than 60s")
E               BaseException: (1, 'Compact auto-merge more than 60s')

XuanYang-cn avatar Dec 25 '21 03:12 XuanYang-cn

/assign @xige-16 /unassign

XuanYang-cn avatar Dec 25 '21 03:12 XuanYang-cn

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Jan 24 '22 05:01 stale[bot]

@xige-16 is this a duplicate issue?

xiaofan-luan avatar Jan 25 '22 14:01 xiaofan-luan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Feb 24 '22 15:02 stale[bot]

It occurred again in the following nightly: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/weekly-milvus-nightly-ci%2Fweekly-milvus-ci-test/detail/weekly-milvus-ci-test/48/pipeline

  1. Test log: [standalone: milvus-standalone-48-pymilvus-e2e-logs.tar.gz] https://ci.milvus.io:18080/jenkins/job/weekly-milvus-nightly-ci/job/weekly-milvus-ci-test/48/artifact/milvus-standalone-48-pymilvus-e2e-logs.tar.gz

  2. Test case fail time: [2022-02-25T13:51:58.128Z] [gw2] [ 23%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_threshold_auto_merge

  3. Test case:

[2022-02-25T14:44:41.162Z] __________ TestCompactionOperation.test_compact_threshold_auto_merge ___________
[2022-02-25T14:44:41.162Z] [gw2] linux -- Python 3.6.8 /usr/local/bin/python3
[2022-02-25T14:44:41.162Z] 
[2022-02-25T14:44:41.162Z] self = <test_compaction.TestCompactionOperation object at 0x7fb6ddd5f7f0>
[2022-02-25T14:44:41.162Z] 
[2022-02-25T14:44:41.162Z]     @pytest.mark.tags(CaseLabel.L2)
[2022-02-25T14:44:41.162Z]     def test_compact_threshold_auto_merge(self):
[2022-02-25T14:44:41.162Z]         """
[2022-02-25T14:44:41.162Z]         target: test num (segment_size < 1/2Max) reaches auto-merge threshold 10
[2022-02-25T14:44:41.162Z]         method: 1.create with shard_num=1
[2022-02-25T14:44:41.162Z]                 2.insert flush 10 times (merge threshold 10)
[2022-02-25T14:44:41.162Z]                 3.wait for compaction, load
[2022-02-25T14:44:41.162Z]         expected: Get query segments info to verify segments auto-merged into one
[2022-02-25T14:44:41.162Z]         """
[2022-02-25T14:44:41.162Z]         threshold = ct.compact_segment_num_threshold
[2022-02-25T14:44:41.162Z]     
[2022-02-25T14:44:41.162Z]         # create collection shard_num=1, insert 10 segments, each with one entity
[2022-02-25T14:44:41.162Z]         collection_w = self.collection_insert_multi_segments_one_shard(prefix, num_of_segment=threshold)
[2022-02-25T14:44:41.162Z]     
[2022-02-25T14:44:41.162Z]         # Estimated auto-merging takes 30s
[2022-02-25T14:44:41.162Z]         cost = 60
[2022-02-25T14:44:41.162Z]         collection_w.load()
[2022-02-25T14:44:41.162Z]         start = time()
[2022-02-25T14:44:41.162Z]         while True:
[2022-02-25T14:44:41.162Z]             sleep(5)
[2022-02-25T14:44:41.162Z]             segments_info = self.utility_wrap.get_query_segment_info(collection_w.name)[0]
[2022-02-25T14:44:41.162Z]     
[2022-02-25T14:44:41.162Z]             # verify segments reaches threshold, auto-merge ten segments into one
[2022-02-25T14:44:41.162Z]             if len(segments_info) == 1:
[2022-02-25T14:44:41.162Z]                 break
[2022-02-25T14:44:41.162Z]             end = time()
[2022-02-25T14:44:41.162Z]             if end - start > cost:
[2022-02-25T14:44:41.162Z] >               raise BaseException(1, "Compact auto-merge more than 60s")
[2022-02-25T14:44:41.162Z] E               BaseException: (1, 'Compact auto-merge more than 60s')

binbinlv avatar Feb 28 '22 03:02 binbinlv

@xige-16 Could you take a look at this issue? Thanks.

binbinlv avatar Feb 28 '22 04:02 binbinlv

It occurred again in the following nightly: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/372/pipeline/106

  1. Log: artifacts-milvus-standalone-master-372-pymilvus-e2e-logs.tar.gz

  2. Test case fail time: [2022-02-28T19:03:02.582Z] [gw2] [ 22%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_threshold_auto_merge

  3. Collection name: compact_OkYOJVuV

  4. Test case:

       """
      target: test num (segment_size < 1/2Max) reaches auto-merge threshold 10
      method: 1.create with shard_num=1
                     2.insert flush 10 times (merge threshold 10)
                     3.wait for compaction, load
      expected: Get query segments info to verify segments auto-merged into one
      """
     threshold = ct.compact_segment_num_threshold
  
     # create collection shard_num=1, insert 10 segments, each with one entity
     collection_w = self.collection_insert_multi_segments_one_shard(prefix, num_of_segment=threshold)

     # Estimated auto-merging takes 30s
     cost = 60
     collection_w.load()
     start = time()
     while True:
     sleep(5)
     segments_info = self.utility_wrap.get_query_segment_info(collection_w.name)[0]

      # verify segments reaches threshold, auto-merge ten segments into one
      if len(segments_info) == 1:
      break
      end = time()
      if end - start > cost:
      raise BaseException(1, "Compact auto-merge more than 60s")
      BaseException: (1, 'Compact auto-merge more than 60s')

binbinlv avatar Mar 01 '22 02:03 binbinlv

@jaime0815 could you please take a look at this issue? It reproduces frequently these days. /assign @jaime0815 /unassign

yanliang567 avatar Mar 01 '22 02:03 yanliang567

/assign /unassign @jaime0815

xige-16 avatar Mar 01 '22 02:03 xige-16

The reason is:

The load arrived before auto-compaction finished, so the querynode loaded the segments before they were merged:

1829791:[2021/12/24 15:39:39.347 +00:00] [DEBUG] [task.go:333] ["loadCollectionTask: get collection's all partitionIDs"]
1829899:[2021/12/24 15:39:39.369 +00:00] [DEBUG] [services.go:747] ["success to complete compaction"] [planID=430010631267024898]

And the handoff failed because it happened during the load:

1829941:[2021/12/24 15:39:39.371 +00:00] [DEBUG] [query_coord.go:431] ["watchHandoffSegmentLoop: collection/partition has not been loaded or autoHandoff equal to false, remove req from etcd"]

querycoord only adds the collection info to its meta cache after the load finishes. Before that point the compaction had already completed and querycoord had been notified, but when querycoord checked its meta it found that the collection did not exist in memory, so the compacted segment was never loaded into the querynode.

I'll fix it soon.
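
For illustration, here is a minimal Python sketch of the ordering problem described above (hypothetical class and method names; the real querycoord logic lives in Go, e.g. query_coord.go): a handoff that arrives before the collection has been registered in the coordinator's meta is simply removed, so the merged segment never replaces the old ones.

    # Illustrative sketch only; hypothetical names, not the Milvus source.
    class QueryCoordSketch:
        def __init__(self):
            self.loaded_collections = set()   # meta cache, filled only when a load finishes
            self.dropped_handoffs = []

        def handle_handoff(self, collection, compacted_segment):
            # Mirrors the behaviour seen in watchHandoffSegmentLoop: if the collection
            # is not in meta yet, the handoff request is removed instead of retried.
            if collection not in self.loaded_collections:
                self.dropped_handoffs.append((collection, compacted_segment))
                return "removed req from etcd"
            return "handoff accepted"

        def finish_load(self, collection):
            # Meta is only updated at the very end of the load task.
            self.loaded_collections.add(collection)

    coord = QueryCoordSketch()
    # Compaction completes (15:39:39.369) while the load task is still running,
    # so the handoff arrives before finish_load() and is dropped.
    print(coord.handle_handoff("compact_0O3zxHRH", 430010631267024898))
    coord.finish_load("compact_0O3zxHRH")
    # The querynode keeps serving the ten pre-compaction segments, and the test's
    # get_query_segment_info() loop never sees a single merged segment.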

xige-16 avatar Mar 01 '22 02:03 xige-16

It occurred again in the following nightly:

  1. Logs: milvus-distributed-75-pymilvus-e2e-logs.tar.gz
  2. Test failed time: [2022-03-06T06:31:44.414Z] [gw3] [ 23%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_threshold_auto_merge
  3. collection name: compact_L5HV6of5

binbinlv avatar Mar 07 '22 02:03 binbinlv

The reason is that the load arrived before auto-compaction finished, so the querynode loaded the segments before they were merged:

1829791:[2021/12/24 15:39:39.347 +00:00] [DEBUG] [task.go:333] ["loadCollectionTask: get collection's all partitionIDs"]
1829899:[2021/12/24 15:39:39.369 +00:00] [DEBUG] [services.go:747] ["success to complete compaction"] [planID=430010631267024898]

And the handoff failed because it happened during the load:

1829941:[2021/12/24 15:39:39.371 +00:00] [DEBUG] [query_coord.go:431] ["watchHandoffSegmentLoop: collection/partition has not been loaded or autoHandoff equal to false, remove req from etcd"]

querycoord only adds the collection info to its meta cache after the load finishes. Before that point the compaction had already completed and querycoord had been notified, but when querycoord checked its meta it found that the collection did not exist in memory, so the compacted segment was never loaded into the querynode.

I'll fix it soon.

We cannot keep applying patches to fix these issues one by one. For example, when a load operation starts, the collection should already be in the meta with a "loading" state; once the load finishes, the state can be changed to "loaded". I'm really worried that we have patched the logic too many times and made it hard to maintain. (We shouldn't rely on ordering, because in a distributed system we cannot enumerate all the concurrent cases.)
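
A rough Python sketch of that suggestion (hypothetical names, just to illustrate the proposed state machine, not the actual Milvus code): the collection is registered in the meta with a LOADING state as soon as the load starts, so a handoff arriving mid-load is queued instead of discarded.

    from enum import Enum

    class LoadState(Enum):
        LOADING = 1
        LOADED = 2

    class QueryCoordWithStates:
        def __init__(self):
            self.collection_state = {}   # collection name -> LoadState
            self.pending_handoffs = []   # handoffs to apply once the load finishes

        def start_load(self, collection):
            # The collection enters meta *before* any segment is loaded.
            self.collection_state[collection] = LoadState.LOADING

        def finish_load(self, collection):
            self.collection_state[collection] = LoadState.LOADED
            # Apply handoffs that arrived while the load was in flight.
            for coll, segment in [h for h in self.pending_handoffs if h[0] == collection]:
                self.apply_handoff(coll, segment)
            self.pending_handoffs = [h for h in self.pending_handoffs if h[0] != collection]

        def handle_handoff(self, collection, segment):
            state = self.collection_state.get(collection)
            if state is LoadState.LOADED:
                self.apply_handoff(collection, segment)
            elif state is LoadState.LOADING:
                self.pending_handoffs.append((collection, segment))  # keep it, do not drop
            # else: the collection really is not loaded, so removing the request is safe

        def apply_handoff(self, collection, segment):
            print(f"handoff of segment {segment} applied to {collection}")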

xiaofan-luan avatar Mar 07 '22 03:03 xiaofan-luan

The test case "test_compact_after_binary_index" failed for "BaseException: (1, 'Handoff after compact and index cost more than 1s')" in the following nightly:

https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/386/pipeline/

I think the root cause is the same as this issue; @xige-16 could you help confirm? Thanks.

1. Log: artifacts-milvus-distributed-master-386-pymilvus-e2e-logs.tar.gz

2. Collection name: "compact_igouk2W2"

3. Test case failed time: [2022-03-07T19:03:52.610Z] [gw0] [ 12%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_after_binary_index

4. Test case:

    def test_compact_after_binary_index(self):
        """
        target: test compact after create index
        method: 1.insert binary data into two segments
                2.create binary index
                3.compact
                4.search
        expected: Verify segment info and index info
        """
        # create collection with 1 shard and insert 2 segments
        collection_w = self.init_collection_wrap(name=cf.gen_unique_str(prefix), shards_num=1,
                                                 schema=cf.gen_default_binary_collection_schema())
        for i in range(2):
            df, _ = cf.gen_default_binary_dataframe_data()
            collection_w.insert(data=df)
            assert collection_w.num_entities == (i + 1) * ct.default_nb

        # create index
        collection_w.create_index(ct.default_binary_vec_field_name, ct.default_binary_index)
        log.debug(collection_w.index())

        # load and search
        collection_w.load()
        search_params = {"metric_type": "JACCARD", "params": {"nprobe": 32}}
        search_res_one, _ = collection_w.search(df[ct.default_binary_vec_field_name][:ct.default_nq].to_list(),
                                                ct.default_binary_vec_field_name, search_params, ct.default_limit)

        # compact
        collection_w.compact()
        collection_w.wait_for_compaction_completed()
        c_plans = collection_w.get_compaction_plans(check_task=CheckTasks.check_merge_compact)[0]

        # waiting for handoff completed and search
        cost = 1
        start = time()
        while True:
            sleep(5)
            segment_info = self.utility_wrap.get_query_segment_info(collection_w.name)[0]
            if len(segment_info) != 0 and segment_info[0].segmentID == c_plans.plans[0].target:
                log.debug(segment_info)
                break
            if time() - start > cost:
                raise BaseException(1, f"Handoff after compact and index cost more than {cost}s")

And could you help to confirm whether the "1s" threshold is reasonable?

binbinlv avatar Mar 08 '22 02:03 binbinlv

Now the test case waits only 1 second for the handoff to complete. That is reasonable when only one test case is running, but in the CI environment there are 6 workers running different cases. Since the hand-off operation may be blocked by query/search tasks that were put into the QueryChannel earlier, it has to wait until those Search/Query operations are done. Maybe we should adjust this timeout to a more reasonable value. @binbinlv @ThreadDao
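
On the test side, a hedged sketch of what a more tolerant wait could look like (the helper name and the 180s value are made up for illustration; `utility.get_query_segment_info` is the existing pymilvus call the cases already use, and an open connection is assumed):

    from time import time, sleep
    from pymilvus import utility

    def wait_for_single_segment(collection_name, timeout=180, interval=5):
        """Poll until the query nodes report exactly one (merged) segment,
        or raise after `timeout` seconds. 180 is only an example value meant
        to tolerate a busy CI environment with several workers."""
        start = time()
        while True:
            segments = utility.get_query_segment_info(collection_name)
            if len(segments) == 1:
                return segments
            if time() - start > timeout:
                raise TimeoutError(f"segments not merged within {timeout}s, "
                                   f"{len(segments)} segments still reported")
            sleep(interval)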

congqixia avatar Mar 08 '22 08:03 congqixia

Auto compaction was slow again in the following nightly tests: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/398/pipeline

  1. Log: artifacts-milvus-standalone-master-398-pymilvus-e2e-logs.tar.gz

  2. Test case fail time: [2022-03-13T19:03:10.131Z] [gw2] [ 23%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_threshold_auto_merge

  3. Collection name: compact_bf6sM5XX

binbinlv avatar Mar 14 '22 02:03 binbinlv

The reason is that the load arrived before auto-compaction finished, so the querynode loaded the segments before they were merged:

1829791:[2021/12/24 15:39:39.347 +00:00] [DEBUG] [task.go:333] ["loadCollectionTask: get collection's all partitionIDs"]
1829899:[2021/12/24 15:39:39.369 +00:00] [DEBUG] [services.go:747] ["success to complete compaction"] [planID=430010631267024898]

And the handoff failed because it happened during the load:

1829941:[2021/12/24 15:39:39.371 +00:00] [DEBUG] [query_coord.go:431] ["watchHandoffSegmentLoop: collection/partition has not been loaded or autoHandoff equal to false, remove req from etcd"]

querycoord only adds the collection info to its meta cache after the load finishes. Before that point the compaction had already completed and querycoord had been notified, but when querycoord checked its meta it found that the collection did not exist in memory, so the compacted segment was never loaded into the querynode. I'll fix it soon.

We cannot keep applying patches to fix these issues one by one. For example, when a load operation starts, the collection should already be in the meta with a "loading" state; once the load finishes, the state can be changed to "loaded". I'm really worried that we have patched the logic too many times and made it hard to maintain. (We shouldn't rely on ordering, because in a distributed system we cannot enumerate all the concurrent cases.)

@xiaofan-luan @xige-16 so what is the new solution here, any progress?

binbinlv avatar Mar 14 '22 02:03 binbinlv

@xige-16 we should fix this by adding a lease on the datacoord side to maintain data existence while compaction happens.
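
One way to read that suggestion is sketched below in Python (hypothetical API, not the datacoord implementation): compacted-away segments keep a time-bounded lease so their data stays available until the handoff is acknowledged or the lease expires, regardless of the order in which load and compaction complete.

    import time

    class SegmentLeaseTable:
        """Sketch of a lease keeping pre-compaction segments alive until the
        query side confirms it has switched over to the merged segment."""

        def __init__(self, ttl_seconds=300):
            self.ttl = ttl_seconds
            self.leases = {}   # old_segment_id -> (new_segment_id, expiry timestamp)

        def grant(self, old_segment_id, new_segment_id):
            # Called when a compaction plan completes on the datacoord side.
            self.leases[old_segment_id] = (new_segment_id, time.time() + self.ttl)

        def release(self, old_segment_id):
            # Called when querycoord acknowledges the handoff.
            self.leases.pop(old_segment_id, None)

        def can_drop(self, old_segment_id):
            # Old segment data may only be dropped once its lease is released or expired.
            lease = self.leases.get(old_segment_id)
            return lease is None or time.time() > lease[1]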

xiaofan-luan avatar Mar 14 '22 04:03 xiaofan-luan

Related to #15662.

xige-16 avatar Mar 15 '22 09:03 xige-16

@xige-16 so what is the expected behaviour?

xiaofan-luan avatar Mar 28 '22 07:03 xiaofan-luan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Apr 27 '22 07:04 stale[bot]

Keeping this open for verification (no fix has been merged; we will verify whether it has reproduced recently).

binbinlv avatar May 05 '22 02:05 binbinlv

It has not reproduced for a while, so closing it for now; will reopen if it occurs again.

binbinlv avatar May 24 '22 03:05 binbinlv

It reproduced in the latest nightly, so reopening it.

binbinlv avatar Jun 21 '22 02:06 binbinlv