[Bug]: [Nightly] Auto compaction sometimes runs slowly
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: latest
- Deployment mode(standalone or cluster): standalone & cluster
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus rc9.22
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
After the number of small segments reached the auto-merge threshold (10), compaction could not complete within 60s:
[2021-12-09T00:38:55.908Z] > raise BaseException(1, "Compact auto-merge more than 60s")
[2021-12-09T00:38:55.908Z] E BaseException: (1, 'Compact auto-merge more than 60s')
Expected Behavior
After the number of small segments reaches the auto-merge threshold (10), compaction should complete within a relatively short time.
Steps To Reproduce
Run nightly: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/256/pipeline/106
Logs: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/256/artifacts
Failed case: testcases/test_compaction.py::TestCompactionOperation::test_compact_threshold_auto_merge
Failed collection name: compact_QmYxDTZS
Timeline:
[2021-12-08T22:07:42.854Z] [gw2] [ 17%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_threshold_auto_merge
Anything else?
No response
/unassign /assign @XuanYang-cn
Seems like auto-compaction is off. Please turn the auto-compaction on and test again.
/assign @binbinlv /unassign
/assign @ThreadDao please help to update the latest status of these tests /unassign @binbinlv
Now auto-compaction is on and the problem has not reappeared.
This issue appeared again, so reopening it.
- Nightly (standalone): artifacts-milvus-standalone-master-276-pymilvus-e2e-logs.tar.gz, https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/276/artifacts
- Test logs: artifacts-milvus-standalone-pymilvus-pytest-logs.tar.gz, https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/276/artifacts
Failed case: test_compact_threshold_auto_merge
> raise BaseException(1, "Compact auto-merge more than 60s")
E BaseException: (1, 'Compact auto-merge more than 60s')
Failed collection name: compact_0O3zxHRH
Timeline: [2021-12-24T15:40:40.730Z] testcases/test_compaction.py::TestCompactionOperation::test_compact_threshold_auto_merge FAILED
The reason is:
Load arrived before auto-compaction finished, so the QueryNode loaded the segments before they were merged.
1829791:[2021/12/24 15:39:39.347 +00:00] [DEBUG] [task.go:333] ["loadCollectionTask: get collection's all partitionIDs"]
1829899:[2021/12/24 15:39:39.369 +00:00] [DEBUG] [services.go:747] ["success to complete compaction"] [planID=430010631267024898]
And the handoff failed because it happened during the load:
1829941:[2021/12/24 15:39:39.371 +00:00] [DEBUG] [query_coord.go:431] ["watchHandoffSegmentLoop: collection/partition has not been loaded or autoHandoff equal to false, remove req from etcd"]
The case:

    def test_compact_threshold_auto_merge(self):
        """
        target: test num (segment_size < 1/2Max) reaches auto-merge threshold 10
        method: 1.create with shard_num=1
                2.insert flush 10 times (merge threshold 10)
                3.wait for compaction, load
        expected: Get query segments info to verify segments auto-merged into one
        """
        threshold = ct.compact_segment_num_threshold

        # create collection shard_num=1, insert 10 segments, each with one entity
        collection_w = self.collection_insert_multi_segments_one_shard(prefix, num_of_segment=threshold)

        # Estimated auto-merging takes 30s
        cost = 60
        collection_w.load()
        start = time()
        while True:
            sleep(5)
            segments_info = self.utility_wrap.get_query_segment_info(collection_w.name)[0]

            # verify segments reaches threshold, auto-merge ten segments into one
            if len(segments_info) == 1:
                break
            end = time()
            if end - start > cost:
>               raise BaseException(1, "Compact auto-merge more than 60s")
E               BaseException: (1, 'Compact auto-merge more than 60s')
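For what it's worth, the wait loop above could be factored into a reusable helper with an explicit timeout, using the same pymilvus utility API the test already calls (the helper name and standalone form are hypothetical; a sketch only):

    import time

    from pymilvus import utility

    def wait_for_auto_merge(collection_name, timeout=60, interval=5):
        # Poll query segment info until all segments are merged into one.
        # Mirrors the nightly test's wait loop, but raises on timeout instead
        # of relying on a trailing end-of-loop check.
        start = time.time()
        while time.time() - start < timeout:
            time.sleep(interval)
            segments_info = utility.get_query_segment_info(collection_name)
            if len(segments_info) == 1:
                return time.time() - start
        raise TimeoutError(f"Compact auto-merge took more than {timeout}s")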
/assign @xige-16 /unassign
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
@xige-16 is this a duplicate issue?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
It occurs again in the following nightly: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/weekly-milvus-nightly-ci%2Fweekly-milvus-ci-test/detail/weekly-milvus-ci-test/48/pipeline
- Test log (standalone): milvus-standalone-48-pymilvus-e2e-logs.tar.gz, https://ci.milvus.io:18080/jenkins/job/weekly-milvus-nightly-ci/job/weekly-milvus-ci-test/48/artifact/milvus-standalone-48-pymilvus-e2e-logs.tar.gz
- Test case fail time: [2022-02-25T13:51:58.128Z] [gw2] [ 23%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_threshold_auto_merge
- Test case:
    __________ TestCompactionOperation.test_compact_threshold_auto_merge ___________
    [gw2] linux -- Python 3.6.8 /usr/local/bin/python3

    self = <test_compaction.TestCompactionOperation object at 0x7fb6ddd5f7f0>

        @pytest.mark.tags(CaseLabel.L2)
        def test_compact_threshold_auto_merge(self):
            """
            target: test num (segment_size < 1/2Max) reaches auto-merge threshold 10
            method: 1.create with shard_num=1
                    2.insert flush 10 times (merge threshold 10)
                    3.wait for compaction, load
            expected: Get query segments info to verify segments auto-merged into one
            """
            threshold = ct.compact_segment_num_threshold

            # create collection shard_num=1, insert 10 segments, each with one entity
            collection_w = self.collection_insert_multi_segments_one_shard(prefix, num_of_segment=threshold)

            # Estimated auto-merging takes 30s
            cost = 60
            collection_w.load()
            start = time()
            while True:
                sleep(5)
                segments_info = self.utility_wrap.get_query_segment_info(collection_w.name)[0]

                # verify segments reaches threshold, auto-merge ten segments into one
                if len(segments_info) == 1:
                    break
                end = time()
                if end - start > cost:
    >               raise BaseException(1, "Compact auto-merge more than 60s")
    E               BaseException: (1, 'Compact auto-merge more than 60s')
@xige-16 Could you help to have a look at this issue, thanks.
It occurs again in the following nightly: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/372/pipeline/106
- Log: artifacts-milvus-standalone-master-372-pymilvus-e2e-logs.tar.gz
- Test case fail time: [2022-02-28T19:03:02.582Z] [gw2] [ 22%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_threshold_auto_merge
- Collection name: compact_OkYOJVuV
- Test case:
"""
target: test num (segment_size < 1/2Max) reaches auto-merge threshold 10
method: 1.create with shard_num=1
2.insert flush 10 times (merge threshold 10)
3.wait for compaction, load
expected: Get query segments info to verify segments auto-merged into one
"""
threshold = ct.compact_segment_num_threshold
# create collection shard_num=1, insert 10 segments, each with one entity
collection_w = self.collection_insert_multi_segments_one_shard(prefix, num_of_segment=threshold)
# Estimated auto-merging takes 30s
cost = 60
collection_w.load()
start = time()
while True:
sleep(5)
segments_info = self.utility_wrap.get_query_segment_info(collection_w.name)[0]
# verify segments reaches threshold, auto-merge ten segments into one
if len(segments_info) == 1:
break
end = time()
if end - start > cost:
raise BaseException(1, "Compact auto-merge more than 60s")
BaseException: (1, 'Compact auto-merge more than 60s')
@jaime0815 could you please take a look at this issue? It reproduces frequently these days. /assign @jaime0815 /unassign
/assign /unassign @jaime0815
The reason is:
Load arrived before auto-compaction finished, so the QueryNode loaded the segments before they were merged.
1829791:[2021/12/24 15:39:39.347 +00:00] [DEBUG] [task.go:333] ["loadCollectionTask: get collection's all partitionIDs"]
1829899:[2021/12/24 15:39:39.369 +00:00] [DEBUG] [services.go:747] ["success to complete compaction"] [planID=430010631267024898]
And the handoff failed because it happened during the load:
1829941:[2021/12/24 15:39:39.371 +00:00] [DEBUG] [query_coord.go:431] ["watchHandoffSegmentLoop: collection/partition has not been loaded or autoHandoff equal to false, remove req from etcd"]
QueryCoord only adds the collection info to its meta cache when the load collection task completes. Here the compaction completed and QueryCoord was notified before that point, so when QueryCoord checked its meta it found that the collection did not yet exist in memory, and the compacted segment was never loaded into the QueryNode.
I'll fix it soon.
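To make the race concrete, here is a toy Python model of the sequence described above (illustrative only, not the actual Go code in query_coord.go): the handoff is checked against QueryCoord's in-memory meta before the load task has registered the collection, so the request is dropped rather than deferred.

    # Toy model of the handoff race; names are illustrative, not Milvus internals.
    loaded_collections = set()  # QueryCoord's in-memory meta of loaded collections

    def finish_load(collection_id):
        # The collection is registered in meta only when the load completes.
        loaded_collections.add(collection_id)

    def handle_handoff(collection_id, segment_id):
        # Mirrors the watchHandoffSegmentLoop check quoted above: an unknown
        # collection causes the handoff request to be removed, not queued.
        if collection_id not in loaded_collections:
            print(f"drop handoff for segment {segment_id}: collection not loaded")
            return False
        print(f"handoff segment {segment_id}")
        return True

    # Compaction completes and hands off before the load finishes, so the
    # compacted segment is never delivered to the QueryNode:
    handle_handoff(1, 430010631267024898)  # dropped
    finish_load(1)                         # load completes too late; no retry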
It occurs again in the following nightly:
- Logs: milvus-distributed-75-pymilvus-e2e-logs.tar.gz
- Test failed time: [2022-03-06T06:31:44.414Z] [gw3] [ 23%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_threshold_auto_merge
- collection name: compact_L5HV6of5
The reason is:
Load arrived before auto-compaction finished, so the QueryNode loaded the segments before they were merged.
1829791:[2021/12/24 15:39:39.347 +00:00] [DEBUG] [task.go:333] ["loadCollectionTask: get collection's all partitionIDs"]
1829899:[2021/12/24 15:39:39.369 +00:00] [DEBUG] [services.go:747] ["success to complete compaction"] [planID=430010631267024898]
And the handoff failed because it happened during the load:
1829941:[2021/12/24 15:39:39.371 +00:00] [DEBUG] [query_coord.go:431] ["watchHandoffSegmentLoop: collection/partition has not been loaded or autoHandoff equal to false, remove req from etcd"]
QueryCoord only adds the collection info to its meta cache when the load collection task completes; here the compaction completed and QueryCoord was notified before that point, so the collection meta did not yet exist in QueryCoord's memory and the compacted segment was not loaded into the QueryNode.
I'll fix it soon.
We cannot apply patch after patch to fix these issues one by one. For example, when a load operation starts, the collection should already be in meta with a "loading" state; once the load finishes, the state can be changed to "loaded". I'm really worried we have patched the logic too many times and made it hard to maintain. (We shouldn't rely on ordering, because in a distributed system we cannot enumerate all the concurrent cases.)
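A rough sketch of that suggestion in Python (purely illustrative; the real fix would live in QueryCoord's Go code): register the collection in meta with a "loading" state as soon as the load starts, and defer handoffs instead of dropping them.

    from enum import Enum

    class LoadState(Enum):
        LOADING = 1
        LOADED = 2

    meta = {}              # collection_id -> LoadState
    pending_handoffs = []  # handoff requests deferred while loading

    def start_load(collection_id):
        # Register the collection immediately so concurrent events can see it.
        meta[collection_id] = LoadState.LOADING

    def handle_handoff(collection_id, segment_id):
        state = meta.get(collection_id)
        if state is LoadState.LOADING:
            pending_handoffs.append((collection_id, segment_id))  # defer, don't drop
        elif state is LoadState.LOADED:
            print(f"handoff segment {segment_id}")

    def finish_load(collection_id):
        meta[collection_id] = LoadState.LOADED
        # Replay handoffs that arrived while the collection was still loading.
        ready = [h for h in pending_handoffs if h[0] == collection_id]
        for h in ready:
            pending_handoffs.remove(h)
            handle_handoff(*h)

With this shape the outcome no longer depends on whether the compaction notification or the load completion arrives first.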
The test case "test_compact_after_binary_index" failed with "BaseException: (1, 'Handoff after compact and index cost more than 1s')" in the following nightly:
https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/386/pipeline/
I think the root cause is the same as this issue, @xige-16 could you help to confirm? Thanks.
1. Log: artifacts-milvus-distributed-master-386-pymilvus-e2e-logs.tar.gz
2. Collection name: "compact_igouk2W2"
3. Test case failed time: [2022-03-07T19:03:52.610Z] [gw0] [ 12%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_after_binary_index
4. Test case:
    def test_compact_after_binary_index(self):
        """
        target: test compact after create index
        method: 1.insert binary data into two segments
                2.create binary index
                3.compact
                4.search
        expected: Verify segment info and index info
        """
        # create collection with 1 shard and insert 2 segments
        collection_w = self.init_collection_wrap(name=cf.gen_unique_str(prefix), shards_num=1,
                                                 schema=cf.gen_default_binary_collection_schema())
        for i in range(2):
            df, _ = cf.gen_default_binary_dataframe_data()
            collection_w.insert(data=df)
            assert collection_w.num_entities == (i + 1) * ct.default_nb

        # create index
        collection_w.create_index(ct.default_binary_vec_field_name, ct.default_binary_index)
        log.debug(collection_w.index())

        # load and search
        collection_w.load()
        search_params = {"metric_type": "JACCARD", "params": {"nprobe": 32}}
        search_res_one, _ = collection_w.search(df[ct.default_binary_vec_field_name][:ct.default_nq].to_list(),
                                                ct.default_binary_vec_field_name, search_params, ct.default_limit)

        # compact
        collection_w.compact()
        collection_w.wait_for_compaction_completed()
        c_plans = collection_w.get_compaction_plans(check_task=CheckTasks.check_merge_compact)[0]

        # waiting for handoff completed and search
        cost = 1
        start = time()
        while True:
            sleep(5)
            segment_info = self.utility_wrap.get_query_segment_info(collection_w.name)[0]
            if len(segment_info) != 0 and segment_info[0].segmentID == c_plans.plans[0].target:
                log.debug(segment_info)
                break
            if time() - start > cost:
                raise BaseException(1, f"Handoff after compact and index cost more than {cost}s")
And could you help to confirm whether the "1s" threshold is reasonable?
Now the test case waits only 1 second for the handoff to complete. That is reasonable when there is only one test case running, but in the CI environment there are 6 workers running different cases. Since the hand-off operation may be blocked by query/search tasks put into the QueryChannel earlier, it has to wait until those Search/Query operations are done. Maybe we should adjust this timeout to a more reasonable value. @binbinlv @ThreadDao
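One way to make the wait tolerant of CI contention is a configurable timeout with a generous default instead of the hardcoded 1s; a sketch using the same pymilvus API as the test (the helper name and environment variable are hypothetical):

    import os
    import time

    from pymilvus import utility

    def wait_for_handoff(collection_name, target_segment_id, timeout=None, interval=5):
        # Wait until the compacted target segment shows up in query segment info.
        # The timeout can be raised via an environment override in busy CI runs.
        timeout = timeout or float(os.getenv("HANDOFF_TIMEOUT", "60"))
        start = time.time()
        while time.time() - start < timeout:
            time.sleep(interval)
            infos = utility.get_query_segment_info(collection_name)
            if any(info.segmentID == target_segment_id for info in infos):
                return
        raise TimeoutError(f"Handoff after compact took more than {timeout}s")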
Auto compaction was slow again in the following nightly: https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/398/pipeline
- Log: artifacts-milvus-standalone-master-398-pymilvus-e2e-logs.tar.gz
- Test case fail time: [2022-03-13T19:03:10.131Z] [gw2] [ 23%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_threshold_auto_merge
- Collection name: compact_bf6sM5XX
The reason is:
Load arrived before auto-compaction finished, so the QueryNode loaded the segments before they were merged.
1829791:[2021/12/24 15:39:39.347 +00:00] [DEBUG] [task.go:333] ["loadCollectionTask: get collection's all partitionIDs"]
1829899:[2021/12/24 15:39:39.369 +00:00] [DEBUG] [services.go:747] ["success to complete compaction"] [planID=430010631267024898]
And the handoff failed because it happened during the load:
1829941:[2021/12/24 15:39:39.371 +00:00] [DEBUG] [query_coord.go:431] ["watchHandoffSegmentLoop: collection/partition has not been loaded or autoHandoff equal to false, remove req from etcd"]
QueryCoord only adds the collection info to its meta cache when the load collection task completes; here the compaction completed and QueryCoord was notified before that point, so the collection meta did not yet exist in QueryCoord's memory and the compacted segment was not loaded into the QueryNode.
I'll fix it soon.
We cannot apply patch after patch to fix these issues one by one. For example, when a load operation starts, the collection should already be in meta with a "loading" state; once the load finishes, the state can be changed to "loaded". I'm really worried we have patched the logic too many times and made it hard to maintain. (We shouldn't rely on ordering, because in a distributed system we cannot enumerate all the concurrent cases.)
@xiaofan-luan @xige-16 so what is the new solution here, any progress?
@xige-16 we should fix this by adding a lease on the DataCoord side to keep the data alive while compaction happens.
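The lease idea could look roughly like this (a toy sketch, not DataCoord's actual API): a load in progress takes a lease on the segments it is reading, and DataCoord refuses to drop a compacted-away segment until its leases expire.

    import time

    # Toy lease table: segment_id -> expiry timestamp.
    leases = {}

    def grant_lease(segment_id, ttl=30):
        # Called when a loader starts reading a segment.
        leases[segment_id] = time.time() + ttl

    def can_drop(segment_id):
        # Compaction may remove the old segment only once no lease is active,
        # so a concurrent load never sees its data disappear mid-flight.
        expiry = leases.get(segment_id)
        return expiry is None or expiry < time.time()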
Related to #15662.
@xige-16 so what is the expected behaviour?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
Keeping this open for verification. (No fix has been merged for this; we will verify whether it has reproduced recently.)
It has not reproduced for a while, so closing it for now; we will reopen it if it occurs in the future.
It reproduced in the latest nightly, so reopening it.