
[Bug]: Querycoord panic after restarting docker with error `set empty delta channel info to meta of collection 435211660870549505`

Open zhuwenxing opened this issue 2 years ago • 9 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master-20220811-6c3dbf0
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus==2.2.0.dev6
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The search failed after restarting docker

RPC error: [search], <MilvusException: (code=1, message=checkIfLoaded failed when search, collection:sift_128_euclidean, partitions:[], err = GetCollectionInfo failed, collection = sift_128_euclidean, err = err: find no available querycoord, check querycoord state

Search...
, /go/src/github.com/milvus-io/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:259 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/client/client.go:160 github.com/milvus-io/milvus/internal/distributed/querycoord/client.(*Client).ShowCollections
/go/src/github.com/milvus-io/milvus/internal/proxy/meta_cache.go:210 github.com/milvus-io/milvus/internal/proxy.(*MetaCache).GetCollectionInfo
/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:439 github.com/milvus-io/milvus/internal/proxy.checkIfLoaded
/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:201 github.com/milvus-io/milvus/internal/proxy.(*searchTask).PreExecute
/go/src/github.com/milvus-io/milvus/internal/proxy/task_scheduler.go:452 github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask
/usr/local/go/src/runtime/asm_amd64.s:1571 runtime.goexit
)>, <Time:{'RPC start': '2022-08-11 07:04:24.678108', 'RPC error': '2022-08-11 07:04:29.858532'}>
Traceback (most recent call last):
  File "scripts/second_recall_test.py", line 64, in <module>
    search_test(host)
  File "scripts/second_recall_test.py", line 33, in search_test
    res = collection.search(
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 717, in search
    res = conn.search(self._name, data, anns_field, param, limit, expr,
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 96, in handler
    raise e
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 92, in handler
    return func(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 74, in handler
    raise e
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 48, in handler
    return func(self, *args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 451, in search
    return self._execute_search_requests(requests, timeout, **_kwargs)
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 415, in _execute_search_requests
    raise pre_err
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 406, in _execute_search_requests
    raise MilvusException(response.status.error_code, response.status.reason)
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=checkIfLoaded failed when search, collection:sift_128_euclidean, partitions:[], err = GetCollectionInfo failed, collection = sift_128_euclidean, err = err: find no available querycoord, check querycoord state
, /go/src/github.com/milvus-io/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:259 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/client/client.go:160 github.com/milvus-io/milvus/internal/distributed/querycoord/client.(*Client).ShowCollections
/go/src/github.com/milvus-io/milvus/internal/proxy/meta_cache.go:210 github.com/milvus-io/milvus/internal/proxy.(*MetaCache).GetCollectionInfo
/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:439 github.com/milvus-io/milvus/internal/proxy.checkIfLoaded
/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:201 github.com/milvus-io/milvus/internal/proxy.(*searchTask).PreExecute
/go/src/github.com/milvus-io/milvus/internal/proxy/task_scheduler.go:452 github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask
/usr/local/go/src/runtime/asm_amd64.s:1571 runtime.goexit
)>

Expected Behavior

all test cases passed

Steps To Reproduce

see https://github.com/zhuwenxing/milvus/runs/7781667318?check_suite_focus=true

Milvus Log

failed job: https://github.com/zhuwenxing/milvus/runs/7781667318?check_suite_focus=true log: https://github.com/zhuwenxing/milvus/suites/7765344016/artifacts/326509308

Anything else?

No response

zhuwenxing avatar Aug 11 '22 07:08 zhuwenxing

image

zhuwenxing avatar Aug 11 '22 07:08 zhuwenxing

Same for standalone. failed job: https://github.com/zhuwenxing/milvus/runs/7781667170?check_suite_focus=true log: https://github.com/zhuwenxing/milvus/suites/7765344016/artifacts/326509309 image

zhuwenxing avatar Aug 11 '22 07:08 zhuwenxing

https://github.com/zhuwenxing/milvus/suites/7765344016/artifacts/326509309

All the log pages return 404!

weiliu1031 avatar Aug 11 '22 09:08 weiliu1031

@weiliu1031 Since I reran the failed job, the log links have changed. You can check the logs below. The error log is in the dir third_deploy.

failed job: https://github.com/zhuwenxing/milvus/actions/runs/2838417006 log: https://github.com/zhuwenxing/milvus/suites/7766597708/artifacts/326598084 https://github.com/zhuwenxing/milvus/suites/7766597708/artifacts/326598083

zhuwenxing avatar Aug 11 '22 09:08 zhuwenxing

/assign @weiliu1031

weiliu1031 avatar Aug 11 '22 09:08 weiliu1031

Some information to sync: query coord panics because showPartitions from root coord returns 0 partitions.

Two more issues need to be tracked down:

  1. Why does root coord return 0 partitions?
  2. Query coord's behavior when it receives wrong information.
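
The second point can be illustrated with a minimal sketch, written in Python for brevity rather than in Milvus's Go code. It assumes a hypothetical `build_delta_channel_info` helper that receives the partition IDs reported by root coord; none of these names come from the Milvus code base. The idea is only to show a coordinator returning an error to its caller instead of panicking when the partition list is unexpectedly empty.

```python
class EmptyPartitionListError(Exception):
    """Raised when no partitions are reported for a collection being loaded."""


def build_delta_channel_info(collection_id, partition_ids, channels):
    # Hypothetical helper, not a Milvus API: build per-channel metadata
    # only after checking that the upstream answer is usable.
    if not partition_ids:
        raise EmptyPartitionListError(
            f"collection {collection_id}: root coord returned 0 partitions"
        )
    return {
        channel: {
            "collection_id": collection_id,
            "partition_ids": list(partition_ids),
        }
        for channel in channels
    }


def try_load(collection_id, partition_ids, channels):
    # Convert the validation failure into an (info, error) pair so the
    # caller can fail one request without taking the whole process down.
    try:
        info = build_delta_channel_info(collection_id, partition_ids, channels)
        return info, None
    except EmptyPartitionListError as err:
        return None, str(err)
```

With this shape, a bad answer from an upstream component fails a single load request and leaves the coordinator running, so later searches would not hit `find no available querycoord`.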

weiliu1031 avatar Aug 11 '22 11:08 weiliu1031

/assign @longjiquan
/unassign

weiliu1031 avatar Aug 11 '22 12:08 weiliu1031

Caused by #18546. After discussing with @jaime0815, I removed the deprecated partitions in the collection info stored in etcd and ignored compatibility, since it's not released yet.

longjiquan avatar Aug 11 '22 12:08 longjiquan

/assign @zhuwenxing

longjiquan avatar Aug 11 '22 12:08 longjiquan

The upgrade is from v2.0.1 to master-latest, not between daily builds, so it needs more investigation.

/unassign
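
One possible reading of the comments above, not confirmed in this thread, is that metadata written by an older release still carries its partitions inside the collection info, so a reader that ignores those fields sees 0 partitions after an upgrade. The sketch below shows a backward-compatible read under that assumption. The dictionary layout, the field names, and `list_partition_ids` are all hypothetical and do not reflect the actual Milvus schema.

```python
def list_partition_ids(collection_meta, partition_records):
    """Return partition IDs, preferring separately stored partition records.

    collection_meta: dict decoded from stored collection info (hypothetical layout).
    partition_records: list of dicts stored apart from the collection (hypothetical).
    """
    if partition_records:
        return [record["partition_id"] for record in partition_records]
    # Fall back to the legacy in-collection field that an older version
    # may have written; return an empty list if neither source has data.
    return list(collection_meta.get("partition_ids", []))
```

A reader with this fallback returns the legacy partitions for old metadata and only reports an empty list when both sources are empty.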

zhuwenxing avatar Aug 15 '22 03:08 zhuwenxing

It reproduces consistently.

zhuwenxing avatar Aug 15 '22 08:08 zhuwenxing

version: milvusdb/milvus-dev:longjiquan-debug-meta-partitions-229f0164a-20220815 failed job: https://github.com/zhuwenxing/milvus/runs/7835788538?check_suite_focus=true log: https://github.com/zhuwenxing/milvus/suites/7814773050/artifacts/330043167

zhuwenxing avatar Aug 15 '22 11:08 zhuwenxing

@zhuwenxing try to reproduce on 2.1.4->master

yanliang567 avatar Oct 14 '22 02:10 yanliang567

/assign @zhuwenxing Please help to reproduce this again. thx, @zhuwenxing

longjiquan avatar Oct 18 '22 03:10 longjiquan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Nov 17 '22 10:11 stale[bot]