
[Bug]: milvus-querynode pod hangs, and it hangs again after ~20 minutes of running when the hung pod is restarted

Open zhoushewei opened this issue 2 years ago • 11 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: v2.1.0
- Deployment mode (standalone or cluster): cluster
- SDK version (e.g. pymilvus v2.0.0rc2): 2.0.2
- OS (Ubuntu or CentOS): Ubuntu 18.04
- CPU/Memory: 6 cores / 94 GB
- GPU: 
- Others:

Current Behavior

This is a freshly installed environment that had been running normally for 2 days. CPU usage was below 20%, memory below 50%, load average: 7.08. The Milvus database holds 20 collections with 3000+ dimensions and roughly 100,000 entities in total. The milvus-querynode container suddenly hung; after the failure we restarted the milvus-querynode container, it ran normally for about 20 minutes and then hung again; after pulling it up once more it is temporarily normal.

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

Application-side error messages (Python application; business details removed):

2022-08-03 09:45:58.677 | INFO | data_pack:131 - **********
2022-08-03 09:46:39.385 | ERROR | kafka_consumer:220 - <BaseException: (code=1, message=err: failed to connect 172.27.0.10:21123, reason: context deadline exceeded ,
/go/src/github.com/milvus-io/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call
/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search
/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:445 github.com/milvus-io/milvus/internal/proxy.(*searchTask).searchShard.func1
/go/src/github.com/milvus-io/milvus/internal/proxy/task_policies.go:55 github.com/milvus-io/milvus/internal/proxy.roundRobinPolicy
/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:465 github.com/milvus-io/milvus/internal/proxy.(*searchTask).searchShard
/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:324 github.com/milvus-io/milvus/internal/proxy.(*searchTask).Execute.func1.1
/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57 golang.org/x/sync/errgroup.(*Group).Go.func1
/usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit )>
2022-08-03 09:47:15.265 | ERROR | kafka_consumer:220 - <BaseException: (code=1, message=All attempts results:
attempt #1:fail to get shard leaders from QueryCoord: no replica available
attempt #2:fail to get shard leaders from QueryCoord: no replica available
attempt #3:fail to get shard leaders from QueryCoord: no replica available
attempt #4:fail to get shard leaders from QueryCoord: no replica available
attempt #5:fail to get shard leaders from QueryCoord: no replica available
attempt #6:fail to get shard leaders from QueryCoord: no replica available
attempt #7:fail to get shard leaders from QueryCoord: no replica available
attempt #8:context deadline exceeded )>
........ERROR.....................
2022-08-03 09:49:32.845 | ERROR | kafka_consumer:220 - <BaseException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=22, reason=ShardCluster for by-dev-rootcoord-dml_5_434981356418629633v1 replicaID 434981333510389770 is no available)>

milvus-querynode container error messages:

[2022/08/03 01:46:01.305 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435024794290851788] [segmentID=435017738016522241] [segmentType=Growing] [loadIndex=false],
terminate called after throwing an instance of 'faiss::FaissException',
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["do search done, msgID = 435024794290851796, fromSharedLeader = true, vChannel = by-dev-rootcoord-dml_44_434987254370533377v0, segmentIDs = [434987259691794433] (170ms)"],
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["start reduce search result, msgID = 435024794290851796, fromSharedLeader = false, vChannel = by-dev-rootcoord-dml_44_434987254370533377v0, segmentIDs = [] (171ms)"],
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [querynode/result.go:40] ["shard leader get valid search results"] [numbers=2],
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [querynode/result.go:43] [reduceSearchResultData] ["result No."=0] [nq=1] [topk=50],
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [querynode/result.go:43] [reduceSearchResultData] ["result No."=1] [nq=1] [topk=50],
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [querynode/result.go:135] ["skip duplicated search result"] [count=3],
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["do search done, msgID = 435024794290851796, fromSharedLeader = false, vChannel = by-dev-rootcoord-dml_44_434987254370533377v0, segmentIDs = [] (171ms)"],
[2022/08/03 01:46:01.311 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435024794290851796] [segmentID=435017738016522241] [segmentType=Growing] [loadIndex=false],
[2022/08/03 01:46:01.315 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435024794290851788] [segmentID=435017458584125441] [segmentType=Growing] [loadIndex=false],
[2022/08/03 01:46:01.321 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435024794290851788] [segmentID=434987259691794433] [segmentType=Sealed] [loadIndex=true],
[2022/08/03 01:46:01.322 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435024794290851796] [segmentID=434987272261861377] [segmentType=Sealed] [loadIndex=true],
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["do search done, msgID = 435024794290851788, fromSharedLeader = true, vChannel = by-dev-rootcoord-dml_44_434987254370533377v0, segmentIDs = [434987259691794433] (374ms)"],
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["start reduce search result, msgID = 435024794290851788, fromSharedLeader = false, vChannel = by-dev-rootcoord-dml_44_434987254370533377v0, segmentIDs = [] (374ms)"],
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [querynode/result.go:40] ["shard leader get valid search results"] [numbers=2],
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [querynode/result.go:43] [reduceSearchResultData] ["result No."=0] [nq=1] [topk=20],
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [querynode/result.go:43] [reduceSearchResultData] ["result No."=1] [nq=1] [topk=20],
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [querynode/result.go:135] ["skip duplicated search result"] [count=1],
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["do search done, msgID = 435024794290851788, fromSharedLeader = false, vChannel = by-dev-rootcoord-dml_44_434987254370533377v0, segmentIDs = [] (375ms)"],
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["do search done, msgID = 435024794290851796, fromSharedLeader = true, vChannel = by-dev-rootcoord-dml_45_434987254370533377v1, segmentIDs = [434987272261861377] (187ms)"],
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["start reduce search result, msgID = 435024794290851796, fromSharedLeader = false, vChannel = by-dev-rootcoord-dml_45_434987254370533377v1, segmentIDs = [] (188ms)"],
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [querynode/result.go:40] ["shard leader get valid search results"] [numbers=2],
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [querynode/result.go:43] [reduceSearchResultData] ["result No."=0] [nq=1] [topk=50],
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [querynode/result.go:43] [reduceSearchResultData] ["result No."=1] [nq=1] [topk=50],
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [querynode/result.go:135] ["skip duplicated search result"] [count=3],
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["do search done, msgID = 435024794290851796, fromSharedLeader = false, vChannel = by-dev-rootcoord-dml_45_434987254370533377v1, segmentIDs = [] (188ms)"],
terminate called recursively,

Anything else?

No response

zhoushewei avatar Aug 03 '22 02:08 zhoushewei

The container hung again. Waiting for a solution from the experts, thanks.

[2022/08/03 03:28:56.077 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435026539947045023] [segmentID=434981803007410177] [segmentType=Sealed] [loadIndex=false]

terminate called after throwing an instance of 'faiss::FaissException'

what(): Error in faiss::IndexIVF::search_preassigned(faiss::Index::idx_t, const float*, faiss::Index::idx_t, const idx_t*, const float*, float*, faiss::Index::idx_t*, bool, const faiss::IVFSearchParameters*, faiss::IndexIVFStats*, faiss::BitsetView) const::<lambda(faiss::Index::idx_t, float, float*, faiss::Index::idx_t*, faiss::BitsetView)> at /go/src/github.com/milvus-io/milvus/cmake_build/3rdparty_download/knowhere-subbuild/src/knowhere_ep/thirdparty/faiss/faiss/IndexIVF.cpp:563: Error: 'key < (idx_t)nlist' failed: Invalid key=1554658163683368960 nlist=223

terminate called recursively

terminate called recursively

zhoushewei avatar Aug 03 '22 03:08 zhoushewei

maybe a duplicate of #17277

cydrain avatar Aug 03 '22 04:08 cydrain

Marking it as urgent. @cydrain any ideas for a fix?

xiaofan-luan avatar Aug 04 '22 01:08 xiaofan-luan

@cydrain @xiaofan-luan what's the status of #17353? I found it in the previous issue, but it was never merged.

congqixia avatar Aug 04 '22 02:08 congqixia

> @cydrain @xiaofan-luan what's the status of #17353? I found it in the previous issue, but it was never merged.

As I explained in previous comments, we have not found the root cause, so this PR is not merged.

cydrain avatar Aug 04 '22 02:08 cydrain

The queries use the following parameters:

index type: IVF_SQ8
metric type: L2
nlist: 1024
nprobe: 128

Collected more terminate logs:

log1 querynode.log

[2022/08/03 12:20:33.638 +00:00] [DEBUG] [querynode/validate.go:110] ["read target partitions"] [collectionID=434987254370533377] [partitionIDs="[434987254370533378]"],
[2022/08/03 12:20:33.639 +00:00] [DEBUG] [querynode/validate.go:126] ["validateOnStreamReplica getSegmentIDsByVChannel"] [collectionID=434987254370533377] [vChannel=by-dev-rootcoord-dml_45_434987254370533377v1] [partitionIDs="[434987254370533378]"] [segmentIDs="[435032636368879617]"] [],
[2022/08/03 12:20:33.639 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435035547805696869] [segmentID=435032636368879617] [segmentType=Growing] [loadIndex=false],
[2022/08/03 12:20:33.639 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435035547805696869] [segmentID=435032636368879617] [segmentType=Growing] [loadIndex=false],
[2022/08/03 12:20:33.639 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["searchTask: enqueueEnd (353ms)"],
[2022/08/03 12:20:33.639 +00:00] [DEBUG] [querynode/validate.go:49] ["read target partitions"] [collectionID=434987254370533377] [partitionIDs="[434987254370533378]"],
[2022/08/03 12:20:33.640 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435035547805696869] [segmentID=435031738250428418] [segmentType=Sealed] [loadIndex=true],
terminate called without an active exception,
terminate called recursively,
terminate called recursively,

log2 (similar to the log in this issue)

[2022/08/03 13:09:46.424 +00:00] [DEBUG] [querynode/task_read.go:161] ["query msg can do"] [collectionID=434987254370533377] [sm.GuaranteeTimestamp=2022/08/03 13:09:46.064 +00:00] [serviceTime=2022/08/03 13:09:46.264 +00:00] ["delta milliseconds"=-200] [channel=by-dev-rootcoord-delta_45_434987254370533377v1] [msgID=435035547805863594],
[2022/08/03 13:09:46.424 +00:00] [DEBUG] [querynode/task_read.go:161] ["query msg can do"] [collectionID=434987254370533377] [sm.GuaranteeTimestamp=2022/08/03 13:09:46.214 +00:00] [serviceTime=2022/08/03 13:09:46.264 +00:00] ["delta milliseconds"=-50] [channel=by-dev-rootcoord-delta_45_434987254370533377v1] [msgID=435035547805863602],
[2022/08/03 13:09:46.424 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["searchTask: enqueueEnd (180ms)"],
[2022/08/03 13:09:46.424 +00:00] [DEBUG] [querynode/validate.go:49] ["read target partitions"] [collectionID=434987254370533377] [partitionIDs="[434987254370533378]"],
[2022/08/03 13:09:46.424 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435035547805863602] [segmentID=435032636068724738] [segmentType=Sealed] [loadIndex=true],
[2022/08/03 13:09:46.425 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435035547805863594] [segmentID=435031738250428418] [segmentType=Sealed] [loadIndex=true],
[2022/08/03 13:09:46.426 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["searchTask: enqueueEnd (354ms)"],
[2022/08/03 13:09:46.426 +00:00] [DEBUG] [querynode/validate.go:49] ["read target partitions"] [collectionID=434987254370533377] [partitionIDs="[434987254370533378]"],
terminate called after throwing an instance of 'faiss::FaissException',
  what():  Error in faiss::IndexIVF::search_preassigned(faiss::Index::idx_t, const float*, faiss::Index::idx_t, const idx_t*, const float*, float*, faiss::Index::idx_t*, bool, const faiss::IVFSearchParameters*, faiss::IndexIVFStats*, faiss::BitsetView) const::<lambda(faiss::Index::idx_t, float, float*, faiss::Index::idx_t*, faiss::BitsetView)> at /go/src/github.com/milvus-io/milvus/cmake_build/3rdparty_download/knowhere-subbuild/src/knowhere_ep/thirdparty/faiss/faiss/IndexIVF.cpp:563: Error: 'key < (idx_t)nlist' failed: Invalid key=139928558108992 nlist=447,
,
terminate called recursively,
[2022/08/03 13:09:46.426 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435035547805863594] [segmentID=435032636068724738] [segmentType=Sealed] [loadIndex=true],

log3 querynode-20220804-3.log

[2022/08/04 01:50:49.226 +00:00] [DEBUG] [querynode/validate.go:49] ["read target partitions"] [collectionID=434987254370533377] [partitionIDs="[434987254370533378]"],
[2022/08/04 01:50:49.226 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435048268461524796] [segmentID=435032636068724738] [segmentType=Sealed] [loadIndex=true],
[2022/08/04 01:50:49.226 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["searchTask: enqueueEnd (346ms)"],
[2022/08/04 01:50:49.226 +00:00] [DEBUG] [querynode/validate.go:49] ["read target partitions"] [collectionID=434987254370533377] [partitionIDs="[434987254370533378]"],
[2022/08/04 01:50:49.226 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435048268461524788] [segmentID=435032636068724738] [segmentType=Sealed] [loadIndex=true],
[2022/08/04 01:50:49.226 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435048268461524796] [segmentID=435031738250428418] [segmentType=Sealed] [loadIndex=true],
[2022/08/04 01:50:49.227 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435048268461524796] [segmentID=435032636368879617] [segmentType=Growing] [loadIndex=false],
terminate called after throwing an instance of 'terminate called recursively,
faiss::FaissException',
  what():  

log4

[2022/08/04 02:45:28.243 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435048268461633791] [segmentID=435032636068724738] [segmentType=Sealed] [loadIndex=true],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["searchTask: enqueueEnd (358ms)"],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [querynode/validate.go:49] ["read target partitions"] [collectionID=434987254370533377] [partitionIDs="[434987254370533378]"],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435048268461633793] [segmentID=435031740190294017] [segmentType=Growing] [loadIndex=false],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["searchTask: enqueueEnd (350ms)"],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["searchTask: enqueueEnd (350ms)"],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435048268461633791] [segmentID=435031738250428418] [segmentType=Sealed] [loadIndex=true],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [querynode/validate.go:49] ["read target partitions"] [collectionID=434987254370533377] [partitionIDs="[434987254370533378]"],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [querynode/validate.go:49] ["read target partitions"] [collectionID=434987254370533377] [partitionIDs="[434987254370533378]"],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435048268461633793] [segmentID=435032636068724738] [segmentType=Sealed] [loadIndex=true],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435048268461633793] [segmentID=435031738250428418] [segmentType=Sealed] [loadIndex=true],
terminate called after throwing an instance of 'faiss::FaissExceptionterminate called recursively,

cydrain avatar Aug 04 '22 02:08 cydrain

/assign @cydrain /unassign

yanliang567 avatar Aug 04 '22 02:08 yanliang567

There are 2 issues in this case which cause the process to terminate:

  1. a legacy check failure in faiss (Error: 'key < (idx_t)nlist' failed: Invalid key=139928558108992 nlist=447) -- it must be caused by a memory overflow
  2. the FaissException is thrown inside an OpenMP code block in faiss (faiss/IndexIVF.cpp:563), so it never reaches the caller -- this is a Faiss bug (see the minimal sketch below)
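For the second point, here is a minimal, hedged C++ sketch (not Faiss or Knowhere code) of why an exception that escapes an OpenMP parallel region takes down the whole process instead of surfacing as a failed search:

#include <cstdio>
#include <stdexcept>

// Build with: g++ -fopenmp omp_terminate.cpp
// An exception thrown inside an OpenMP parallel region must be caught
// within that region by the same thread; if it escapes, GCC's runtime
// aborts via std::terminate(), which matches the "terminate called after
// throwing an instance of 'faiss::FaissException'" lines in the logs.
// When several worker threads throw at once, the handler can be
// re-entered, producing "terminate called recursively".
int main() {
    #pragma omp parallel for
    for (int i = 0; i < 8; ++i) {
        // Stand-in for the 'key < (idx_t)nlist' check failing inside
        // search_preassigned()'s parallel section.
        if (i % 2 == 0) {
            throw std::runtime_error("'key < (idx_t)nlist' failed");
        }
    }
    std::puts("never reached: the process aborts inside the loop");
    return 0;
}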

cydrain avatar Aug 04 '22 06:08 cydrain

Tested for memory overflow on Knowhere with valgrind and sanitizers; no findings. val.log

cydrain avatar Aug 04 '22 10:08 cydrain

Tried to build Milvus with Knowhere using the following sanitizer flags:

append_flags( CMAKE_CXX_FLAGS FLAGS "-fsanitize=address" "-fno-omit-frame-pointer" "-fno-stack-protector" "-fno-var-tracking" "-lasan" "-g" )

Launched Milvus in the normal way; Milvus cannot start up and reports this error:

==14409==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.

Launched Milvus with LD_PRELOAD:

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libasan.so.4 ./bin/milvus run standalone > /tmp/md/standalone.log 2>&1 

Milvus still cannot start and reports this error:

==25312==Sanitizer CHECK failed: ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:40 ((allocated_for_dlsym)) < ((kDlsymAllocPoolSize)) (1111, 1024)

Googled solutions for this error; none of them work.

cydrain avatar Aug 05 '22 00:08 cydrain

updated 2022/08/08

Some limitations on investigating this issue:

  1. cannot get the user's data or script
  2. cannot reproduce this issue with randomly generated data: 50,000 vectors (dim=3840), building an IVF_SQ8 L2 index and searching repeatedly

Tried the following ways to debug, but no valuable findings:

  1. Using Address Sanitizer:
     a. Enable sanitize options in Knowhere and run the sift1M IVF_SQ8 benchmark; no memory overflow observed
     b. Enable sanitize options in Knowhere and run the unit tests; no sanitizer errors
     c. Enable sanitize options in segcore and run the unit tests; no Knowhere or Faiss sanitizer errors (sanitize_test_c_api.log, sanitize_test_query.log)
     d. Build Milvus with a local Knowhere version (sanitize options enabled); Milvus cannot launch
  2. Compared the related API search_preassigned() between Knowhere v1.0.1 and v1.1.5; no valuable findings.

cydrain avatar Aug 08 '22 08:08 cydrain

Need to follow the customer's steps to reproduce this issue:

  1. create collection (dim = 4096)
  2. load collection
  3. create index ("IVF_SQ8")
  4. insert one entity
  5. search the whole collection using this inserted vector
  6. if the data in this collection exceeds 10,000 entities, create a new collection
  7. repeat steps 2~6

Generated a Python script, issue_18509.py, following the above steps, but still cannot reproduce this issue.

cydrain avatar Aug 12 '22 10:08 cydrain

This issue occurs because, except for IDMAP and ANNOY, all other Knowhere index types are not thread-safe for search:

  • BIN_IVF
  • IVF_FLAT
  • IVF_PQ
  • IVF_SQ8
  • IVF_HNSW
  • RHNSW_FLAT
  • RHNSW_PQ
  • RHNSW_SQ

But in segcore, all search requests on one segment share the same knowhere::VecIndexPtr. When concurrent search requests use different search parameters, for example 'nprobe', their memory overlaps and gets corrupted. A simplified sketch of this race follows below.
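Here is a hypothetical, heavily simplified C++ illustration of that class of race; the type and field names below are invented for the sketch and are not the actual Knowhere or segcore data structures:

#include <cstddef>
#include <thread>
#include <vector>

// The shared index stores the per-request 'nprobe' in member state before
// using it, so two concurrent searches with different 'nprobe' values can
// observe each other's parameter and index a buffer sized for the smaller
// value -- an out-of-bounds write under the race.
struct SharedIvfLikeIndex {
    std::size_t nprobe = 1;        // shared mutable search parameter
    std::vector<float> scratch;    // shared scratch buffer

    void Search(std::size_t request_nprobe) {
        nprobe = request_nprobe;                   // thread A sets 128 ...
        scratch.assign(nprobe, 0.0f);              // ... thread B shrinks it to 16 ...
        for (std::size_t i = 0; i < nprobe; ++i) { // ... thread A still loops to 128
            scratch[i] += 1.0f;                    // possible out-of-bounds write
        }
    }
};

int main() {
    SharedIvfLikeIndex index;  // one index instance shared per segment, as in segcore
    std::thread big([&]   { for (int i = 0; i < 100000; ++i) index.Search(128); });
    std::thread small([&] { for (int i = 0; i < 100000; ++i) index.Search(16);  });
    big.join();
    small.join();
    return 0;
}

Under this kind of corruption, a bogus coarse-quantizer key such as the key=139928558108992 seen in the Faiss check above is an expected symptom.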

cydrain avatar Aug 16 '22 09:08 cydrain

@yanliang567 this issue has been fixed, please verify. @Martin Li /assign @yanliang567

cydrain avatar Aug 25 '22 03:08 cydrain

/unassign

cydrain avatar Aug 25 '22 03:08 cydrain

waiting for v2.1.2 for verification

yanliang567 avatar Aug 31 '22 11:08 yanliang567

@zhoushewei could you please retry on v2.1.2, which was released a few days ago? /assign @zhoushewei

yanliang567 avatar Sep 20 '22 00:09 yanliang567

Closing for now; not reproduced locally.

yanliang567 avatar Oct 12 '22 01:10 yanliang567