[Bug]: milvus-querynode pod hangs, and hangs again after about 20 minutes of running when the hung pod is restarted
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: V2.1.0
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): 2.0.2
- OS(Ubuntu or CentOS): Ubuntu 18.04
- CPU/Memory: 6C,94GB
- GPU:
- Others:
Current Behavior
This is a freshly installed environment that ran normally for 2 days; CPU below 20%; memory below 50%; load average: 7.08. The Milvus instance holds 20 collections with 3000+ dimensions and roughly 100,000 entities in total. The milvus-querynode container suddenly hung. After the failure we restarted the milvus-querynode container; it ran normally for about 20 minutes and then hung again. Restarting it once more brought it back temporarily.
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
Exception information from the business system (developed in Python; business details removed):
2022-08-03 09:45:58.677 | INFO | data_pack:131 - **********
2022-08-03 09:46:39.385 | ERROR | kafka_consumer:220 - <BaseException: (code=1, message=err: failed to connect 172.27.0.10:21123, reason: context deadline exceeded , /go/src/github.com/milvus-io/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:232 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).Call /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:227 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search /go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:445 github.com/milvus-io/milvus/internal/proxy.(*searchTask).searchShard.func1 /go/src/github.com/milvus-io/milvus/internal/proxy/task_policies.go:55 github.com/milvus-io/milvus/internal/proxy.roundRobinPolicy /go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:465 github.com/milvus-io/milvus/internal/proxy.(*searchTask).searchShard /go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:324 github.com/milvus-io/milvus/internal/proxy.(*searchTask).Execute.func1.1 /go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57 golang.org/x/sync/errgroup.(*Group).Go.func1 /usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit )>
2022-08-03 09:47:15.265 | ERROR | kafka_consumer:220 - <BaseException: (code=1, message=All attempts results: attempt #1:fail to get shard leaders from QueryCoord: no replica available attempt #2:fail to get shard leaders from QueryCoord: no replica available attempt #3:fail to get shard leaders from QueryCoord: no replica available attempt #4:fail to get shard leaders from QueryCoord: no replica available attempt #5:fail to get shard leaders from QueryCoord: no replica available attempt #6:fail to get shard leaders from QueryCoord: no replica available attempt #7:fail to get shard leaders from QueryCoord: no replica available attempt #8:context deadline exceeded )>
........ERROR.....................
2022-08-03 09:49:32.845 | ERROR | kafka_consumer:220 - <BaseException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=22, reason=ShardCluster for by-dev-rootcoord-dml_5_434981356418629633v1 replicaID 434981333510389770 is no available)>
Exception information from the milvus-querynode container:
[2022/08/03 01:46:01.305 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435024794290851788] [segmentID=435017738016522241] [segmentType=Growing] [loadIndex=false],
terminate called after throwing an instance of 'faiss::FaissException',
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["do search done, msgID = 435024794290851796, fromSharedLeader = true, vChannel = by-dev-rootcoord-dml_44_434987254370533377v0, segmentIDs = [434987259691794433] (170ms)"],
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["start reduce search result, msgID = 435024794290851796, fromSharedLeader = false, vChannel = by-dev-rootcoord-dml_44_434987254370533377v0, segmentIDs = [] (171ms)"],
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [querynode/result.go:40] ["shard leader get valid search results"] [numbers=2],
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [querynode/result.go:43] [reduceSearchResultData] ["result No."=0] [nq=1] [topk=50],
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [querynode/result.go:43] [reduceSearchResultData] ["result No."=1] [nq=1] [topk=50],
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [querynode/result.go:135] ["skip duplicated search result"] [count=3],
[2022/08/03 01:46:01.310 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["do search done, msgID = 435024794290851796, fromSharedLeader = false, vChannel = by-dev-rootcoord-dml_44_434987254370533377v0, segmentIDs = [] (171ms)"],
[2022/08/03 01:46:01.311 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435024794290851796] [segmentID=435017738016522241] [segmentType=Growing] [loadIndex=false],
[2022/08/03 01:46:01.315 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435024794290851788] [segmentID=435017458584125441] [segmentType=Growing] [loadIndex=false],
[2022/08/03 01:46:01.321 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435024794290851788] [segmentID=434987259691794433] [segmentType=Sealed] [loadIndex=true],
[2022/08/03 01:46:01.322 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435024794290851796] [segmentID=434987272261861377] [segmentType=Sealed] [loadIndex=true],
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["do search done, msgID = 435024794290851788, fromSharedLeader = true, vChannel = by-dev-rootcoord-dml_44_434987254370533377v0, segmentIDs = [434987259691794433] (374ms)"],
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["start reduce search result, msgID = 435024794290851788, fromSharedLeader = false, vChannel = by-dev-rootcoord-dml_44_434987254370533377v0, segmentIDs = [] (374ms)"],
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [querynode/result.go:40] ["shard leader get valid search results"] [numbers=2],
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [querynode/result.go:43] [reduceSearchResultData] ["result No."=0] [nq=1] [topk=20],
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [querynode/result.go:43] [reduceSearchResultData] ["result No."=1] [nq=1] [topk=20],
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [querynode/result.go:135] ["skip duplicated search result"] [count=1],
[2022/08/03 01:46:01.326 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["do search done, msgID = 435024794290851788, fromSharedLeader = false, vChannel = by-dev-rootcoord-dml_44_434987254370533377v0, segmentIDs = [] (375ms)"],
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["do search done, msgID = 435024794290851796, fromSharedLeader = true, vChannel = by-dev-rootcoord-dml_45_434987254370533377v1, segmentIDs = [434987272261861377] (187ms)"],
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["start reduce search result, msgID = 435024794290851796, fromSharedLeader = false, vChannel = by-dev-rootcoord-dml_45_434987254370533377v1, segmentIDs = [] (188ms)"],
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [querynode/result.go:40] ["shard leader get valid search results"] [numbers=2],
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [querynode/result.go:43] [reduceSearchResultData] ["result No."=0] [nq=1] [topk=50],
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [querynode/result.go:43] [reduceSearchResultData] ["result No."=1] [nq=1] [topk=50],
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [querynode/result.go:135] ["skip duplicated search result"] [count=3],
[2022/08/03 01:46:01.327 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["do search done, msgID = 435024794290851796, fromSharedLeader = false, vChannel = by-dev-rootcoord-dml_45_434987254370533377v1, segmentIDs = [] (188ms)"],
terminate called recursively,
Anything else?
No response
The container hung again. Waiting for the experts to propose a solution, thanks.
2022/08/03 03:28:56.077 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435026539947045023] [segmentID=434981803007410177] [segmentType=Sealed] [loadIndex=false]
terminate called after throwing an instance of 'faiss::FaissException'
what(): Error in faiss::IndexIVF::search_preassigned(faiss::Index::idx_t, const float*, faiss::Index::idx_t, const idx_t*, const float*, float*, faiss::Index::idx_t*, bool, const faiss::IVFSearchParameters*, faiss::IndexIVFStats*, faiss::BitsetView) const::<lambda(faiss::Index::idx_t, float, float*, faiss::Index::idx_t*, faiss::BitsetView)> at /go/src/github.com/milvus-io/milvus/cmake_build/3rdparty_download/knowhere-subbuild/src/knowhere_ep/thirdparty/faiss/faiss/IndexIVF.cpp:563: Error: 'key < (idx_t)nlist' failed: Invalid key=1554658163683368960 nlist=223
terminate called recursively
terminate called recursively
maybe a duplicate of #17277
Marking it as urgent. @cydrain any ideas for a fix?
@cydrain @xiaofan-luan what's the status of #17353? found it in the previous issue but it was never merged.
As I explained in previous comments, we have not found the root cause, so this PR has not been merged.
The query uses the following parameters (a pymilvus sketch follows this list):
index type: IVF_SQ8
metric type: L2
nlist: 1024
nprobe: 128
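For reference, a minimal pymilvus 2.x sketch that applies these parameters. The collection name, field name, and vector dimension are placeholders; only index_type, metric_type, nlist, and nprobe come from this report.

```python
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
coll = Collection("demo_collection")          # hypothetical collection name

# Index parameters reported in this issue
coll.create_index(
    field_name="embedding",                   # hypothetical vector field
    index_params={
        "index_type": "IVF_SQ8",
        "metric_type": "L2",
        "params": {"nlist": 1024},
    },
)
coll.load()

# Search parameters reported in this issue
results = coll.search(
    data=[[0.0] * 3840],                      # one query vector; dim is assumed
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 128}},
    limit=50,
)
```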
Collected more terminate logs:
log1 querynode.log
[2022/08/03 12:20:33.638 +00:00] [DEBUG] [querynode/validate.go:110] ["read target partitions"] [collectionID=434987254370533377] [partitionIDs="[434987254370533378]"],
[2022/08/03 12:20:33.639 +00:00] [DEBUG] [querynode/validate.go:126] ["validateOnStreamReplica getSegmentIDsByVChannel"] [collectionID=434987254370533377] [vChannel=by-dev-rootcoord-dml_45_434987254370533377v1] [partitionIDs="[434987254370533378]"] [segmentIDs="[435032636368879617]"] [],
[2022/08/03 12:20:33.639 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435035547805696869] [segmentID=435032636368879617] [segmentType=Growing] [loadIndex=false],
[2022/08/03 12:20:33.639 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435035547805696869] [segmentID=435032636368879617] [segmentType=Growing] [loadIndex=false],
[2022/08/03 12:20:33.639 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["searchTask: enqueueEnd (353ms)"],
[2022/08/03 12:20:33.639 +00:00] [DEBUG] [querynode/validate.go:49] ["read target partitions"] [collectionID=434987254370533377] [partitionIDs="[434987254370533378]"],
[2022/08/03 12:20:33.640 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435035547805696869] [segmentID=435031738250428418] [segmentType=Sealed] [loadIndex=true],
terminate called without an active exception,
terminate called recursively,
terminate called recursively,
log2 (similar to the log reported in this issue)
[2022/08/03 13:09:46.424 +00:00] [DEBUG] [querynode/task_read.go:161] ["query msg can do"] [collectionID=434987254370533377] [sm.GuaranteeTimestamp=2022/08/03 13:09:46.064 +00:00] [serviceTime=2022/08/03 13:09:46.264 +00:00] ["delta milliseconds"=-200] [channel=by-dev-rootcoord-delta_45_434987254370533377v1] [msgID=435035547805863594],
[2022/08/03 13:09:46.424 +00:00] [DEBUG] [querynode/task_read.go:161] ["query msg can do"] [collectionID=434987254370533377] [sm.GuaranteeTimestamp=2022/08/03 13:09:46.214 +00:00] [serviceTime=2022/08/03 13:09:46.264 +00:00] ["delta milliseconds"=-50] [channel=by-dev-rootcoord-delta_45_434987254370533377v1] [msgID=435035547805863602],
[2022/08/03 13:09:46.424 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["searchTask: enqueueEnd (180ms)"],
[2022/08/03 13:09:46.424 +00:00] [DEBUG] [querynode/validate.go:49] ["read target partitions"] [collectionID=434987254370533377] [partitionIDs="[434987254370533378]"],
[2022/08/03 13:09:46.424 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435035547805863602] [segmentID=435032636068724738] [segmentType=Sealed] [loadIndex=true],
[2022/08/03 13:09:46.425 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435035547805863594] [segmentID=435031738250428418] [segmentType=Sealed] [loadIndex=true],
[2022/08/03 13:09:46.426 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["searchTask: enqueueEnd (354ms)"],
[2022/08/03 13:09:46.426 +00:00] [DEBUG] [querynode/validate.go:49] ["read target partitions"] [collectionID=434987254370533377] [partitionIDs="[434987254370533378]"],
terminate called after throwing an instance of 'faiss::FaissException',
what(): Error in faiss::IndexIVF::search_preassigned(faiss::Index::idx_t, const float*, faiss::Index::idx_t, const idx_t*, const float*, float*, faiss::Index::idx_t*, bool, const faiss::IVFSearchParameters*, faiss::IndexIVFStats*, faiss::BitsetView) const::<lambda(faiss::Index::idx_t, float, float*, faiss::Index::idx_t*, faiss::BitsetView)> at /go/src/github.com/milvus-io/milvus/cmake_build/3rdparty_download/knowhere-subbuild/src/knowhere_ep/thirdparty/faiss/faiss/IndexIVF.cpp:563: Error: 'key < (idx_t)nlist' failed: Invalid key=139928558108992 nlist=447,
,
terminate called recursively,
[2022/08/03 13:09:46.426 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435035547805863594] [segmentID=435032636068724738] [segmentType=Sealed] [loadIndex=true],
[2022/08/04 01:50:49.226 +00:00] [DEBUG] [querynode/validate.go:49] ["read target partitions"] [collectionID=434987254370533377] [partitionIDs="[434987254370533378]"],
[2022/08/04 01:50:49.226 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435048268461524796] [segmentID=435032636068724738] [segmentType=Sealed] [loadIndex=true],
[2022/08/04 01:50:49.226 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["searchTask: enqueueEnd (346ms)"],
[2022/08/04 01:50:49.226 +00:00] [DEBUG] [querynode/validate.go:49] ["read target partitions"] [collectionID=434987254370533377] [partitionIDs="[434987254370533378]"],
[2022/08/04 01:50:49.226 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435048268461524788] [segmentID=435032636068724738] [segmentType=Sealed] [loadIndex=true],
[2022/08/04 01:50:49.226 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435048268461524796] [segmentID=435031738250428418] [segmentType=Sealed] [loadIndex=true],
[2022/08/04 01:50:49.227 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435048268461524796] [segmentID=435032636368879617] [segmentType=Growing] [loadIndex=false],
terminate called after throwing an instance of 'terminate called recursively,
faiss::FaissException',
what():
log4
[2022/08/04 02:45:28.243 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435048268461633791] [segmentID=435032636068724738] [segmentType=Sealed] [loadIndex=true],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["searchTask: enqueueEnd (358ms)"],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [querynode/validate.go:49] ["read target partitions"] [collectionID=434987254370533377] [partitionIDs="[434987254370533378]"],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [querynode/segment.go:300] ["do search on segment done"] [msgID=435048268461633793] [segmentID=435031740190294017] [segmentType=Growing] [loadIndex=false],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["searchTask: enqueueEnd (350ms)"],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [timerecord/time_recorder.go:78] ["searchTask: enqueueEnd (350ms)"],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435048268461633791] [segmentID=435031738250428418] [segmentType=Sealed] [loadIndex=true],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [querynode/validate.go:49] ["read target partitions"] [collectionID=434987254370533377] [partitionIDs="[434987254370533378]"],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [querynode/validate.go:49] ["read target partitions"] [collectionID=434987254370533377] [partitionIDs="[434987254370533378]"],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435048268461633793] [segmentID=435032636068724738] [segmentType=Sealed] [loadIndex=true],
[2022/08/04 02:45:28.244 +00:00] [DEBUG] [querynode/segment.go:288] ["start do search on segment"] [msgID=435048268461633793] [segmentID=435031738250428418] [segmentType=Sealed] [loadIndex=true],
terminate called after throwing an instance of 'faiss::FaissExceptionterminate called recursively,
/assign @cydrain /unassign
There are 2 issues in this case that cause the process to terminate:
- a legacy check failure in faiss (Error: 'key < (idx_t)nlist' failed: Invalid key=139928558108992 nlist=447) -- it must be caused by memory overflow
- a FaissException is thrown inside an OpenMP code block in faiss (faiss/IndexIVF.cpp:563) -- this is a Faiss bug
Tested for memory overflow with valgrind and sanitizers on Knowhere; no findings. val.log
Tried to build Milvus with Knowhere using the following sanitize parameters:
append_flags( CMAKE_CXX_FLAGS FLAGS "-fsanitize=address" "-fno-omit-frame-pointer" "-fno-stack-protector" "-fno-var-tracking" "-lasan" "-g" )
Launched Milvus in the normal way; Milvus cannot start up and reports this error:
==14409==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
Launched Milvus with the LD_PRELOAD command:
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libasan.so.4 ./bin/milvus run standalone > /tmp/md/standalone.log 2>&1
Milvus still cannot start and reports this error:
==25312==Sanitizer CHECK failed: ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:40 ((allocated_for_dlsym)) < ((kDlsymAllocPoolSize)) (1111, 1024)
Googled for solutions to this error; none of them worked.
updated 2022/08/08
Some limitations on debugging this issue:
- cannot get the user's data and script
- cannot reproduce this issue with randomly generated data: inserting 50000 vectors (dim=3840), creating an IVF_SQ8 L2 index, and searching repeatedly does not reproduce it
Tried the following ways to debug, but no valuable findings:
- Using Address Sanitizer:
  a. Enable sanitize options in Knowhere and run the sift1M IVF_SQ8 benchmark; no memory overflow observed
  b. Enable sanitize options in Knowhere and run the unittests; no sanitize errors
  c. Enable sanitize options in segcore and run the unittests; no knowhere or faiss sanitize errors sanitize_test_c_api.log sanitize_test_query.log
  d. Build Milvus with a local Knowhere version (sanitize options enabled); Milvus cannot launch
- Compared the related API search_preassigned() between knowhere v1.0.1 and v1.1.5; no valuable findings.
Need to follow the customer's steps to reproduce this issue (a hedged sketch of such a script is shown after the list):
- create collection (dim = 4096)
- load collection
- create index ("IVF_SQ8")
- insert one entity
- search the whole collection using this inserted vector
- if the data in this collection exceeds 10000 entities, create a new collection
- repeat step 2~6
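A minimal, hedged sketch of what such a reproduction script could look like (pymilvus 2.x assumed; collection and field names, schema details, and the host are hypothetical; the index is created before load for simplicity):

```python
# Reproduction-loop sketch following the steps above, under stated assumptions.
import random
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

DIM = 4096  # dimension from the customer's steps

def new_collection(name: str) -> Collection:
    # Hypothetical schema: auto-id primary key + one vector field
    fields = [
        FieldSchema("pk", DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=DIM),
    ]
    coll = Collection(name, CollectionSchema(fields))
    coll.create_index("embedding", {
        "index_type": "IVF_SQ8", "metric_type": "L2", "params": {"nlist": 1024},
    })
    coll.load()
    return coll

connections.connect(host="localhost", port="19530")
idx, coll = 0, new_collection("repro_0")
while True:
    vec = [random.random() for _ in range(DIM)]
    coll.insert([[vec]])                      # insert one entity
    coll.search([vec], "embedding",           # search with the inserted vector
                {"metric_type": "L2", "params": {"nprobe": 128}}, limit=50)
    # num_entities may only reflect flushed data; good enough for a sketch
    if coll.num_entities > 10000:             # roll over to a new collection
        idx += 1
        coll = new_collection(f"repro_{idx}")
```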
Generated the Python script issue_18509.py following the above steps, but still cannot reproduce this issue.
- create one collection with 7936 dim
- one thread keeps inserting data one by one
- two threads keep searching
milvus_script.txt issue_18509.py.txt
All terminate logs from the customer: querynode.log querynode-20220803-2.log querynode-20220804-3.log querynode-20220804-6.log
This issue occurs because, except for IDMAP and ANNOY, all other knowhere index types are not thread-safe when searching:
- BIN_IVF
- IVF_FLAT
- IVF_PQ
- IVF_SQ8
- IVF_HNSW
- RHNSW_FLAT
- RHNSW_PQ
- RHNSW_SQ
But in segcore, all search requests on one segment share the same knowhere::VecIndexPtr. When concurrent search requests use different search parameters, for example 'nprobe', the index's internal search buffers overlap and memory is corrupted. A sketch of the client-side pattern that can trigger this is shown below.
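To illustrate the triggering condition, here is a hedged pymilvus sketch: two client threads searching the same loaded collection concurrently with different nprobe values, so both requests hit the same shared VecIndexPtr inside one QueryNode on an affected (pre-fix) version. Collection name, field name, and dimension are placeholders.

```python
# Hypothetical concurrent-search sketch; assumes a collection "demo_collection"
# with an IVF_SQ8 index on field "embedding" (dim=3840) that is already loaded.
import random
import threading
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
coll = Collection("demo_collection")

def search_loop(nprobe: int):
    # Each thread searches with a different nprobe, exercising the
    # non-thread-safe IVF search path on the shared index.
    while True:
        vec = [random.random() for _ in range(3840)]
        coll.search([vec], "embedding",
                    {"metric_type": "L2", "params": {"nprobe": nprobe}},
                    limit=50)

for nprobe in (16, 128):
    threading.Thread(target=search_loop, args=(nprobe,), daemon=True).start()

threading.Event().wait()   # keep the main thread alive
```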
@yanliang567 this issue has been fixed, please verify. @Martin Li /assign @yanliang567
/unassign
waiting for v2.1.2 for verification
@zhoushewei could you please retry on v2.1.2, which was released a few days ago? /assign @zhoushewei
Closing for now; not reproduced locally.