[Bug]: collection has not been loaded to memory or load failed while loading 50M-100M vectors

Open neiblegy opened this issue 2 years ago • 19 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.2.0
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): 2.2.0
- OS(Ubuntu or CentOS): ubuntu 20.04 5.15.0 
- CPU/Memory: 
HOST:
CPU(s):              80
Thread(s) per core:  2
NUMA node(s):        2
CPU max MHz:         3900.0000
Memory:
(8 channel * 2 socket) * 3200MT/s 1TB
CONTAINER:
k8s pod limit: 20 cores + 300GiB

- GPU: 
not used
- Others:

Current Behavior

  1. create cluster with milvus2.2.0-release
  2. insert 50M 256-dim float vectors
  3. build HNSW index with params: ({ 'index_type': 'HNSW','metric_type': 'IP','params': {'M': 16, 'efConstruction':128} })
  4. load with 15 replicas

then error occurred: pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=collection 437543825052155869 has not been loaded to memory or load failed)>

here is the log we exported: milvus220.zip

Note that with only 1M vectors, everything works fine.

Expected Behavior

The collection and its index load successfully.

Steps To Reproduce

1. create cluster with milvus2.2.0-release
2. insert 50M 256-dim float vectors 
3. build HNSW index with params: ({ 'index_type': 'HNSW','metric_type': 'IP','params': {'M': 16, 'efConstruction':128} })
4. load with 15 replicas
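
The index and load steps above can be sketched with pymilvus roughly as follows. This is only an illustration: the vector field name `embedding`, the host, and the port are assumptions (the report does not give them), and the collection is assumed to already exist and contain the inserted vectors. The index parameters and the replica count are taken from the report.

```python
# Index parameters exactly as given in the report.
INDEX_PARAMS = {
    "index_type": "HNSW",
    "metric_type": "IP",
    "params": {"M": 16, "efConstruction": 128},
}
REPLICA_NUMBER = 15  # replica count from step 4


def build_index_and_load(collection_name, vector_field="embedding",
                         host="localhost", port="19530"):
    """Build the HNSW index, then load the collection with 15 replicas.

    vector_field, host and port are hypothetical defaults; the collection
    is assumed to exist already and to contain the vectors.
    """
    # Imported lazily so the constants above can be inspected without pymilvus.
    from pymilvus import Collection, connections

    connections.connect(alias="default", host=host, port=port)
    collection = Collection(collection_name)
    collection.create_index(field_name=vector_field, index_params=INDEX_PARAMS)
    # Step 4 of the report; the MilvusException appeared after this step.
    collection.load(replica_number=REPLICA_NUMBER)
    return collection
```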

Milvus Log

milvus220.zip

Anything else?

No response

neiblegy avatar Nov 23 '22 02:11 neiblegy

Hi @neiblegy, this error means the loading was canceled. Since v2.2.0 there is a default load timeout (10 min), which cancels the load of a collection that has made no progress for 10 minutes. The short answer: increase the loadTimeoutSeconds config value.
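
To illustrate the timeout semantics described here, the following is a minimal sketch of a "no progress" watchdog. It is not Milvus's implementation: the class `LoadWatchdog` and its methods are hypothetical, and only the 10-minute (600 s) default comes from the comment above.

```python
class LoadWatchdog:
    """Flags a load as expired when progress has not advanced for timeout_seconds."""

    def __init__(self, timeout_seconds=600.0):
        self.timeout_seconds = timeout_seconds
        self.last_progress = 0.0
        self.last_update_time = None

    def report(self, progress, now):
        """Record a progress reading (0..100) taken at time `now`, in seconds."""
        # Only a first reading or a strictly higher reading resets the timer.
        if self.last_update_time is None or progress > self.last_progress:
            self.last_progress = progress
            self.last_update_time = now

    def is_expired(self, now):
        """True if progress has not advanced for the whole timeout window."""
        if self.last_update_time is None:
            return False
        return now - self.last_update_time >= self.timeout_seconds


watchdog = LoadWatchdog(timeout_seconds=600)
watchdog.report(progress=0, now=0)      # load starts
watchdog.report(progress=10, now=300)   # progress advanced: timer resets
watchdog.report(progress=10, now=800)   # same progress: timer keeps running
print(watchdog.is_expired(now=899))     # False (599 s since the last advance)
print(watchdog.is_expired(now=900))     # True  (600 s since the last advance)
```

Under this model, raising the timeout only helps if progress eventually advances within the longer window.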

yah01 avatar Nov 23 '22 03:11 yah01

We need to adjust the way we calculate progress. For now, a segment contributes to progress only once it has been loaded in all replicas, but the loading tasks are added replica by replica, so with large data and many replicas it takes a long time before any segment is loaded into all replicas.
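
A small simulation shows why the reported progress can stall under this rule. It is a sketch under stated assumptions: the function name is hypothetical, and the task order (every segment for the first replica, then every segment for the next, and so on) is a simplified model of "one-by-one replica", not the actual scheduler.

```python
def progress_after(tasks_done, num_segments, num_replicas):
    """Percent progress when a segment counts only once loaded in ALL replicas.

    Tasks are assumed to run replica by replica: every segment for replica 0,
    then every segment for replica 1, and so on.
    """
    loaded = [[False] * num_segments for _ in range(num_replicas)]
    for task in range(tasks_done):
        replica, segment = divmod(task, num_segments)
        loaded[replica][segment] = True
    fully_loaded = sum(
        all(loaded[r][s] for r in range(num_replicas))
        for s in range(num_segments)
    )
    return 100 * fully_loaded // num_segments


# 100 segments and 15 replicas give 1500 load tasks in total.
print(progress_after(1400, 100, 15))  # 0: 14 of 15 replicas done, still 0%
print(progress_after(1450, 100, 15))  # 50
print(progress_after(1500, 100, 15))  # 100
```

In this model the progress stays at 0% until the last replica starts loading, which is long enough to trip a no-progress timeout on large collections.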

yah01 avatar Nov 23 '22 03:11 yah01

/assign

yah01 avatar Nov 23 '22 03:11 yah01

We will also add an executor per QueryNode so that loading scales well.

yah01 avatar Nov 23 '22 04:11 yah01

/assign @neiblegy

yanliang567 avatar Nov 23 '22 05:11 yanliang567

We have more than enough resources: 400 cores + 5TB of memory in total, and we need to reach 2K QPS for queries over 100M vectors. I changed the loading timeout, but loading is very slow.

In reply to Xiaofan's comment (Wednesday, 2022-11-23, 11:45), addressed to @neiblegy:

You have to check whether you have enough memory. For 50M 256-dim vectors with 12 replicas you will need at least 500GB+ of memory.
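
The raw-vector arithmetic behind an estimate of this kind can be sketched as follows. It counts only the float32 vector data (4 bytes per dimension) and ignores the HNSW graph and any other overhead, so the real requirement is higher. The function name is hypothetical.

```python
def raw_vector_gb(num_vectors, dim, replicas, bytes_per_value=4):
    """Raw float32 vector size in GB (10**9 bytes) across all replicas."""
    return num_vectors * dim * bytes_per_value * replicas / 1e9


print(raw_vector_gb(50_000_000, 256, 1))   # 51.2  GB for a single copy
print(raw_vector_gb(50_000_000, 256, 12))  # 614.4 GB for 12 replicas
print(raw_vector_gb(50_000_000, 256, 15))  # 768.0 GB for 15 replicas
```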

neiblegy avatar Nov 23 '22 11:11 neiblegy

Milvus seems to be stuck in a retry loop with these errors:

    [2022/11/23 11:04:08.535 +00:00] [WARN] [storage/minio_chunk_manager.go:236] ["failed to stat object"] [path=stats_log/437543825052155869/437543825052155870/437543825058760097/100/437543825058760104] [error="The specified key does not exist."]
    [2022/11/23 11:04:08.580 +00:00] [WARN] [storage/minio_chunk_manager.go:236] ["failed to stat object"] [path=stats_log/437543825052155869/437543825052155870/437543825058760097/100/437543825058960234] [error="The specified key does not exist."]
    [2022/11/23 11:04:08.634 +00:00] [WARN] [storage/minio_chunk_manager.go:236] ["failed to stat object"] [path=stats_log/437543825052155869/437543825052155870/437543825058760097/100/437543825059160331] [error="The specified key does not exist."]
    [2022/11/23 11:04:08.700 +00:00] [WARN] [storage/minio_chunk_manager.go:236] ["failed to stat object"] [path=stats_log/437543825052155869/437543825052155870/437543825058760097/100/437543825059360430] [error="The specified key does not exist."]
    [2022/11/23 11:04:08.742 +00:00] [WARN] [storage/minio_chunk_manager.go:236] ["failed to stat object"] [path=stats_log/437543825052155869/437543825052155870/437543825058760097/100/437543825059360537] [error="The specified key does not exist."]
    [2022/11/23 11:04:08.769 +00:00] [WARN] [storage/minio_chunk_manager.go:236] ["failed to stat object"] [path=stats_log/437543825052155869/437543825052155870/437543825058760097/100/437543825059760643] [error="The specified key does not exist."]
    [2022/11/23 11:04:08.818 +00:00] [WARN] [storage/minio_chunk_manager.go:236] ["failed to stat object"] [path=stats_log/437543825052155869/437543825052155870/437543825058760097/100/437543825059760742] [error="The specified key does not exist."]
    [2022/11/23 11:04:08.870 +00:00] [WARN] [storage/minio_chunk_manager.go:236] ["failed to stat object"] [path=stats_log/437543825052155869/437543825052155870/437543825058760097/100/437543825060160844] [error="The specified key does not exist."]
    [2022/11/23 11:04:09.558 +00:00] [INFO] [gc/gc_tuner.go:81] ["GC Tune done"] ["previous GOGC"=200] ["heapuse "=169] ["total memory"=114338] ["next GC"=436] ["new GOGC"=200]
    [2022/11/23 11:04:09.590 +00:00] [ERROR] [querynode/segment_loader.go:178] ["load segment failed when load data into memory"] [collectionID=437543825052155869] [segmentType=Sealed] [partitionID=437543825052155870] [segmentID=437543825058760097] [error="All attempts results:\nattempt #1:NoSuchKey(key=stats_log/437543825052155869/437543825052155870/437543825058760097/100/437543825058760104)\nattempt #2:NoSuchKey(key=stats_log/437543825052155869/437543825052155870/437543825058760097/100/437543825058960234)\nattempt #3:NoSuchKey(key=stats_log/437543825052155869/437543825052155870/437543825058760097/100/437543825059160331)\nattempt #4:NoSuchKey(key=stats_log/437543825052155869/437543825052155870/437543825058760097/100/437543825059360430)\nattempt #5:NoSuchKey(key=stats_log/437543825052155869/437543825052155870/437543825058760097/100/437543825059360537)\nattempt #6:NoSuchKey(key=stats_log/437543825052155869/437543825052155870/437543825058760097/100/437543825059760643)\nattempt #7:NoSuchKey(key=stats_log/437543825052155869/437543825052155870/437543825058760097/100/437543825059760742)\nattempt #8:NoSuchKey(key=stats_log/437543825052155869/437543825052155870/437543825058760097/100/437543825060160844)\n"] [stack="github.com/milvus-io/milvus/internal/querynode.(*segmentLoader).LoadSegment.func3\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/segment_loader.go:178\ngithub.com/milvus-io/milvus/internal/util/funcutil.ProcessFuncParallel.func3\n\t/go/src/github.com/milvus-io/milvus/internal/util/funcutil/parallel.go:83"]
    [2022/11/23 11:04:09.590 +00:00] [ERROR] [funcutil/parallel.go:85] [loadSegmentFunc] [error="All attempts results:\nattempt #1:NoSuchKey(key=stats_log/437543825052155869/437543825052155870/437543825058760097/100/437543825058760104)\nattempt #2:NoSuchKey(key=stats_log/437543825052155869/437543825052155870/437543825058760097/100/437543825058960234)\nattempt #3:NoSuchKey(key=stats_log/437543825052155869/437543825052155870/437543825058760097/100/437543825059160331)\nattempt #4:NoSuchKey(key=stats_log/

neiblegy avatar Nov 23 '22 12:11 neiblegy

@neiblegy will this collection eventually be loaded? The NoSuchKey error is caused by compaction, and v2.2.0 should be able to handle it. How long does loading take after you changed the timeout?

yah01 avatar Nov 24 '22 02:11 yah01

For this problem, we would need to assist remotely. Could you send me your contact information by email so that I can reach you? My email is [email protected]

Cheung7788 avatar Nov 24 '22 02:11 Cheung7788

@neiblegy

Cheung7788 avatar Nov 24 '22 02:11 Cheung7788

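[image: https://user-images.githubusercontent.com/12216890/203685120-21c325ca-61ff-445b-b78c-b855ce777db4.png]

@neiblegy According to the sizing tool (https://milvus.io/tools/sizing), it may be impossible to load 15 replicas with 1TB of memory.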

yah01 avatar Nov 24 '22 03:11 yah01

No. It took too long, with the retry-loop log messages repeating, so I gave up.

neiblegy avatar Nov 24 '22 03:11 neiblegy

I have 5 physical machines, each with 80 cores + 1TB of memory, so the 5 machines have 400 cores + 5TB of memory in total. Each pod is limited to 20 cores + 300GiB, and I run 15 replicas with these resources.

neiblegy avatar Nov 24 '22 03:11 neiblegy

I also encountered this after upgrading to 2.2.0.

Use this snippet before using the collection (for example, I run it after reloading my web app):

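    collection = get_my_collection()  # user-defined helper returning a pymilvus Collection
    if not collection.has_index():
        build_index()  # user-defined helper that creates the index
    collection.load()  # load the collection before searching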

@neiblegy hope it helps

hylarucoder avatar Nov 24 '22 08:11 hylarucoder

NoSuchKey for stats log:

    [2022/12/05 08:34:23.378 +00:00] [WARN] [storage/minio_chunk_manager.go:236] ["failed to stat object"] [path=stats_log/437836752123592712/437836752123592713/437836752144599604/100/437836752144599609] [error="The specified key does not exist."]
    [2022/12/05 08:34:23.414 +00:00] [WARN] [storage/minio_chunk_manager.go:236] ["failed to stat object"] [path=stats_log/437836752123592712/437836752123592713/437836752144599604/100/437836752144599627] [error="The specified key does not exist."]
    [2022/12/05 08:34:23.452 +00:00] [WARN] [storage/minio_chunk_manager.go:236] ["failed to stat object"] [path=stats_log/437836752123592712/437836752123592713/437836752144599604/100/437836752144599637] [error="The specified key does not exist."]
    [2022/12/05 08:34:23.497 +00:00] [WARN] [storage/minio_chunk_manager.go:236] ["failed to stat object"] [path=stats_log/437836752123592712/437836752123592713/437836752144599604/100/437836752144799647] [error="The specified key does not exist."]
    [2022/12/05 08:34:23.517 +00:00] [WARN] [storage/minio_chunk_manager.go:236] ["failed to stat object"] [path=stats_log/437836752123592712/437836752123592713/437836752144599604/100/437836752144799657] [error="The specified key does not exist."]
    [2022/12/05 08:34:23.556 +00:00] [WARN] [storage/minio_chunk_manager.go:236] ["failed to stat object"] [path=stats_log/437836752123592712/437836752123592713/437836752144599604/100/437836752144799667] [error="The specified key does not exist."]
    [2022/12/05 08:34:23.604 +00:00] [WARN] [storage/minio_chunk_manager.go:236] ["failed to stat object"] [path=stats_log/437836752123592712/437836752123592713/437836752144599604/100/437836752144799694] [error="The specified key does not exist."]
    [2022/12/05 08:34:23.653 +00:00] [WARN] [querynode/shard_cluster.go:651] ["follower load segment failed"] [collectionID=437836752123592712] [channel=by-dev-rootcoord-dml_1_437836752123592712v1] [replicaID=437836784121151489] [dstNodeID=7] [segmentIDs="[437836752144599604]"] [reason="All attempts results:\nattempt #1:NoSuchKey(key=stats_log/437836752123592712/437836752123592713/437836752144599604/100/437836752144599609)\nattempt #2:NoSuchKey(key=stats_log/437836752123592712/437836752123592713/437836752144599604/100/437836752144599627)\nattempt #3:NoSuchKey(key=stats_log/437836752123592712/437836752123592713/437836752144599604/100/437836752144599637)\nattempt #4:NoSuchKey(key=stats_log/437836752123592712/437836752123592713/437836752144599604/100/437836752144799647)\nattempt #5:NoSuchKey(key=stats_log/437836752123592712/437836752123592713/437836752144599604/100/437836752144799657)\nattempt #6:NoSuchKey(key=stats_log/437836752123592712/437836752123592713/437836752144599604/100/437836752144799667)\nattempt #7:NoSuchKey(key=stats_log/437836752123592712/437836752123592713/437836752144599604/100/437836752144799694)\n"]
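
To see which objects are reported missing, log lines like these can be grouped by segment. This is a minimal sketch: the regular expression is inferred only from the excerpts in this thread, the path layout `stats_log/<collection>/<partition>/<segment>/<field>/<log>` is an assumption based on the IDs that appear alongside the paths in the same log entries, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Pattern inferred from the "failed to stat object" excerpts in this thread.
MISSING_KEY = re.compile(
    r'\["failed to stat object"\] \[path=(stats_log/[0-9/]+)\]'
)


def missing_stats_logs(log_text):
    """Map segment ID -> list of stats_log paths reported as missing."""
    by_segment = defaultdict(list)
    for path in MISSING_KEY.findall(log_text):
        # Assumed layout: stats_log/<collection>/<partition>/<segment>/<field>/<log>
        parts = path.split("/")
        if len(parts) == 6:
            by_segment[parts[3]].append(path)
    return dict(by_segment)


sample = (
    '[2022/12/05 08:34:23.378 +00:00] [WARN] [storage/minio_chunk_manager.go:236] '
    '["failed to stat object"] '
    '[path=stats_log/437836752123592712/437836752123592713/437836752144599604/100/437836752144599609] '
    '[error="The specified key does not exist."]'
)
for segment, paths in missing_stats_logs(sample).items():
    print(segment, len(paths))  # 437836752144599604 1
```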

yah01 avatar Dec 06 '22 10:12 yah01

We've seen the NoSuchKey issue in other production environments, so promoting this to critical/urgent.

xiaofan-luan avatar Dec 09 '22 02:12 xiaofan-luan

We've seen the NoSuchKey issue in other production environments, so promoting this to critical/urgent.

In version 2.2.2, I hit the same NoSuchKey problem on a loaded collection, and the bucket does not contain the corresponding path. It seems to occur sporadically, but I think it is a hidden risk. Do you have any idea what causes it and how to avoid it?

bingbb47 avatar Jan 13 '23 02:01 bingbb47

We've seen the NoSuchKey issue in other production environments, so promoting this to critical/urgent.

In version 2.2.2, I hit the same NoSuchKey problem on a loaded collection, and the bucket does not contain the corresponding path. It seems to occur sporadically, but I think it is a hidden risk. Do you have any idea what causes it and how to avoid it?

Exactly. I'm investigating it; failure recovery will fail because the file is corrupted. @yah01 any clue yet?

xiaofan-luan avatar Jan 13 '23 09:01 xiaofan-luan

Other users have encountered the same problem: https://github.com/milvus-io/milvus/discussions/22303 and https://github.com/milvus-io/milvus/discussions/22300

yhmo avatar Feb 22 '23 02:02 yhmo

This should have been fixed @neiblegy

yah01 avatar May 17 '23 11:05 yah01

OK, got it.

neiblegy avatar May 17 '23 11:05 neiblegy

I'm closing this issue. Please feel free to open a new one if it is not fixed in v2.2.13.

yanliang567 avatar Aug 15 '23 01:08 yanliang567