[Bug]: Collection not getting loaded, also loading progress is shown as 0, "no flush channel found for the segment, unable to flush", "Failed to get shard delegator, channel not found"
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version:2.3.3
- Deployment mode(standalone or cluster): Milvus cluster deployed in kubernetes
- MQ type(rocksmq, pulsar or kafka): External Kafka (AWS MSK)
- SDK version(e.g. pymilvus v2.0.0rc2): Pymilvus 2.3.3
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
After ingesting around 1.7 billion entities across 100 collections, when I try to call collection.load(), I get a collection-not-loaded exception and the loading progress is shown as 0 as well.
These are the key errors I see in the data coord and data node logs: "no flush channel found for the segment, unable to flush", "Failed to get shard delegator, channel not found".
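A minimal sketch of how this failure shows up on the client side, assuming a cluster reachable at milvus-host:19530 and a collection named reviews_000 (both placeholder names, not the actual ones used):

from pymilvus import Collection, MilvusException, connections, utility

connections.connect(alias="default", host="milvus-host", port="19530")
collection = Collection("reviews_000")
try:
    # Blocks until the collection is loaded or the timeout expires
    collection.load(timeout=600)
except MilvusException as e:
    print(f"load failed: {e}")
    # In the failing case, the reported progress stays at 0
    print(utility.loading_progress("reviews_000"))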
Expected Behavior
The collection gets loaded successfully and the search operation also succeeds.
I tried with a single collection and 31 million entities ingested, and it works as expected,
but it is not working with 100 collections and 1.7 billion entities.
Steps To Reproduce
No response
Milvus Log
Anything else?
Here are the two clusters:
milvusdiskann1 -> cluster used for 1.7 billion entities and 100 collections, where the collection is not getting loaded and thus searches do not succeed
milvusdiskannsmall -> cluster used for 31 million entities and 1 collection, where the collection gets loaded successfully and searches also succeed
Here are a few more details:
- Milvus version: 2.3.3
- Deployment mode(standalone or cluster): Milvus cluster deployed in kubernetes
- MQ type(rocksmq, pulsar or kafka): External Kafka (AWS MSK)
- SDK version(e.g. pymilvus v2.0.0rc2): Pymilvus 2.3.3
num_shards of collection: 1
Index segment size: 512MB (Default)
Vector dim: 768
metric_type: L2
index_type: DISKANN
For milvusdiskann1 cluster
Scale: 1.7 billion entities (around)
num of collections: 100
num of query nodes: 70 (which are of r5ad.8xlarge / r5ad.12xlarge instance types)
num of datanodes: 9
num of indexnodes: 8
num of proxy: 3
Schema we are using:
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

fields = [
    FieldSchema(name='id', dtype=DataType.INT64, description="", is_primary=True, auto_id=True),
    FieldSchema(name='store_address', dtype=DataType.VARCHAR, description="", max_length=512),
    FieldSchema(name='review', dtype=DataType.VARCHAR, description="", max_length=16384),
    FieldSchema(name='vector', dtype=DataType.FLOAT_VECTOR, description="", dim=768, is_index=True),
]
schema = CollectionSchema(
    fields=fields,
    description="",
    enable_dynamic_field=True,
)
collection = Collection(
    name='testCollectionName',
    schema=schema,
    using='default',
    shards_num=1,
)
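For completeness, a hedged sketch of the index-building and load step that matches the parameters listed above (DISKANN, L2, dim 768), reusing the collection object from the schema code; the actual ingestion/indexing script is not included in the report, so the exact values here are assumptions:

collection.create_index(
    field_name="vector",
    index_params={
        "index_type": "DISKANN",
        "metric_type": "L2",
        "params": {},
    },
)
collection.load()  # this is the call that fails on the 100-collection cluster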
There are a few more details that I added to this Milvus discussion: https://github.com/milvus-io/milvus/discussions/29420
/assign @xige-16 could you help check this issue?
Besides memory, one more thing we need to check is whether the query nodes' disks are large enough for loading. @xige-16 @maheshchil
@yanliang567 like I mentioned in the details above, we have used 70 query nodes, each of r5ad.8xlarge or r5ad.12xlarge instance type.
Although the Milvus sizing tool recommended around 57 nodes,
would this still not suffice?
Here is another classic example where I am facing search failures.
For the milvusdiskannsmall cluster, which has a single collection and 64 million entities, initially collection loading was successful and search was also successful,
but when I tried again one day later, although the collection was loading, the search operation failed with this error:
"<MilvusException: (code=46, message=failed to search: segment=446490888346061635: segment lacks: channel=by-dev-milvusdiskannsmall-dml_0_446490888309311183v0: channel not available)>"
Also, a log from the query node:
milvusdiskannsmall-querynode-748887c9b8-77jpj [2023/12/24 06:02:45.335 +00:00] [WARN] [delegator/delegator_data.go:394] ["worker failed to load segments"] [traceID=c0c9488720fa14f4918ec05eb5942414] [collectionID=446490888309311183] [channel=by-dev-milvusdiskannsmall-dml_0_446490888309311183v0] [replicaID=446490892033064961] [workID=72] [segments="[446490888401013499]"] [error="limit=256: request limit exceeded"] [errorVerbose="limit=256: request limit exceeded\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/pkg/util/merr.WrapErrServiceRequestLimitExceeded\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/merr/utils.go:300\n | github.com/milvus-io/milvus/internal/querynodev2/segments.(*segmentLoader).requestResource\n | \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/segments/segment_loader.go:373\n |
The search operation was successful a day back and now it is not.
@yah01 this seems to be the bug that 2.3.4 should fix?
Before we release next week, you can try:
- grpc.client.minResetInterval -> 30000
- queryCoord.taskMergeCap -> 1
- common.threadCoreCoefficient.highPriority -> 32
and see if it works.
/assign @yah01
/assign @congqixia
@congqixia was this issue fixed in the 2.3 branch?
For the load problem, https://github.com/milvus-io/milvus/pull/29192 fixed this; the problem is that the target observer may stop working, and more collections/partitions make this problem more likely.
@congqixia any comment about the "segment lacks" error?
I also found lots of "request resource failed" entries, which may trigger the connection reset issue; that is fixed in the latest 2.3 by PR #29061.
Since the log is not complete, I cannot be sure of the root cause for segment 446490888338773121
going missing;
it could be the known issue fixed by #29344.
@congqixia For milvusdiskann1 cluster: I have put the log dump here: https://github.com/milvus-io/milvus/issues/29426#issuecomment-1868210233
For milvusdiskannsmall cluster: logs are here: https://github.com/milvus-io/milvus/issues/29426#issuecomment-1868447642
These are the logs I extracted using the Milvus-provided script;
are these logs still not sufficient?
@xiaofan-luan
- Does 2.3.4 cover a fix for this?
- With 2.3.4, will single-collection loading and searching work consistently?
- Are there any limitations on the number of entities we can ingest into a single collection?
- Also, are there any limitations on the number of collections with DiskANN for collection loading and searching to work successfully?
- I think we have fixed the issues you met; please retry with Milvus v2.3.4.
- Once the collection is loaded, you can search, and you don't need to load the collection again, even with new data inserted.
- Technically, no.
- It depends on how big the Milvus cluster is (how much hardware resource); in v2.3.4, I believe 2k collections is reasonable for your cluster.
@maheshchil please feel free to keep us posted if there are any updates.
Closing this issue for now. Please feel free to file a new one if anyone meets it again.