
[Bug]: Collection not getting loaded, also loading progress is shown as 0, "no flush channel found for the segment, unable to flush", "Failed to get shard delegator, channel not found"

Open maheshchil opened this issue 1 year ago • 20 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version:2.3.3
- Deployment mode(standalone or cluster): Milvus cluster deployed in kubernetes
- MQ type(rocksmq, pulsar or kafka): External Kafka (AWS MSK)   
- SDK version(e.g. pymilvus v2.0.0rc2): Pymilvus 2.3.3
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

After ingesting around 1.7 billion entities across 100 collections, when I try to do collection.load(), I get a "collection not loaded" exception, and the loading progress is shown as 0 as well.

These are the key errors I see in the data coords and data nodes: "no flush channel found for the segment, unable to flush", "Failed to get shard delegator, channel not found".
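
For reference, this is roughly the kind of load call and progress check that produces the behaviour above; a minimal sketch, assuming pymilvus 2.3.x and a placeholder connection, not the exact client code from the issue:

from pymilvus import connections, Collection, utility

# Placeholder connection details, assumed for illustration
connections.connect(alias="default", host="localhost", port="19530")

collection = Collection("testCollectionName")

# On the large cluster this never completes: the client reports
# a "collection not loaded" error
collection.load()

# Loading progress stays at 0 for the affected collections
print(utility.loading_progress("testCollectionName"))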

Expected Behavior

The collection should load successfully and search operations should also succeed.

I tried with a single collection and 31 million entities ingested, and it works as expected.

But it is not working with 100 collections and 1.7 billion entities.

Steps To Reproduce

No response

Milvus Log

milvus-log.tar.gz

Anything else?

Here is a summary of the two clusters:

milvusdiskann1 -> this cluster is used for 1.7 billion entities and 100 collections; the collections are not getting loaded, and thus searches are not succeeding.

milvusdiskannsmall -> this cluster is used for 31 million entities and 1 collection; the collection loads successfully and search also succeeds.


maheshchil avatar Dec 22 '23 09:12 maheshchil

Here are a few more details:

- Milvus version: 2.3.3
- Deployment mode(standalone or cluster): Milvus cluster deployed in kubernetes
- MQ type(rocksmq, pulsar or kafka):   External Kafka (AWS MSK)
- SDK version(e.g. pymilvus v2.0.0rc2): Pymilvus 2.3.3
- num_shards of collection: 1
- Index segment size: 512 MB (default)
- Vector dim: 768
- metric_type: L2
- index_type: DISKANN

For milvusdiskann1 cluster

Scale: 1.7 billion entities (around)
num of collections: 100
num of query nodes: 70 (R5ad.8xlarge / R5ad.12xlarge instance types)
num of datanodes: 9
num of indexnodes: 8
num of proxy: 3

Schema we are using

from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
)

fields = [
    FieldSchema(name='id', dtype=DataType.INT64, description="", is_primary=True, auto_id=True),
    FieldSchema(name='store_address', dtype=DataType.VARCHAR, description="", max_length=512),
    FieldSchema(name='review', dtype=DataType.VARCHAR, description="", max_length=16384),
    FieldSchema(name='vector', dtype=DataType.FLOAT_VECTOR, description="", dim=768, is_index=True),
]

schema = CollectionSchema(
    fields=fields,
    description="",
    enable_dynamic_field=True,
)

collection = Collection(name='testCollectionName', schema=schema,
                        using='default',
                        shards_num=1)
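
Based on the parameters listed above (index_type=DISKANN, metric_type=L2), the index build and load on each collection presumably look roughly like the following; a hedged sketch, not the reporter's exact code:

from pymilvus import connections, Collection

# Placeholder connection details, assumed for illustration
connections.connect(alias="default", host="localhost", port="19530")

collection = Collection(name='testCollectionName')

# DISKANN index on the 768-dim vector field, as described above
collection.create_index(
    field_name='vector',
    index_params={"index_type": "DISKANN", "metric_type": "L2"},
)

collection.load()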

maheshchil avatar Dec 22 '23 09:12 maheshchil

There are a few more details which I added to this Milvus discussion: https://github.com/milvus-io/milvus/discussions/29420

maheshchil avatar Dec 22 '23 09:12 maheshchil

/assign @xige-16 could you help check this issue?

xiaofan-luan avatar Dec 22 '23 10:12 xiaofan-luan

Besides memory, one more thing we need to check is whether the query nodes have enough disk for loading. @xige-16 @maheshchil

yanliang567 avatar Dec 23 '23 05:12 yanliang567

@yanliang567 like I mentioned in the details above, we have used 70 query nodes, and each node is of the r5ad.8xlarge (or) r5ad.12xlarge instance type.

Although the Milvus sizing tool recommended around 57 nodes, would this still not suffice?


maheshchil avatar Dec 23 '23 05:12 maheshchil

So, here is another classic example where I am facing search failures.

For the milvusdiskannsmall cluster, which has a single collection and 64 million entities, the collection initially loaded successfully and search was also successful.

But when I tried it one day later, although the collection was loading, the search operation failed with the error: "<MilvusException: (code=46, message=failed to search: segment=446490888346061635: segment lacks: channel=by-dev-milvusdiskannsmall-dml_0_446490888309311183v0: channel not available)>"
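
For context, the failing search is of roughly this shape; a minimal sketch with a placeholder query vector and assumed DISKANN search parameters, not the actual client code:

from pymilvus import Collection, MilvusException

collection = Collection("testCollectionName")

try:
    results = collection.search(
        data=[[0.0] * 768],  # placeholder 768-dim query vector (dim taken from the schema)
        anns_field="vector",
        param={"metric_type": "L2", "params": {"search_list": 100}},
        limit=10,
    )
except MilvusException as e:
    # e.g. code=46: "segment lacks: channel=...: channel not available"
    print(e)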

Also, here is a log from the querynode:

milvusdiskannsmall-querynode-748887c9b8-77jpj [2023/12/24 06:02:45.335 +00:00] [WARN] [delegator/delegator_data.go:394] ["worker failed to load segments"] [traceID=c0c9488720fa14f4918ec05eb5942414] [collectionID=446490888309311183] [channel=by-dev-milvusdiskannsmall-dml_0_446490888309311183v0] [replicaID=446490892033064961] [workID=72] [segments="[446490888401013499]"] [error="limit=256: request limit exceeded"] [errorVerbose="limit=256: request limit exceeded\n(1) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/pkg/util/merr.WrapErrServiceRequestLimitExceeded\n  | \t/go/src/github.com/milvus-io/milvus/pkg/util/merr/utils.go:300\n  | github.com/milvus-io/milvus/internal/querynodev2/segments.(*segmentLoader).requestResource\n  | \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/segments/segment_loader.go:373\n  |

The search operation was successful a day back and now it is not.

maheshchil avatar Dec 24 '23 06:12 maheshchil

milvusdiskannsmall1-log.tar.gz

Logs for milvusdiskannsmall cluster

maheshchil avatar Dec 24 '23 06:12 maheshchil

@yah01 this seems to be the bug that 2.3.4 should fix?

xiaofan-luan avatar Dec 24 '23 12:12 xiaofan-luan

Before we release next week, you can try:

- grpc.client.minResetInterval -> 30000
- queryCoord.taskMergeCap -> 1
- common.threadCoreCoefficient.highPriority -> 32

and see if it works.

xiaofan-luan avatar Dec 24 '23 13:12 xiaofan-luan

/assign @yah01

yanliang567 avatar Dec 25 '23 00:12 yanliang567

/assign @congqixia @congqixia was this issue fixed in the 2.3 branch?

yanliang567 avatar Dec 25 '23 00:12 yanliang567

For the load problem, https://github.com/milvus-io/milvus/pull/29192 fixed this; the problem is that the target observer may stop working, and more collections/partitions cause this problem with higher probability.

yah01 avatar Dec 25 '23 03:12 yah01

@congqixia any comment about the "segment lacks" error?

yah01 avatar Dec 25 '23 03:12 yah01

Found lots of "request resource failed", which may trigger the connection reset issue as well; fixed in the latest 2.3 branch: PR #29061

congqixia avatar Dec 25 '23 03:12 congqixia

Since the log is not complete, we cannot be sure of the root cause for segment 446490888338773121 going missing; it could be the known issue fixed by #29344

congqixia avatar Dec 25 '23 03:12 congqixia

@congqixia For milvusdiskann1 cluster: I have put the log dump here: https://github.com/milvus-io/milvus/issues/29426#issuecomment-1868210233

For milvusdiskannsmall cluster: logs are here: https://github.com/milvus-io/milvus/issues/29426#issuecomment-1868447642

These are the logs I extracted using the Milvus-provided script.

Are these logs still not sufficient?

maheshchil avatar Jan 03 '24 03:01 maheshchil

@xiaofan-luan

  1. Does 2.3.4 cover a fix for this?
  2. With 2.3.4, will single-collection loading and searching work consistently?
  3. Are there any limitations on the number of entities we can ingest into a single collection?
  4. Also, are there any limitations on the number of collections with DiskANN for collection loading and searching to work successfully?

maheshchil avatar Jan 03 '24 04:01 maheshchil

  1. I think we have fixed the issues you met; please retry with Milvus v2.3.4.
  2. Once the collection is loaded, you can search, and you don't need to load the collection again, even with new data inserted.
  3. Technically, no.
  4. It depends on how big the Milvus cluster is (how much hardware resource); in v2.3.4 I believe 2k collections is reasonable for your cluster.

yanliang567 avatar Jan 04 '24 01:01 yanliang567

@maheshchil please feel free to keep us posted if any updates.

yanliang567 avatar Jan 08 '24 01:01 yanliang567

Closing this issue for now. Please feel free to file a new one if anyone meets it again.

yanliang567 avatar Jul 08 '24 02:07 yanliang567