milvus
[Enhancement]: Loading speed optimization in the serverless mode
Is there an existing issue for this?
- [X] I have searched the existing issues
What would you like to be added?
Currently, the loading process in serverless mode can be rather slow, which makes search and query latency very high and unacceptable.
Why is this needed?
No response
Anything else?
No response
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
I meet the same issue: loading a collection is very slow. Did you solve this problem?
What problem are you facing? I don't think you are facing the same issue we are discussing here. If loading a collection is slow, you usually need to check:
- how many segments you have; too many segments will make the load slow (see the sketch below for a quick way to inspect this)
- whether adding more querynodes helps; more querynodes improve load parallelism and speed
- the tuning parameters, especially if you have a large pod size; we encourage you to share logs and pprof output and explain your use case so we can help
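For the segment check, here is a minimal sketch assuming pymilvus 2.x; the endpoint and collection name are placeholders matching the test script later in this thread. Note that get_query_segment_info only reports segments that are already loaded on querynodes, so run it after the load finishes.

# Minimal sketch, assuming pymilvus 2.x: inspect how many segments a loaded
# collection is being served from. Many small segments usually mean slow loads.
from pymilvus import connections, utility

connections.connect(host="xxx", port="19530")   # placeholder endpoint

collection_name = "loadtest"                    # placeholder collection
segments = utility.get_query_segment_info(collection_name)
print("number of loaded segments:", len(segments))
for seg in segments:
    print(seg)   # per-segment details: row count, memory size, serving querynode

# For deeper digging, Milvus also exposes Go pprof handlers (typically under
# /debug/pprof on the metrics port, 9091); verify the port for your deployment.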
@xiaofan-luan Report from my production environment: Milvus version: v2.3.1, mode: DISTRIBUTED (all nodes in one pod), Loaded Collections: 552, All Collections: 552, Entities: 32,039, embedding size: 3072, object store: OSS.
Now even loading an empty collection costs a lot of time. Here is my test script:
import random
import time

from pymilvus import (
    connections,
    FieldSchema, CollectionSchema, DataType,
    Collection,
    utility,
)

_HOST = 'xxx'
_PORT = '19530'

if __name__ == '__main__':
    connections.connect(host=_HOST, port=_PORT)

    dim = 512
    collection_name = "loadtest"
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)

    field1 = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False)
    field2 = FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim)
    schema = CollectionSchema(fields=[field1, field2])
    collection = Collection(name=collection_name, schema=schema)
    print("\ncollection created:", collection_name)

    index_param = {
        "index_type": "IVF_FLAT",
        "params": {"nlist": 256},
        "metric_type": "L2",
    }
    collection.create_index("embedding", index_param)

    num = 10000
    data = [
        [i for i in range(num)],
        [[random.random() for _ in range(dim)] for _ in range(num)],
    ]
    # Insert is commented out, so the collection is empty when loaded.
    '''
    collection.insert(data)
    collection.flush()
    print("Insert", num, "vectors")
    print("Collection row count:", collection.num_entities)
    '''

    start = time.time()
    collection.load()
    end = time.time()
    print("Load collection, time cost:", (end - start) * 1000, "ms")
Script output:
collection created: loadtest
Load collection, time cost: 39166.789293289185 ms
Try 2.3.15 and see if it improves. Also, load is a one-time DDL operation, so 30s is not slow at all. Even though the collection is small and the load performance can certainly be improved, when you work with a large dataset a load is expected to take more than 10s.
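To illustrate that the load is a one-time cost, here is a rough sketch (pymilvus 2.x, reusing the loadtest collection and the placeholder endpoint from the script above): load is paid once, and subsequent searches go straight to the already-loaded segments. Since the insert block is commented out the collection is empty and the searches return nothing, but the round trips still show that no reload happens.

# Rough sketch: pay the load once, then time a few searches on the loaded collection.
import random
import time

from pymilvus import Collection, connections

connections.connect(host="xxx", port="19530")   # placeholder endpoint
collection = Collection("loadtest")             # placeholder collection from the script above

start = time.time()
collection.load()                               # one-time cost
print("load:", (time.time() - start) * 1000, "ms")

dim = 512
search_params = {"metric_type": "L2", "params": {"nprobe": 16}}
for i in range(3):
    query = [[random.random() for _ in range(dim)]]
    start = time.time()
    collection.search(query, anns_field="embedding", param=search_params, limit=10)
    print("search", i, ":", (time.time() - start) * 1000, "ms")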
@xiaofan-luan But when I test a Docker Compose STANDALONE Milvus, the same load script takes only about 3s, which is roughly 10 times faster than the DISTRIBUTED Milvus.
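For an apples-to-apples comparison, a sketch like the one below (pymilvus 2.x; the host names are placeholders for your standalone and distributed endpoints) times the same cold load against both deployments through named connection aliases.

# Sketch: time a cold load of the same collection on two deployments.
import time

from pymilvus import Collection, connections

# Placeholder endpoints -- substitute your standalone and distributed hosts.
endpoints = {
    "standalone": {"host": "standalone-host", "port": "19530"},
    "distributed": {"host": "distributed-host", "port": "19530"},
}

for alias, conn in endpoints.items():
    connections.connect(alias=alias, host=conn["host"], port=conn["port"])
    collection = Collection("loadtest", using=alias)
    try:
        collection.release()    # force a cold load; ignore if not loaded yet
    except Exception:
        pass
    start = time.time()
    collection.load()
    print(alias, "load:", (time.time() - start) * 1000, "ms")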
The implementation logic of standalone and distributed is exactly the same, and they should behave the same for most use cases. Please provide logs if you need help investigating the details.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.