
[Bug]: Milvus query Nodes out of memory, not buffering to disk

Open ThomasAlxDmy opened this issue 2 years ago • 9 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.2.2
- Deployment mode(standalone or cluster): cluster AWS (not on K8s)
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): github.com/milvus-io/milvus-sdk-go/v2 v2.2.0
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: c5ad.8xlarge => 32 VCPU, 64GB
- GPU: 
- Others:

Current Behavior

The collection grew to around 130 million vectors, and the index no longer fits in 64 GB.

Here's my query node config: cacheSize is 50 GB. I'd expect it to do LRU eviction past that point since disk is enabled, but it keeps loading into memory until usage exceeds 90%.

# Related configuration of queryNode, used to run hybrid search between vector and scalar data.
queryNode:
  cacheSize: 50 # GB, default 32 GB, `cacheSize` is the memory used for caching data for faster query. The `cacheSize` must be less than system memory size.
  port: 21123
  loadMemoryUsageFactor: 3 # The multiply factor of calculating the memory usage while loading segments
  enableDisk: true # enable querynode load disk index, and search on disk index
  maxDiskUsagePercentage: 95

  stats:
    publishInterval: 1000 # Interval for querynode to report node information (milliseconds)
  dataSync:
    flowGraph:
      maxQueueLength: 1024 # Maximum length of task queue in flowgraph
      maxParallelism: 1024 # Maximum number of tasks executed in parallel in the flowgraph
  # Segcore will divide a segment into multiple chunks to enable small index
  segcore:
    chunkRows: 1024 # The number of vectors in a chunk.
    # Note: we have disabled segment small index since @2022.05.12. So below related configurations won't work.
    # We won't create small index for growing segments and search on these segments will directly use bruteforce scan.
    smallIndex:
      nlist: 2048 # small index nlist; recommended to be sqrt(chunkRows), must be smaller than chunkRows/8
      nprobe: 1 # nprobe for searching the small index, based on your accuracy requirement; must be smaller than nlist
  cache:
    enabled: true
    memoryLimit: 2147483648 # 2 GB, 2 * 1024 *1024 *1024

  scheduler:
    receiveChanSize: 10240
    unsolvedQueueSize: 10240
    # maxReadConcurrentRatio is the concurrency ratio of read task (search task and query task).
    # Max read concurrency would be the value of `runtime.NumCPU * maxReadConcurrentRatio`.
    # It defaults to 2.0, which means max read concurrency would be the value of runtime.NumCPU * 2.
    # Max read concurrency must be greater than or equal to 1, and less than or equal to runtime.NumCPU * 100.
    maxReadConcurrentRatio: 2.0 # (0, 100]
    cpuRatio: 30.0 # ratio used to estimate read task cpu usage.

  grouping:
    enabled: true
    maxNQ: 1000
    topKMergeRatio: 10.0

Expected Behavior

Following https://github.com/milvus-io/milvus/issues/16893, it seems like the LRU cache should have kicked in once memory usage reached around 50 GB?

Steps To Reproduce

No response

Milvus Log

Query node:

[2023/01/15 23:08:42.414 +00:00] [INFO] [querynode/load_segment_task.go:40] ["LoadSegmentTask PreExecute start"] [msgID=11054]
[2023/01/15 23:08:42.414 +00:00] [INFO] [querynode/load_segment_task.go:66] ["LoadSegmentTask PreExecute done"] [msgID=11054]
[2023/01/15 23:08:42.414 +00:00] [INFO] [querynode/load_segment_task.go:71] ["LoadSegmentTask Execute start"] [msgID=11054]
[2023/01/15 23:08:42.414 +00:00] [INFO] [querynode/segment_loader.go:103] ["segmentLoader start loading..."] [collectionID=438744159027462165] [segmentType=Sealed] [segmentNum=2]
[2023/01/15 23:08:42.414 +00:00] [WARN] [querynode/shard_cluster.go:651] ["follower load segment failed"] [collectionID=438744159027462165] [channel=by-dev-rootcoord-dml_49_438744159027462165v19] [replicaID=438744281082494977] [dstNodeID=7] [segmentIDs="[438744159173585301]"] [reason="load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58644 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.414 +00:00] [WARN] [querynode/impl_utils.go:41] ["shard cluster failed to load segments"] [traceID=3a797fa2361574dc] [shard=by-dev-rootcoord-dml_49_438744159027462165v19] [segmentIDs="[438744159173585301]"] [error="follower 7 failed to load segment, reason load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58644 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.414 +00:00] [WARN] [querynode/impl.go:498] ["load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58714 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.415 +00:00] [INFO] [querynode/segment_loader.go:949] ["predict memory and disk usage while loading (in MiB)"] [collectionID=438744159027462165] [concurrency=2] [memUsage=59028] [memUsageAfterLoad=58715] [diskUsageAfterLoad=0]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_filter_dm_node.go:91] ["Filter invalid message in QueryNode"] [traceID=5e1aa9cfac367d16]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_filter_dm_node.go:91] ["Filter invalid message in QueryNode"] [traceID=5e1aa9cfac367d16]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:201] ["insertNode operator"] [segmentID=438744159176945086] ["insert size"=21] ["insert offset"=13033] [segmentID=438744159176945086] [collectionID=438744159027462165] [vchannel=by-dev-rootcoord-dml_32_438744159027462165v2]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:201] ["insertNode operator"] [segmentID=438744159176945464] ["insert size"=17] ["insert offset"=12862] [segmentID=438744159176945464] [collectionID=438744159027462165] [vchannel=by-dev-rootcoord-dml_52_438744159027462165v22]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_filter_dm_node.go:91] ["Filter invalid message in QueryNode"] [traceID=5e1aa9cfac367d16]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_filter_dm_node.go:91] ["Filter invalid message in QueryNode"] [traceID=5e1aa9cfac367d16]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:201] ["insertNode operator"] [segmentID=438744159176945078] ["insert size"=15] ["insert offset"=13041] [segmentID=438744159176945078] [collectionID=438744159027462165] [vchannel=by-dev-rootcoord-dml_49_438744159027462165v19]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_filter_dm_node.go:91] ["Filter invalid message in QueryNode"] [traceID=5e1aa9cfac367d16]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:201] ["insertNode operator"] [segmentID=438744159176945367] ["insert size"=15] ["insert offset"=12868] [segmentID=438744159176945367] [collectionID=438744159027462165] [vchannel=by-dev-rootcoord-dml_56_438744159027462165v26]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:397] ["Do insert done"] [collectionID=438744159027462165] [segmentID=438744159176945464] [len=17]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:397] ["Do insert done"] [collectionID=438744159027462165] [segmentID=438744159176945086] [len=21]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:201] ["insertNode operator"] [segmentID=438744159176945077] ["insert size"=21] ["insert offset"=13002] [segmentID=438744159176945077] [collectionID=438744159027462165] [vchannel=by-dev-rootcoord-dml_45_438744159027462165v15]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:397] ["Do insert done"] [collectionID=438744159027462165] [segmentID=438744159176945078] [len=15]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:397] ["Do insert done"] [collectionID=438744159027462165] [segmentID=438744159176945367] [len=15]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:397] ["Do insert done"] [collectionID=438744159027462165] [segmentID=438744159176945077] [len=21]
[2023/01/15 23:08:42.416 +00:00] [WARN] [querynode/shard_cluster.go:651] ["follower load segment failed"] [collectionID=438744159027462165] [channel=by-dev-rootcoord-dml_32_438744159027462165v2] [replicaID=438744281082494977] [dstNodeID=7] [segmentIDs="[438744159169337476,438744159173585324]"] [reason="load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58695 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.416 +00:00] [WARN] [querynode/impl_utils.go:41] ["shard cluster failed to load segments"] [traceID=868d004778c1b59] [shard=by-dev-rootcoord-dml_32_438744159027462165v2] [segmentIDs="[438744159169337476,438744159173585324]"] [error="follower 7 failed to load segment, reason load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58695 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.417 +00:00] [INFO] [querynode/segment_loader.go:949] ["predict memory and disk usage while loading (in MiB)"] [collectionID=438744159027462165] [concurrency=1] [memUsage=58872] [memUsageAfterLoad=58715] [diskUsageAfterLoad=0]
[2023/01/15 23:08:42.417 +00:00] [ERROR] [querynode/segment_loader.go:125] ["load failed, OOM if loaded"] [collectionID=438744159027462165] [segmentType=Sealed] ["loadSegmentRequest msgID"=11054] [error="load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 52 MB, concurrency = 1, usedMemAfterLoad = 58715 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"] [stack="github.com/milvus-io/milvus/internal/querynode.(*segmentLoader).LoadSegment\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/segment_loader.go:125\ngithub.com/milvus-io/milvus/internal/querynode.(*loadSegmentsTask).Execute\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/load_segment_task.go:81\ngithub.com/milvus-io/milvus/internal/querynode.(*taskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:109\ngithub.com/milvus-io/milvus/internal/querynode.(*taskScheduler).taskLoop\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:131"]
[2023/01/15 23:08:42.417 +00:00] [WARN] [querynode/load_segment_task.go:125] ["failed to load segment"] [collectionID=438744159027462165] [replicaID=438744281082494977] [error="load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 52 MB, concurrency = 1, usedMemAfterLoad = 58715 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.417 +00:00] [WARN] [querynode/task_scheduler.go:111] ["load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 52 MB, concurrency = 1, usedMemAfterLoad = 58715 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.417 +00:00] [INFO] [querynode/load_segment_task.go:40] ["LoadSegmentTask PreExecute start"] [msgID=10922]
[2023/01/15 23:08:42.417 +00:00] [INFO] [querynode/load_segment_task.go:66] ["LoadSegmentTask PreExecute done"] [msgID=10922]
[2023/01/15 23:08:42.417 +00:00] [INFO] [querynode/load_segment_task.go:71] ["LoadSegmentTask Execute start"] [msgID=10922]
[2023/01/15 23:08:42.417 +00:00] [WARN] [querynode/impl.go:498] ["load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 52 MB, concurrency = 1, usedMemAfterLoad = 58715 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.417 +00:00] [INFO] [querynode/segment_loader.go:103] ["segmentLoader start loading..."] [collectionID=438744159027462165] [segmentType=Sealed] [segmentNum=4]
[2023/01/15 23:08:42.417 +00:00] [DEBUG] [querynode/flow_graph_filter_dm_node.go:91] ["Filter invalid message in QueryNode"] [traceID=5e1aa9cfac367d16]
[2023/01/15 23:08:42.417 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:201] ["insertNode operator"] [segmentID=438744159176945085] ["insert size"=13] ["insert offset"=12973] [segmentID=438744159176945085] [collectionID=438744159027462165] [vchannel=by-dev-rootcoord-dml_37_438744159027462165v7]
[2023/01/15 23:08:42.417 +00:00] [DEBUG] [querynode/flow_graph_filter_dm_node.go:91] ["Filter invalid message in QueryNode"] [traceID=5e1aa9cfac367d16]
[2023/01/15 23:08:42.417 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:201] ["insertNode operator"] [segmentID=438744159176945074] ["insert size"=16] ["insert offset"=13030] [segmentID=438744159176945074] [collectionID=438744159027462165] [vchannel=by-dev-rootcoord-dml_34_438744159027462165v4]
[2023/01/15 23:08:42.417 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:397] ["Do insert done"] [collectionID=438744159027462165] [segmentID=438744159176945085] [len=13]
[2023/01/15 23:08:42.417 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:397] ["Do insert done"] [collectionID=438744159027462165] [segmentID=438744159176945074] [len=16]
[2023/01/15 23:08:42.418 +00:00] [WARN] [querynode/shard_cluster.go:651] ["follower load segment failed"] [collectionID=438744159027462165] [channel=by-dev-rootcoord-dml_45_438744159027462165v15] [replicaID=438744281082494977] [dstNodeID=7] [segmentIDs="[438744159169337483]"] [reason="load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58644 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.418 +00:00] [WARN] [querynode/impl_utils.go:41] ["shard cluster failed to load segments"] [traceID=4e5a2c656f911daa] [shard=by-dev-rootcoord-dml_45_438744159027462165v15] [segmentIDs="[438744159169337483]"] [error="follower 7 failed to load segment, reason load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58644 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.418 +00:00] [INFO] [querynode/segment_loader.go:949] ["predict memory and disk usage while loading (in MiB)"] [collectionID=438744159027462165] [concurrency=4] [memUsage=59440] [memUsageAfterLoad=58817] [diskUsageAfterLoad=0]
[2023/01/15 23:08:42.419 +00:00] [INFO] [querynode/segment_loader.go:949] ["predict memory and disk usage while loading (in MiB)"] [collectionID=438744159027462165] [concurrency=2] [memUsage=59128] [memUsageAfterLoad=58817] [diskUsageAfterLoad=0]
[2023/01/15 23:08:42.420 +00:00] [INFO] [querynode/segment_loader.go:949] ["predict memory and disk usage while loading (in MiB)"] [collectionID=438744159027462165] [concurrency=1] [memUsage=58973] [memUsageAfterLoad=58817] [diskUsageAfterLoad=0]
[2023/01/15 23:08:42.420 +00:00] [ERROR] [querynode/segment_loader.go:125] ["load failed, OOM if loaded"] [collectionID=438744159027462165] [segmentType=Sealed] ["loadSegmentRequest msgID"=10922] [error="load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58817 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"] [stack="github.com/milvus-io/milvus/internal/querynode.(*segmentLoader).LoadSegment\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/segment_loader.go:125\ngithub.com/milvus-io/milvus/internal/querynode.(*loadSegmentsTask).Execute\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/load_segment_task.go:81\ngithub.com/milvus-io/milvus/internal/querynode.(*taskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:109\ngithub.com/milvus-io/milvus/internal/querynode.(*taskScheduler).taskLoop\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:131"]
[2023/01/15 23:08:42.420 +00:00] [WARN] [querynode/load_segment_task.go:125] ["failed to load segment"] [collectionID=438744159027462165] [replicaID=438744281082494977] [error="load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58817 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.420 +00:00] [WARN] [querynode/task_scheduler.go:111] ["load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58817 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.420 +00:00] [WARN] [querynode/impl.go:498] ["load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58817 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.423 +00:00] [INFO] [querynode/impl_utils.go:28] ["LoadSegment start to transfer load with shard cluster"] [traceID=38e3d11b94886abd] [shard=by-dev-rootcoord-dml_52_438744159027462165v22] [segmentIDs="[438744159169337487,438744159171901868]"]
[2023/01/15 23:08:42.423 +00:00] [INFO] [querynode/impl_utils.go:28] ["LoadSegment start to transfer load with shard cluster"] [traceID=5516c7c6158384f4] [shard=by-dev-rootcoord-dml_56_438744159027462165v26] [segmentIDs="[438744159173585304,438744159169337472]"]
[2023/01/15 23:08:42.424 +00:00] [INFO] [querynode/impl_utils.go:28] ["LoadSegment start to transfer load with shard cluster"] [traceID=615cf707af61d80] [shard=by-dev-rootcoord-dml_45_438744159027462165v15] [segmentIDs="[438744159171902064,438744159168051081,438744159173585308,438744159167032303,438744159170549731,438744159166027628]"]
[2023/01/15 23:08:42.424 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159166027628,438744159167032303,438744159168051081,438744159170549731,438744159171902064,438744159173585308]"] [nodeID=6]
[2023/01/15 23:08:42.424 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159166027628,438744159167032303,438744159168051081,438744159170549731,438744159171902064,438744159173585308]"] [timeInQueue=31.841µs]
[2023/01/15 23:08:42.424 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159166027628,438744159167032303,438744159168051081,438744159170549731,438744159171902064,438744159173585308]"] [nodeID=6]
[2023/01/15 23:08:42.424 +00:00] [INFO] [querynode/load_segment_task.go:40] ["LoadSegmentTask PreExecute start"] [msgID=11013]
[2023/01/15 23:08:42.424 +00:00] [INFO] [querynode/load_segment_task.go:66] ["LoadSegmentTask PreExecute done"] [msgID=11013]
[2023/01/15 23:08:42.424 +00:00] [INFO] [querynode/load_segment_task.go:71] ["LoadSegmentTask Execute start"] [msgID=11013]
[2023/01/15 23:08:42.424 +00:00] [INFO] [querynode/impl_utils.go:28] ["LoadSegment start to transfer load with shard cluster"] [traceID=3bd3b0cbefba0fcd] [shard=by-dev-rootcoord-dml_56_438744159027462165v26] [segmentIDs="[438744159168051077,438744159170549858,438744159167032314,438744159171902076]"]
[2023/01/15 23:08:42.424 +00:00] [INFO] [querynode/segment_loader.go:103] ["segmentLoader start loading..."] [collectionID=438744159027462165] [segmentType=Sealed] [segmentNum=6]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159167032314,438744159168051077,438744159170549858,438744159171902076]"] [nodeID=6]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159167032314,438744159168051077,438744159170549858,438744159171902076]"] [timeInQueue=26.181µs]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159167032314,438744159168051077,438744159170549858,438744159171902076]"] [nodeID=6]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159111402585]"] [nodeID=6]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159111402585]"] [timeInQueue=24.721µs]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159111402585]"] [nodeID=6]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl_utils.go:28] ["LoadSegment start to transfer load with shard cluster"] [traceID=736177a458e32205] [shard=by-dev-rootcoord-dml_37_438744159027462165v7] [segmentIDs="[438744159169337474,438744159173585313,438744159168051079]"]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159168051079,438744159169337474,438744159173585313]"] [nodeID=6]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159092122537,438744159169337491,438744159170549732]"] [nodeID=6]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159168051079,438744159169337474,438744159173585313]"] [timeInQueue=25.511µs]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159168051079,438744159169337474,438744159173585313]"] [nodeID=6]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159092122537,438744159169337491,438744159170549732]"] [timeInQueue=22.39µs]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159092122537,438744159169337491,438744159170549732]"] [nodeID=6]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/segment_loader.go:949] ["predict memory and disk usage while loading (in MiB)"] [collectionID=438744159027462165] [concurrency=6] [memUsage=59854] [memUsageAfterLoad=58920] [diskUsageAfterLoad=0]
[2023/01/15 23:08:42.426 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159107185205,438744159169337470,438744159171901857]"] [nodeID=6]
[2023/01/15 23:08:42.426 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159107185205,438744159169337470,438744159171901857]"] [timeInQueue=26.84µs]
[2023/01/15 23:08:42.426 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159107185205,438744159169337470,438744159171901857]"] [nodeID=6]
[2023/01/15 23:08:42.426 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159160605919]"] [nodeID=6]
[2023/01/15 23:08:42.426 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159160605919]"] [timeInQueue=27.14µs]
[2023/01/15 23:08:42.426 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159160605919]"] [nodeID=6]
[2023/01/15 23:08:42.426 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159096540860,438744159167032304,438744159168051117]"] [nodeID=6]
[2023/01/15 23:08:42.426 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159096540860,438744159167032304,438744159168051117]"] [timeInQueue=25.411µs]
[2023/01/15 23:08:42.426 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159096540860,438744159167032304,438744159168051117]"] [nodeID=6]
[2023/01/15 23:08:42.427 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159149962214,438744159166027630,438744159168051074]"] [nodeID=6]
[2023/01/15 23:08:42.427 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159149962214,438744159166027630,438744159168051074]"] [timeInQueue=28.271µs]
[2023/01/15 23:08:42.427 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159149962214,438744159166027630,438744159168051074]"] [nodeID=6]
[2023/01/15 23:08:42.427 +00:00] [INFO] [querynode/segment_loader.go:949] ["predict memory and disk usage while loading (in MiB)"] [collectionID=438744159027462165] [concurrency=3] [memUsage=59387] [memUsageAfterLoad=58920] [diskUsageAfterLoad=0]
[2023/01/15 23:08:42.427 +00:00] [WARN] [querynode/shard_cluster.go:651] ["follower load segment failed"] [collectionID=438744159027462165] [channel=by-dev-rootcoord-dml_52_438744159027462165v22] [replicaID=438744281082494977] [dstNodeID=5] [segmentIDs="[438744159169337487,438744159171901868]"] [reason="load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58646 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.427 +00:00] [WARN] [querynode/impl_utils.go:41] ["shard cluster failed to load segments"] [traceID=38e3d11b94886abd] [shard=by-dev-rootcoord-dml_52_438744159027462165v22] [segmentIDs="[438744159169337487,438744159171901868]"] [error="follower 5 failed to load segment, reason load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58646 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.427 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159167032317,438744159169337486,438744159171901858]"] [nodeID=6]
[2023/01/15 23:08:42.427 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159167032317,438744159169337486,438744159171901858]"] [timeInQueue=30.921µs]
[2023/01/15 23:08:42.427 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159167032317,438744159169337486,438744159171901858]"] [nodeID=6]
[2023/01/15 23:08:42.427 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159078867149,438744159168051080,438744159169337481,438744159170549864]"] [nodeID=6]
[2023/01/15 23:08:42.428 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159078867149,438744159168051080,438744159169337481,438744159170549864]"] [timeInQueue=35.841µs]
[2023/01/15 23:08:42.428 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159078867149,438744159168051080,438744159169337481,438744159170549864]"] [nodeID=6]
[2023/01/15 23:08:42.428 +00:00] [INFO] [querynode/segment_loader.go:949] ["predict memory and disk usage while loading (in MiB)"] [collectionID=438744159027462165] [concurrency=1] [memUsage=59075] [memUsageAfterLoad=58920] [diskUsageAfterLoad=0]
[2023/01/15 23:08:42.428 +00:00] [ERROR] [querynode/segment_loader.go:125] ["load failed, OOM if loaded"] [collectionID=438744159027462165] [segmentType=Sealed] ["loadSegmentRequest msgID"=11013] [error="load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58920 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"] [stack="github.com/milvus-io/milvus/internal/querynode.(*segmentLoader).LoadSegment\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/segment_loader.go:125\ngithub.com/milvus-io/milvus/internal/querynode.(*loadSegmentsTask).Execute\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/load_segment_task.go:81\ngithub.com/milvus-io/milvus/internal/querynode.(*taskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:109\ngithub.com/milvus-io/milvus/internal/querynode.(*taskScheduler).taskLoop\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:131"]

Anything else?

Run out of memory:

[screenshot: query node memory usage graph showing the node running out of memory]

ThomasAlxDmy avatar Jan 15 '23 23:01 ThomasAlxDmy

@ThomasAlxDmy LRU is not supported for query/search operations for now. cacheSize only applies to caching retrieved data for query requests, and enableDisk only enables the DiskANN index. In short, a collection needs enough memory on the query nodes to serve any search/query operations.
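For reference, the rejection in the logs above is a simple admission check, not an eviction: the node predicts its memory usage after the load and refuses the segment once the projection crosses thresholdFactor × totalMem. A minimal sketch of that check, reconstructed from the fields in the log messages (the function name and signature are illustrative, not the actual Milvus internals):

```python
def would_oom(used_mem_after_load_mb: float,
              total_mem_mb: float,
              threshold_factor: float = 0.9) -> bool:
    """Reject a segment load if projected memory use exceeds the threshold."""
    return used_mem_after_load_mb > total_mem_mb * threshold_factor

# Values from the log: usedMemAfterLoad = 58715 MB, totalMem = 63725 MB.
# 0.9 * 63725 = 57352.5 MB, so the load is refused, matching the WARN logs.
# Note that the configured cacheSize (50 GB) plays no role in this check.
print(would_oom(58715, 63725))  # True
```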

/assign @ThomasAlxDmy

yanliang567 avatar Jan 16 '23 01:01 yanliang567

Got it, thank you.

Do you have an idea of when it will be supported? My collection grows by about 60M vectors per day, so I'm going to hit the physical RAM limit. I don't need to keep the full dataset for now; any idea how I can dynamically reduce the size of the collection (expire older items?)?

Also, I have partitioned my collection. Is it possible to manually release the older partitions? I upgraded the machine to one with more memory and it takes 3+ hours to reload the collection; how can we reduce that time? Is it possible to load the most recent partition first and make the collection available for query?

ThomasAlxDmy avatar Jan 16 '23 02:01 ThomasAlxDmy

  1. If the data has a fixed lifetime, I think you can try time-to-live (TTL) at the collection level to expire the data automatically; see the details in the release notes: https://milvus.io/docs/release_notes.md#v2.2.0.
  2. Using partitions is another option: you can keep a few partitions and load/release them individually to save memory. But there are a few known issues/limitations when you have many partitions, so please keep the partition count below 20. BTW, we are trying to improve partitions in Milvus 2.3, which will be available in H1 2023.
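The two options above can be sketched with pymilvus against a running Milvus 2.2 deployment (a non-runnable configuration sketch: the original poster uses the Go SDK, the collection/partition names here are hypothetical, and the property key and calls follow the Milvus 2.2 documentation rather than this thread):

```python
from pymilvus import connections, Collection, Partition

connections.connect(host="localhost", port="19530")  # adjust to your deployment

# Option 1: collection-level TTL, so entities expire automatically.
coll = Collection("my_collection")  # hypothetical collection name
coll.set_properties(properties={"collection.ttl.seconds": 7 * 24 * 3600})  # 7 days

# Option 2: load/release partitions individually to bound memory usage.
old = Partition(coll, "p_2023_01")   # hypothetical partition name
old.release()                        # free query node memory held by old data
Partition(coll, "p_2023_02").load()  # keep only the recent partition queryable
```

Note that on 2.2 a partition still cannot be loaded while other parts of the collection are loaded; that restriction is what the 2.3 work discussed later in this thread is meant to lift.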

yanliang567 avatar Jan 16 '23 07:01 yanliang567

@yanliang567 tried 1) but it doesn't seem to work https://github.com/milvus-io/milvus/issues/21802

Yeah, I tried with 200 partitions and we ran into problems when trying to load a lot of them; etcd denies it. It also seems you can't load a partition if some part of the collection is already loaded. It would be nice if those were independent: you should be able to load the first partition and then incrementally load older ones without having to release first.

ThomasAlxDmy avatar Jan 18 '23 19:01 ThomasAlxDmy

  1. I commented in #21802; I think you need to upgrade pymilvus.
  2. Yes, that reminds me that this is a known limitation of load/release for partitions, and the community is working to improve it in Milvus 2.3, as you suggested above.

yanliang567 avatar Jan 19 '23 01:01 yanliang567

The partition issue will be fixed in the 2.3 release. We will support loading a partition while the collection is loaded.

xiaofan-luan avatar Jan 19 '23 09:01 xiaofan-luan

@xiaofan-luan confirming the fix for this is going to be in 2.2.4? Can you link the issue/PR for reference?

ThomasAlxDmy avatar Mar 15 '23 21:03 ThomasAlxDmy

This will be in 2.3, related PR https://github.com/milvus-io/milvus/pull/22655

xiaofan-luan avatar Mar 16 '23 02:03 xiaofan-luan

@ThomasAlxDmy Feel free to comment on it.

xiaofan-luan avatar Mar 16 '23 02:03 xiaofan-luan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Aug 02 '23 05:08 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Sep 03 '23 19:09 stale[bot]