[Bug]: Milvus query Nodes out of memory, not buffering to disk
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: 2.2.2
- Deployment mode(standalone or cluster): cluster AWS (not on K8s)
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): github.com/milvus-io/milvus-sdk-go/v2 v2.2.0
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: c5ad.8xlarge => 32 VCPU, 64GB
- GPU:
- Others:
Current Behavior
The collection grew to around 130 million vectors, and the index no longer fits in 64 GB.
Here's my node config: cacheSize is 50 GB. I'd expect it to start LRU eviction after that, since disk is enabled, but it keeps ingesting into memory until usage reaches 90%+.
# Related configuration of queryNode, used to run hybrid search between vector and scalar data.
queryNode:
  cacheSize: 50 # GB, default 32 GB, `cacheSize` is the memory used for caching data for faster query. The `cacheSize` must be less than system memory size.
  port: 21123
  loadMemoryUsageFactor: 3 # The multiply factor of calculating the memory usage while loading segments
  enableDisk: true # enable querynode load disk index, and search on disk index
  maxDiskUsagePercentage: 95
  stats:
    publishInterval: 1000 # Interval for querynode to report node information (milliseconds)
  dataSync:
    flowGraph:
      maxQueueLength: 1024 # Maximum length of task queue in flowgraph
      maxParallelism: 1024 # Maximum number of tasks executed in parallel in the flowgraph
  # Segcore will divide a segment into multiple chunks to enbale small index
  segcore:
    chunkRows: 1024 # The number of vectors in a chunk.
    # Note: we have disabled segment small index since @2022.05.12. So below related configurations won't work.
    # We won't create small index for growing segments and search on these segments will directly use bruteforce scan.
    smallIndex:
      nlist: 2048 # small index nlist, recommend to set sqrt(chunkRows), must smaller than chunkRows/8
      nprobe: 1 # nprobe to search small index, based on your accuracy requirement, must smaller than nlist
  cache:
    enabled: true
    memoryLimit: 2147483648 # 2 GB, 2 * 1024 *1024 *1024
  scheduler:
    receiveChanSize: 10240
    unsolvedQueueSize: 10240
    # maxReadConcurrentRatio is the concurrency ratio of read task (search task and query task).
    # Max read concurrency would be the value of `runtime.NumCPU * maxReadConcurrentRatio`.
    # It defaults to 2.0, which means max read concurrency would be the value of runtime.NumCPU * 2.
    # Max read concurrency must greater than or equal to 1, and less than or equal to runtime.NumCPU * 100.
    maxReadConcurrentRatio: 2.0 # (0, 100]
    cpuRatio: 30.0 # ratio used to estimate read task cpu usage.
  grouping:
    enabled: true
    maxNQ: 1000
    topKMergeRatio: 10.0
Expected Behavior
Following https://github.com/milvus-io/milvus/issues/16893, it seems the LRU cache should have kicked in once memory usage reached around 50 GB?
Steps To Reproduce
No response
Milvus Log
Query node:
[2023/01/15 23:08:42.414 +00:00] [INFO] [querynode/load_segment_task.go:40] ["LoadSegmentTask PreExecute start"] [msgID=11054]
[2023/01/15 23:08:42.414 +00:00] [INFO] [querynode/load_segment_task.go:66] ["LoadSegmentTask PreExecute done"] [msgID=11054]
[2023/01/15 23:08:42.414 +00:00] [INFO] [querynode/load_segment_task.go:71] ["LoadSegmentTask Execute start"] [msgID=11054]
[2023/01/15 23:08:42.414 +00:00] [INFO] [querynode/segment_loader.go:103] ["segmentLoader start loading..."] [collectionID=438744159027462165] [segmentType=Sealed] [segmentNum=2]
[2023/01/15 23:08:42.414 +00:00] [WARN] [querynode/shard_cluster.go:651] ["follower load segment failed"] [collectionID=438744159027462165] [channel=by-dev-rootcoord-dml_49_438744159027462165v19] [replicaID=438744281082494977] [dstNodeID=7] [segmentIDs="[438744159173585301]"] [reason="load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58644 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.414 +00:00] [WARN] [querynode/impl_utils.go:41] ["shard cluster failed to load segments"] [traceID=3a797fa2361574dc] [shard=by-dev-rootcoord-dml_49_438744159027462165v19] [segmentIDs="[438744159173585301]"] [error="follower 7 failed to load segment, reason load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58644 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.414 +00:00] [WARN] [querynode/impl.go:498] ["load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58714 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.415 +00:00] [INFO] [querynode/segment_loader.go:949] ["predict memory and disk usage while loading (in MiB)"] [collectionID=438744159027462165] [concurrency=2] [memUsage=59028] [memUsageAfterLoad=58715] [diskUsageAfterLoad=0]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_filter_dm_node.go:91] ["Filter invalid message in QueryNode"] [traceID=5e1aa9cfac367d16]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_filter_dm_node.go:91] ["Filter invalid message in QueryNode"] [traceID=5e1aa9cfac367d16]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:201] ["insertNode operator"] [segmentID=438744159176945086] ["insert size"=21] ["insert offset"=13033] [segmentID=438744159176945086] [collectionID=438744159027462165] [vchannel=by-dev-rootcoord-dml_32_438744159027462165v2]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:201] ["insertNode operator"] [segmentID=438744159176945464] ["insert size"=17] ["insert offset"=12862] [segmentID=438744159176945464] [collectionID=438744159027462165] [vchannel=by-dev-rootcoord-dml_52_438744159027462165v22]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_filter_dm_node.go:91] ["Filter invalid message in QueryNode"] [traceID=5e1aa9cfac367d16]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_filter_dm_node.go:91] ["Filter invalid message in QueryNode"] [traceID=5e1aa9cfac367d16]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:201] ["insertNode operator"] [segmentID=438744159176945078] ["insert size"=15] ["insert offset"=13041] [segmentID=438744159176945078] [collectionID=438744159027462165] [vchannel=by-dev-rootcoord-dml_49_438744159027462165v19]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_filter_dm_node.go:91] ["Filter invalid message in QueryNode"] [traceID=5e1aa9cfac367d16]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:201] ["insertNode operator"] [segmentID=438744159176945367] ["insert size"=15] ["insert offset"=12868] [segmentID=438744159176945367] [collectionID=438744159027462165] [vchannel=by-dev-rootcoord-dml_56_438744159027462165v26]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:397] ["Do insert done"] [collectionID=438744159027462165] [segmentID=438744159176945464] [len=17]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:397] ["Do insert done"] [collectionID=438744159027462165] [segmentID=438744159176945086] [len=21]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:201] ["insertNode operator"] [segmentID=438744159176945077] ["insert size"=21] ["insert offset"=13002] [segmentID=438744159176945077] [collectionID=438744159027462165] [vchannel=by-dev-rootcoord-dml_45_438744159027462165v15]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:397] ["Do insert done"] [collectionID=438744159027462165] [segmentID=438744159176945078] [len=15]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:397] ["Do insert done"] [collectionID=438744159027462165] [segmentID=438744159176945367] [len=15]
[2023/01/15 23:08:42.416 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:397] ["Do insert done"] [collectionID=438744159027462165] [segmentID=438744159176945077] [len=21]
[2023/01/15 23:08:42.416 +00:00] [WARN] [querynode/shard_cluster.go:651] ["follower load segment failed"] [collectionID=438744159027462165] [channel=by-dev-rootcoord-dml_32_438744159027462165v2] [replicaID=438744281082494977] [dstNodeID=7] [segmentIDs="[438744159169337476,438744159173585324]"] [reason="load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58695 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.416 +00:00] [WARN] [querynode/impl_utils.go:41] ["shard cluster failed to load segments"] [traceID=868d004778c1b59] [shard=by-dev-rootcoord-dml_32_438744159027462165v2] [segmentIDs="[438744159169337476,438744159173585324]"] [error="follower 7 failed to load segment, reason load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58695 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.417 +00:00] [INFO] [querynode/segment_loader.go:949] ["predict memory and disk usage while loading (in MiB)"] [collectionID=438744159027462165] [concurrency=1] [memUsage=58872] [memUsageAfterLoad=58715] [diskUsageAfterLoad=0]
[2023/01/15 23:08:42.417 +00:00] [ERROR] [querynode/segment_loader.go:125] ["load failed, OOM if loaded"] [collectionID=438744159027462165] [segmentType=Sealed] ["loadSegmentRequest msgID"=11054] [error="load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 52 MB, concurrency = 1, usedMemAfterLoad = 58715 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"] [stack="github.com/milvus-io/milvus/internal/querynode.(*segmentLoader).LoadSegment\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/segment_loader.go:125\ngithub.com/milvus-io/milvus/internal/querynode.(*loadSegmentsTask).Execute\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/load_segment_task.go:81\ngithub.com/milvus-io/milvus/internal/querynode.(*taskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:109\ngithub.com/milvus-io/milvus/internal/querynode.(*taskScheduler).taskLoop\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:131"]
[2023/01/15 23:08:42.417 +00:00] [WARN] [querynode/load_segment_task.go:125] ["failed to load segment"] [collectionID=438744159027462165] [replicaID=438744281082494977] [error="load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 52 MB, concurrency = 1, usedMemAfterLoad = 58715 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.417 +00:00] [WARN] [querynode/task_scheduler.go:111] ["load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 52 MB, concurrency = 1, usedMemAfterLoad = 58715 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.417 +00:00] [INFO] [querynode/load_segment_task.go:40] ["LoadSegmentTask PreExecute start"] [msgID=10922]
[2023/01/15 23:08:42.417 +00:00] [INFO] [querynode/load_segment_task.go:66] ["LoadSegmentTask PreExecute done"] [msgID=10922]
[2023/01/15 23:08:42.417 +00:00] [INFO] [querynode/load_segment_task.go:71] ["LoadSegmentTask Execute start"] [msgID=10922]
[2023/01/15 23:08:42.417 +00:00] [WARN] [querynode/impl.go:498] ["load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 52 MB, concurrency = 1, usedMemAfterLoad = 58715 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.417 +00:00] [INFO] [querynode/segment_loader.go:103] ["segmentLoader start loading..."] [collectionID=438744159027462165] [segmentType=Sealed] [segmentNum=4]
[2023/01/15 23:08:42.417 +00:00] [DEBUG] [querynode/flow_graph_filter_dm_node.go:91] ["Filter invalid message in QueryNode"] [traceID=5e1aa9cfac367d16]
[2023/01/15 23:08:42.417 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:201] ["insertNode operator"] [segmentID=438744159176945085] ["insert size"=13] ["insert offset"=12973] [segmentID=438744159176945085] [collectionID=438744159027462165] [vchannel=by-dev-rootcoord-dml_37_438744159027462165v7]
[2023/01/15 23:08:42.417 +00:00] [DEBUG] [querynode/flow_graph_filter_dm_node.go:91] ["Filter invalid message in QueryNode"] [traceID=5e1aa9cfac367d16]
[2023/01/15 23:08:42.417 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:201] ["insertNode operator"] [segmentID=438744159176945074] ["insert size"=16] ["insert offset"=13030] [segmentID=438744159176945074] [collectionID=438744159027462165] [vchannel=by-dev-rootcoord-dml_34_438744159027462165v4]
[2023/01/15 23:08:42.417 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:397] ["Do insert done"] [collectionID=438744159027462165] [segmentID=438744159176945085] [len=13]
[2023/01/15 23:08:42.417 +00:00] [DEBUG] [querynode/flow_graph_insert_node.go:397] ["Do insert done"] [collectionID=438744159027462165] [segmentID=438744159176945074] [len=16]
[2023/01/15 23:08:42.418 +00:00] [WARN] [querynode/shard_cluster.go:651] ["follower load segment failed"] [collectionID=438744159027462165] [channel=by-dev-rootcoord-dml_45_438744159027462165v15] [replicaID=438744281082494977] [dstNodeID=7] [segmentIDs="[438744159169337483]"] [reason="load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58644 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.418 +00:00] [WARN] [querynode/impl_utils.go:41] ["shard cluster failed to load segments"] [traceID=4e5a2c656f911daa] [shard=by-dev-rootcoord-dml_45_438744159027462165v15] [segmentIDs="[438744159169337483]"] [error="follower 7 failed to load segment, reason load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58644 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.418 +00:00] [INFO] [querynode/segment_loader.go:949] ["predict memory and disk usage while loading (in MiB)"] [collectionID=438744159027462165] [concurrency=4] [memUsage=59440] [memUsageAfterLoad=58817] [diskUsageAfterLoad=0]
[2023/01/15 23:08:42.419 +00:00] [INFO] [querynode/segment_loader.go:949] ["predict memory and disk usage while loading (in MiB)"] [collectionID=438744159027462165] [concurrency=2] [memUsage=59128] [memUsageAfterLoad=58817] [diskUsageAfterLoad=0]
[2023/01/15 23:08:42.420 +00:00] [INFO] [querynode/segment_loader.go:949] ["predict memory and disk usage while loading (in MiB)"] [collectionID=438744159027462165] [concurrency=1] [memUsage=58973] [memUsageAfterLoad=58817] [diskUsageAfterLoad=0]
[2023/01/15 23:08:42.420 +00:00] [ERROR] [querynode/segment_loader.go:125] ["load failed, OOM if loaded"] [collectionID=438744159027462165] [segmentType=Sealed] ["loadSegmentRequest msgID"=10922] [error="load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58817 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"] [stack="github.com/milvus-io/milvus/internal/querynode.(*segmentLoader).LoadSegment\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/segment_loader.go:125\ngithub.com/milvus-io/milvus/internal/querynode.(*loadSegmentsTask).Execute\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/load_segment_task.go:81\ngithub.com/milvus-io/milvus/internal/querynode.(*taskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:109\ngithub.com/milvus-io/milvus/internal/querynode.(*taskScheduler).taskLoop\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:131"]
[2023/01/15 23:08:42.420 +00:00] [WARN] [querynode/load_segment_task.go:125] ["failed to load segment"] [collectionID=438744159027462165] [replicaID=438744281082494977] [error="load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58817 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.420 +00:00] [WARN] [querynode/task_scheduler.go:111] ["load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58817 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.420 +00:00] [WARN] [querynode/impl.go:498] ["load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58817 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.423 +00:00] [INFO] [querynode/impl_utils.go:28] ["LoadSegment start to transfer load with shard cluster"] [traceID=38e3d11b94886abd] [shard=by-dev-rootcoord-dml_52_438744159027462165v22] [segmentIDs="[438744159169337487,438744159171901868]"]
[2023/01/15 23:08:42.423 +00:00] [INFO] [querynode/impl_utils.go:28] ["LoadSegment start to transfer load with shard cluster"] [traceID=5516c7c6158384f4] [shard=by-dev-rootcoord-dml_56_438744159027462165v26] [segmentIDs="[438744159173585304,438744159169337472]"]
[2023/01/15 23:08:42.424 +00:00] [INFO] [querynode/impl_utils.go:28] ["LoadSegment start to transfer load with shard cluster"] [traceID=615cf707af61d80] [shard=by-dev-rootcoord-dml_45_438744159027462165v15] [segmentIDs="[438744159171902064,438744159168051081,438744159173585308,438744159167032303,438744159170549731,438744159166027628]"]
[2023/01/15 23:08:42.424 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159166027628,438744159167032303,438744159168051081,438744159170549731,438744159171902064,438744159173585308]"] [nodeID=6]
[2023/01/15 23:08:42.424 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159166027628,438744159167032303,438744159168051081,438744159170549731,438744159171902064,438744159173585308]"] [timeInQueue=31.841µs]
[2023/01/15 23:08:42.424 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159166027628,438744159167032303,438744159168051081,438744159170549731,438744159171902064,438744159173585308]"] [nodeID=6]
[2023/01/15 23:08:42.424 +00:00] [INFO] [querynode/load_segment_task.go:40] ["LoadSegmentTask PreExecute start"] [msgID=11013]
[2023/01/15 23:08:42.424 +00:00] [INFO] [querynode/load_segment_task.go:66] ["LoadSegmentTask PreExecute done"] [msgID=11013]
[2023/01/15 23:08:42.424 +00:00] [INFO] [querynode/load_segment_task.go:71] ["LoadSegmentTask Execute start"] [msgID=11013]
[2023/01/15 23:08:42.424 +00:00] [INFO] [querynode/impl_utils.go:28] ["LoadSegment start to transfer load with shard cluster"] [traceID=3bd3b0cbefba0fcd] [shard=by-dev-rootcoord-dml_56_438744159027462165v26] [segmentIDs="[438744159168051077,438744159170549858,438744159167032314,438744159171902076]"]
[2023/01/15 23:08:42.424 +00:00] [INFO] [querynode/segment_loader.go:103] ["segmentLoader start loading..."] [collectionID=438744159027462165] [segmentType=Sealed] [segmentNum=6]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159167032314,438744159168051077,438744159170549858,438744159171902076]"] [nodeID=6]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159167032314,438744159168051077,438744159170549858,438744159171902076]"] [timeInQueue=26.181µs]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159167032314,438744159168051077,438744159170549858,438744159171902076]"] [nodeID=6]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159111402585]"] [nodeID=6]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159111402585]"] [timeInQueue=24.721µs]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159111402585]"] [nodeID=6]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl_utils.go:28] ["LoadSegment start to transfer load with shard cluster"] [traceID=736177a458e32205] [shard=by-dev-rootcoord-dml_37_438744159027462165v7] [segmentIDs="[438744159169337474,438744159173585313,438744159168051079]"]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159168051079,438744159169337474,438744159173585313]"] [nodeID=6]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159092122537,438744159169337491,438744159170549732]"] [nodeID=6]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159168051079,438744159169337474,438744159173585313]"] [timeInQueue=25.511µs]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159168051079,438744159169337474,438744159173585313]"] [nodeID=6]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159092122537,438744159169337491,438744159170549732]"] [timeInQueue=22.39µs]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159092122537,438744159169337491,438744159170549732]"] [nodeID=6]
[2023/01/15 23:08:42.425 +00:00] [INFO] [querynode/segment_loader.go:949] ["predict memory and disk usage while loading (in MiB)"] [collectionID=438744159027462165] [concurrency=6] [memUsage=59854] [memUsageAfterLoad=58920] [diskUsageAfterLoad=0]
[2023/01/15 23:08:42.426 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159107185205,438744159169337470,438744159171901857]"] [nodeID=6]
[2023/01/15 23:08:42.426 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159107185205,438744159169337470,438744159171901857]"] [timeInQueue=26.84µs]
[2023/01/15 23:08:42.426 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159107185205,438744159169337470,438744159171901857]"] [nodeID=6]
[2023/01/15 23:08:42.426 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159160605919]"] [nodeID=6]
[2023/01/15 23:08:42.426 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159160605919]"] [timeInQueue=27.14µs]
[2023/01/15 23:08:42.426 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159160605919]"] [nodeID=6]
[2023/01/15 23:08:42.426 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159096540860,438744159167032304,438744159168051117]"] [nodeID=6]
[2023/01/15 23:08:42.426 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159096540860,438744159167032304,438744159168051117]"] [timeInQueue=25.411µs]
[2023/01/15 23:08:42.426 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159096540860,438744159167032304,438744159168051117]"] [nodeID=6]
[2023/01/15 23:08:42.427 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159149962214,438744159166027630,438744159168051074]"] [nodeID=6]
[2023/01/15 23:08:42.427 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159149962214,438744159166027630,438744159168051074]"] [timeInQueue=28.271µs]
[2023/01/15 23:08:42.427 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159149962214,438744159166027630,438744159168051074]"] [nodeID=6]
[2023/01/15 23:08:42.427 +00:00] [INFO] [querynode/segment_loader.go:949] ["predict memory and disk usage while loading (in MiB)"] [collectionID=438744159027462165] [concurrency=3] [memUsage=59387] [memUsageAfterLoad=58920] [diskUsageAfterLoad=0]
[2023/01/15 23:08:42.427 +00:00] [WARN] [querynode/shard_cluster.go:651] ["follower load segment failed"] [collectionID=438744159027462165] [channel=by-dev-rootcoord-dml_52_438744159027462165v22] [replicaID=438744281082494977] [dstNodeID=5] [segmentIDs="[438744159169337487,438744159171901868]"] [reason="load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58646 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.427 +00:00] [WARN] [querynode/impl_utils.go:41] ["shard cluster failed to load segments"] [traceID=38e3d11b94886abd] [shard=by-dev-rootcoord-dml_52_438744159027462165v22] [segmentIDs="[438744159169337487,438744159171901868]"] [error="follower 5 failed to load segment, reason load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58646 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"]
[2023/01/15 23:08:42.427 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159167032317,438744159169337486,438744159171901858]"] [nodeID=6]
[2023/01/15 23:08:42.427 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159167032317,438744159169337486,438744159171901858]"] [timeInQueue=30.921µs]
[2023/01/15 23:08:42.427 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159167032317,438744159169337486,438744159171901858]"] [nodeID=6]
[2023/01/15 23:08:42.427 +00:00] [INFO] [querynode/impl.go:471] ["loadSegmentsTask init"] [collectionID=438744159027462165] [segmentIDs="[438744159078867149,438744159168051080,438744159169337481,438744159170549864]"] [nodeID=6]
[2023/01/15 23:08:42.428 +00:00] [INFO] [querynode/impl.go:476] ["loadSegmentsTask start "] [collectionID=438744159027462165] [segmentIDs="[438744159078867149,438744159168051080,438744159169337481,438744159170549864]"] [timeInQueue=35.841µs]
[2023/01/15 23:08:42.428 +00:00] [INFO] [querynode/impl.go:489] ["loadSegmentsTask Enqueue done"] [collectionID=438744159027462165] [segmentIDs="[438744159078867149,438744159168051080,438744159169337481,438744159170549864]"] [nodeID=6]
[2023/01/15 23:08:42.428 +00:00] [INFO] [querynode/segment_loader.go:949] ["predict memory and disk usage while loading (in MiB)"] [collectionID=438744159027462165] [concurrency=1] [memUsage=59075] [memUsageAfterLoad=58920] [diskUsageAfterLoad=0]
[2023/01/15 23:08:42.428 +00:00] [ERROR] [querynode/segment_loader.go:125] ["load failed, OOM if loaded"] [collectionID=438744159027462165] [segmentType=Sealed] ["loadSegmentRequest msgID"=11013] [error="load segment failed, OOM if load, collectionID = 438744159027462165, maxSegmentSize = 51 MB, concurrency = 1, usedMemAfterLoad = 58920 MB, totalMem = 63725 MB, thresholdFactor = 0.900000"] [stack="github.com/milvus-io/milvus/internal/querynode.(*segmentLoader).LoadSegment\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/segment_loader.go:125\ngithub.com/milvus-io/milvus/internal/querynode.(*loadSegmentsTask).Execute\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/load_segment_task.go:81\ngithub.com/milvus-io/milvus/internal/querynode.(*taskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:109\ngithub.com/milvus-io/milvus/internal/querynode.(*taskScheduler).taskLoop\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:131"]
Anything else?
Run out of memory:

@ThomasAlxDmy LRU is not supported for query/search operations for now. The cacheSize setting is for retrieving data in query requests, and enableDisk is for the DiskANN index. In short, a collection needs enough memory before any search/query operation can run on it.
/assign @ThomasAlxDmy
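For readers following the log above, here is a minimal Python sketch of the check the query node appears to be making. It is inferred only from the fields printed in the log lines (`usedMemAfterLoad`, `totalMem`, `thresholdFactor`, `maxSegmentSize`, `concurrency`) and from `loadMemoryUsageFactor: 3` in the config; the function names are hypothetical and this is not the actual Milvus implementation. With `totalMem = 63725 MB` and `thresholdFactor = 0.9`, the ceiling is about 57352 MB, while the node already reports 58644 MB or more, so every load is refused regardless of `cacheSize`.

```python
# Hypothetical helper names; the formula is an assumption inferred from the log output.

def load_threshold_mb(total_mem_mb: float, threshold_factor: float) -> float:
    """Memory ceiling (MB) above which a segment load is refused."""
    return total_mem_mb * threshold_factor


def predicted_peak_mb(used_after_load_mb: float, max_segment_mb: float,
                      concurrency: int, load_factor: float) -> float:
    """Estimated peak usage while loading: resident data plus temporary load overhead."""
    return used_after_load_mb + max_segment_mb * concurrency * load_factor


def would_refuse_load(used_after_load_mb: float, max_segment_mb: float,
                      concurrency: int, load_factor: float,
                      total_mem_mb: float, threshold_factor: float) -> bool:
    """True if the estimated peak exceeds the ceiling (the 'OOM if load' case)."""
    peak = predicted_peak_mb(used_after_load_mb, max_segment_mb, concurrency, load_factor)
    return peak > load_threshold_mb(total_mem_mb, threshold_factor)


if __name__ == "__main__":
    # Values copied from one of the WARN lines in the log above.
    print(load_threshold_mb(63725, 0.9))
    print(would_refuse_load(58644, 51, 1, 3, 63725, 0.9))
```

The logged `memUsage` values look consistent with this estimate: for example, 58715 + 1 × 52 × 3 = 58871, against a logged `memUsage` of 58872 at `concurrency=1` (the small difference is presumably rounding of `maxSegmentSize`).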
Got it, thank you.
Do you have an idea of when it will be supported? My collection will grow by about 60M vectors per day, so I'm going to hit the physical RAM limit. I don't need to keep the full dataset for now; any idea how I can dynamically reduce the size of the collection (expire older items)?
Also, I have partitioned my collection. Is it possible to manually release the older partitions? I upgraded the machine to one with more memory, and it takes 3+ hours to reload the collection. How can we reduce that time? Is it possible to load the most recent partition first and make the collection available for queries?
- If the data has a fixed lifetime, I think you can try Time to Live (TTL) at the collection level to expire the data automatically. Check the details in the release notes: https://milvus.io/docs/release_notes.md#v2.2.0.
- Using partitions is another option: you can have a few partitions and load/release them individually to save memory. But there are a few known issues/limitations if you have many partitions, so please keep the number of partitions below 20. BTW, we are trying to improve partitions in Milvus 2.3, which will be available in H1 2023.
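As a rough illustration of the partition suggestion above, the sketch below keeps only the most recent partitions loaded and releases the rest. The partition names, the `keep_latest` parameter, and both function names are hypothetical. The second function assumes pymilvus is installed, that a connection has already been opened with `connections.connect(...)`, and that the server version allows releasing individual partitions in the current load state; the thread notes limitations here in 2.2, so treat it as a sketch rather than a verified recipe.

```python
from typing import List, Tuple


def split_partitions(names_oldest_first: List[str],
                     keep_latest: int) -> Tuple[List[str], List[str]]:
    """Split partition names into (keep, release), keeping the newest `keep_latest`."""
    if keep_latest < 0:
        raise ValueError("keep_latest must be non-negative")
    cut = max(len(names_oldest_first) - keep_latest, 0)
    return names_oldest_first[cut:], names_oldest_first[:cut]


def release_old_partitions(collection_name: str,
                           names_oldest_first: List[str],
                           keep_latest: int) -> List[str]:
    """Release every partition except the newest `keep_latest`; return the kept names."""
    # Imported lazily so split_partitions can be used without pymilvus installed.
    from pymilvus import Collection

    # Assumption: connections.connect(...) was called earlier by the caller.
    collection = Collection(collection_name)
    keep, release = split_partitions(names_oldest_first, keep_latest)
    for name in release:
        collection.partition(name).release()
    return keep


if __name__ == "__main__":
    # Hypothetical daily partitions, oldest first.
    days = ["day_2023_01_12", "day_2023_01_13", "day_2023_01_14", "day_2023_01_15"]
    keep, release = split_partitions(days, keep_latest=2)
    print("keep:", keep)
    print("release:", release)
```

Keeping the selection logic separate from the Milvus calls makes it easy to check which partitions would be released before touching the cluster.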
@yanliang567 I tried 1), but it doesn't seem to work: https://github.com/milvus-io/milvus/issues/21802
Yeah, I tried with 200 partitions and we ran into problems when trying to load a lot of them: etcd denies it. It also seems you can't load a partition if some part of the collection is already loaded. It would be nice if that were independent: you should be able to load the first partition and then incrementally load older ones without having to unload first.
- I commented in #21802; I think you need to upgrade pymilvus.
- Yes, that reminds me that this is a limitation of loading/releasing partitions, and the community is working on improving it in Milvus 2.3, as you suggested above.
The partition issue will be fixed in the 2.3 release. We will support loading a partition while the collection is loaded.
@xiaofan-luan Can you confirm that the fix for this is going to be in 2.2.4? Can you link it to an issue/PR for reference?
This will be in 2.3; related PR: https://github.com/milvus-io/milvus/pull/22655
@ThomasAlxDmy Feel free to comment on it.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.