
[Bug]: Can only use one data node, index node

Open didalaviva opened this issue 11 months ago • 5 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.3.6
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    kafka
- SDK version(e.g. pymilvus v2.0.0rc2): v2.3.7
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 4C16G
- GPU: None
- Others: None

Current Behavior

When I run one datanode + indexnode + querynode in a single container, it works well. But when I scale out to two instances, it stops working.

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

The container A [2024/03/18 11:44:12.878 +08:00] [INFO] [segments/segment_loader.go:584] ["load fields..."] [traceID=ad5cd64c8b92161c262edb4d39ec3779] [collectionID=448141181348513817] [partitionID=448141181348513818] [shard=milvus_t0-rootcoord-dml_8_448141181348513817v0] [segmentID=448141181348714298] [indexedFields="[104]"] [2024/03/18 11:44:12.878 +08:00] [WARN] [segments/cgo_util.go:58] ["CStatus returns err"] [errorName=UnknownError] [errorMsg="invalid local path:/home/service/var/data/milvus/data/index_files/448141181349917446/7/448141181348513818/448141181348714298/SLICE_META"] [2024/03/18 11:44:12.883 +08:00] [INFO] [gc/gc_tuner.go:90] ["GC Tune done"] ["previous GOGC"=200] ["heapuse "=28] ["total memory"=72] ["next GC"=61] ["new GOGC"=200] [gc-pause=25.762µs] [gc-pause-end=1710733452881997060] [2024/03/18 11:44:12.889 +08:00] [WARN] [kafka/kafka_consumer.go:138] ["consume msg failed"] [topic=milvus_t0-rootcoord-dml_5] [groupID=datanode-38-milvus_t0-rootcoord-dml_5_448141181342745720v0-true] [error="Local: Timed out"] [2024/03/18 11:44:12.891 +08:00] [WARN] [segments/segment_loader.go:261] ["load segment failed when load data into memory"] [traceID=ad5cd64c8b92161c262edb4d39ec3779] [collectionID=448141181348513817] [segmentType=Sealed] [partitionID=448141181348513818] [segmentID=448141181348714298] [error="invalid local path:/home/service/var/data/milvus/data/index_files/448141181349917446/7/448141181348513818/448141181348714298/SLICE_META: service internal error: UnknownError"] [errorVerbose="invalid local path:/home/service/var/data/milvus/data/index_files/448141181349917446/7/448141181348513818/448141181348714298/SLICE_META: service internal error: UnknownError\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/pkg/util/merr.WrapErrServiceInternal\n | \t/var/lib/docker/columbus/repo/pkg_repo/2024011912/138704/37407404/1705639174568/src_repo/recall-milvus/pkg/util/merr/utils.go:343\n | github.com/milvus-io/milvus/internal/querynodev2/segments.HandleCStatus\n | \t/var/lib/docker/columbus/repo/pkg_repo/2024011912/138704/37407404/1705639174568/src_repo/recall-milvus/internal/querynodev2/segments/cgo_util.go:59\n | github.com/milvus-io/milvus/internal/querynodev2/segments.(*LoadIndexInfo).appendIndexData\n | \t/var/lib/docker/columbus/repo/pkg_repo/2024011912/138704/37407404/1705639174568/src_repo/recall-milvus/internal/querynodev2/segments/load_index_info.go:159\n | github.com/milvus-io/milvus/internal/querynodev2/segments.(*LoadIndexInfo).appendLoadIndexInfo\n | \t/var/lib/docker/columbus/repo/pkg_repo/2024011912/138704/37407404/1705639174568/src_repo/recall-milvus/internal/querynodev2/segments/load_index_info.go:100\n | github.com/milvus-io/milvus/internal/querynodev2/segments.(*LocalSegment).LoadIndex\n | \t/var/lib/docker/columbus/repo/pkg_repo/2024011912/138704/37407404/1705639174568/src_repo/recall-milvus/internal/querynodev2/segments/segment.go:855\n | github.com/milvus-io/milvus/internal/querynodev2/segments.(*segmentLoader).loadFieldIndex\n | \t/var/lib/docker/columbus/repo/pkg_repo/2024011912/138704/37407404/1705639174568/src_repo/recall-milvus/internal/querynodev2/segments/segment_loader.go:725\n | github.com/milvus-io/milvus/internal/querynodev2/segments.(*segmentLoader).loadFieldsIndex\n | \t/var/lib/docker/columbus/repo/pkg_repo/2024011912/138704/37407404/1705639174568/src_repo/recall-milvus/internal/querynodev2/segments/segment_loader.go:678\n | github.com/milvus-io/milvus/internal/querynodev2/segments.(*segmentLoader).loadSegment\n | 
\t/var/lib/docker/columbus/repo/pkg_repo/2024011912/138704/37407404/1705639174568/src_repo/recall-milvus/internal/querynodev2/segments/segment_loader.go:587\n | github.com/milvus-io/milvus/internal/querynodev2/segments.(*segmentLoader).Load.func4\n | \t/var/lib/docker/columbus/repo/pkg_repo/2024011912/138704/37407404/1705639174568/src_repo/recall-milvus/internal/querynodev2/segments/segment_loader.go:259\n | github.com/milvus-io/milvus/pkg/util/funcutil.ProcessFuncParallel.func3\n | \t/var/lib/docker/columbus/repo/pkg_repo/2024011912/138704/37407404/1705639174568/src_repo/recall-milvus/pkg/util/funcutil/parallel.go:86\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1650\nWraps: (2) invalid local path:/home/service/var/data/milvus/data/index_files/448141181349917446/7/448141181348513818/448141181348714298/SLICE_META\nWraps: (3) service internal error: UnknownError\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) merr.milvusError"]

The container B [2024/03/18 11:55:16.852 +08:00] [WARN] [cluster/worker.go:75] ["failed to call LoadSegments, worker return error"] [traceID=71903ba5ba8d7ad4a9073048197f0496] [workerID=38] [errorCode=UnexpectedError] [reason="invalid local path:/home/service/var/data/milvus/data/index_files/448141181349917817/1/448141181348513818/448141181348514393/SLICE_META: service internal error: UnknownError"] [2024/03/18 11:55:16.852 +08:00] [WARN] [delegator/delegator_data.go:395] ["worker failed to load segments"] [traceID=71903ba5ba8d7ad4a9073048197f0496] [collectionID=448141181348513817] [channel=milvus_t0-rootcoord-dml_8_448141181348513817v0] [replicaID=448141181500522500] [workID=38] [segments="[448141181348514393]"] [error="invalid local path:/home/service/var/data/milvus/data/index_files/448141181349917817/1/448141181348513818/448141181348514393/SLICE_META: service internal error: UnknownError"] [2024/03/18 11:55:16.852 +08:00] [WARN] [querynodev2/services.go:454] ["delegator failed to load segments"] [traceID=71903ba5ba8d7ad4a9073048197f0496] [collectionID=448141181348513817] [partitionID=448141181348513818] [shard=milvus_t0-rootcoord-dml_8_448141181348513817v0] [segmentID=448141181348514393] [currentNodeID=39] [error="invalid local path:/home/service/var/data/milvus/data/index_files/448141181349917817/1/448141181348513818/448141181348514393/SLICE_META: service internal error: UnknownError"] [2024/03/18 11:55:17.787 +08:00] [INFO] [querynodev2/services.go:420] ["received load segments request"] [traceID=4e91aa4208c61f437d96f5d4541b3030] [collectionID=448141181348513817] [partitionID=448141181348513818] [shard=milvus_t0-rootcoord-dml_8_448141181348513817v0] [segmentID=448141181348514393] [currentNodeID=39] [version=1710734117786478706] [needTransfer=true] [2024/03/18 11:55:17.787 +08:00] [INFO] [segments/segment_loader.go:496] ["start loading remote..."] [traceID=4e91aa4208c61f437d96f5d4541b3030] [collectionID=448141181348513817] [segmentIDs="[448141181348514393]"] [segmentNum=1] [2024/03/18 11:55:17.787 +08:00] [INFO] [segments/segment_loader.go:506] ["loading bloom filter for remote..."] [traceID=4e91aa4208c61f437d96f5d4541b3030] [collectionID=448141181348513817] [segmentIDs="[448141181348514393]"] [2024/03/18 11:55:17.819 +08:00] [INFO] [segments/segment_loader.go:774] ["Successfully load pk stats"] [traceID=4e91aa4208c61f437d96f5d4541b3030] [segmentID=448141181348514393] [time=32.175293ms] [size=6779191] [2024/03/18 11:55:17.861 +08:00] [WARN] [cluster/worker.go:75] ["failed to call LoadSegments, worker return error"] [traceID=4e91aa4208c61f437d96f5d4541b3030] [workerID=38] [errorCode=UnexpectedError] [reason="invalid local path:/home/service/var/data/milvus/data/index_files/448141181349917817/1/448141181348513818/448141181348514393/SLICE_META: service internal error: UnknownError"] [2024/03/18 11:55:17.861 +08:00] [WARN] [delegator/delegator_data.go:395] ["worker failed to load segments"] [traceID=4e91aa4208c61f437d96f5d4541b3030] [collectionID=448141181348513817] [channel=milvus_t0-rootcoord-dml_8_448141181348513817v0] [replicaID=448141181500522500] [workID=38] [segments="[448141181348514393]"] [error="invalid local path:/home/service/var/data/milvus/data/index_files/448141181349917817/1/448141181348513818/448141181348514393/SLICE_META: service internal error: UnknownError"] [2024/03/18 11:55:17.861 +08:00] [WARN] [querynodev2/services.go:454] ["delegator failed to load segments"] [traceID=4e91aa4208c61f437d96f5d4541b3030] [collectionID=448141181348513817] 
[partitionID=448141181348513818] [shard=milvus_t0-rootcoord-dml_8_448141181348513817v0] [segmentID=448141181348514393] [currentNodeID=39] [error="invalid local path:/home/service/var/data/milvus/data/index_files/448141181349917817/1/448141181348513818/448141181348514393/SLICE_META: service internal error: UnknownError"]

The coord [2024/03/18 11:48:12.915 +08:00] [WARN] [task/executor.go:238] ["failed to load segment"] [taskID=1709522940657] [collectionID=448141181348513817] [replicaID=448141181500522500] [segmentID=448141181349315666] [node=38] [source=segment_checker] [shardLeader=39] [error="invalid local path:/home/service/var/data/milvus/data/index_files/448141181349917451/4/448141181348513818/448141181349315666/SLICE_META: service internalerror: UnknownError"] [2024/03/18 11:48:12.915 +08:00] [INFO] [task/executor.go:119] ["execute action done, remove it"] [taskID=1709522940657] [step=0] [error="invalid local path:/home/service/var/data/milvus/data/index_files/448141181349917451/4/448141181348513818/448141181349315666/SLICE_META: service internal error: UnknownError"][2024/03/18 11:48:13.051 +08:00] [WARN] [task/scheduler.go:727] ["task scheduler recordSegmentTaskError"] [taskID=1709522940655] [collectionID=448141181348513817] [replicaID=448141181500522500] [segmentID=448141181348714298] [status=failed] [error="invalid local path:/home/service/var/data/milvus/data/index_files/448141181349917446/7/448141181348513818/448141181348714298/SLICE_META: service internal error: UnknownError"] [2024/03/18 11:48:13.051 +08:00] [WARN] [meta/failed_load_cache.go:97] ["FailedLoadCache put failed record"] [collectionID=448141181348513817] [error="invalid local path:/home/service/var/data/milvus/data/index_files/448141181349917446/7/448141181348513818/448141181348714298/SLICE_META: service internal error: UnknownError"] [2024/03/18 11:48:13.051 +08:00] [INFO] [task/scheduler.go:768] ["task removed"] [taskID=1709522940655] [collectionID=448141181348513817] [replicaID=448141181500522500] [status=failed] [segmentID=448141181348714298][2024/03/18 11:48:13.051 +08:00] [WARN] [task/scheduler.go:727] ["task scheduler recordSegmentTaskError"] [taskID=1709522940656] [collectionID=448141181348513817] [replicaID=448141181500522500] [segmentID=448141181348914752] [status=failed] [error="invalid local path:/home/service/var/data/milvus/data/index_files/448141181349917450/11/448141181348513818/448141181348914752/SLICE_META: service internal error: UnknownError"] [2024/03/18 11:48:13.051 +08:00] [WARN] [meta/failed_load_cache.go:97] ["FailedLoadCache put failed record"] [collectionID=448141181348513817] [error="invalid local path:/home/service/var/data/milvus/data/index_files/448141181349917450/11/448141181348513818/448141181348914752/SLICE_META: service internal error: UnknownError"] [2024/03/18 11:48:13.051 +08:00] [INFO] [task/scheduler.go:768] ["task removed"] [taskID=1709522940656] [collectionID=448141181348513817] [replicaID=448141181500522500] [status=failed] [segmentID=448141181348914752]

Anything else?

No response

didalaviva avatar Mar 18 '24 05:03 didalaviva

  1. If you have one collection and one shard, then yes, you can only use one datanode (see the sketch below).
  2. For indexing, it depends on how many segments you have; each indexnode takes only one index job at a time.

For smaller deployments, scaling up may be more efficient than scaling out.
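For illustration, a minimal pymilvus sketch of point 1, assuming pymilvus ~2.3 and a Milvus instance reachable at localhost:19530; the collection name and schema here are made up. It creates a collection with more than one shard so that writes can be spread across several datanodes:

```python
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

# Assumed endpoint; adjust to your deployment.
connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="vec", dtype=DataType.FLOAT_VECTOR, dim=128),
]
schema = CollectionSchema(fields, description="example collection")

# num_shards controls how many DML channels the collection gets, and therefore
# how many datanodes can share its write load (older pymilvus releases use the
# shards_num keyword instead).
collection = Collection(name="example_collection", schema=schema, num_shards=2)
```

With only one shard, all inserts for a collection flow through a single DML channel, so adding datanodes does not spread that collection's write load.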

xiaofan-luan avatar Mar 18 '24 05:03 xiaofan-luan

@didalaviva how did you deploy the Milvus cluster? It sounds like you are not using the official deployment: https://milvus.io/docs/install_cluster-milvusoperator.md

/assign @didalaviva /unassign

yanliang567 avatar Mar 18 '24 12:03 yanliang567

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Apr 20 '24 06:04 stale[bot]

I have this issue too; as the data volume increases, it happens.

yesyue avatar May 02 '24 02:05 yesyue

please attach the full milvus logs for investigation, thx

yanliang567 avatar May 04 '24 01:05 yanliang567

Each shard is assigned to only one datanode; this is by design.

In most cases, one datanode with more CPU cores is good enough for your use case.

For indexing, each task takes one node.
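As an illustration of how work is distributed, here is a small sketch using pymilvus (~2.3) utility calls; the collection name is hypothetical, it assumes an index has already been created, and the exact result fields can differ between versions:

```python
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")  # assumed endpoint

# Index build progress for the collection: indexnodes pick up one build task
# each, so more segments means more tasks that can run in parallel.
print(utility.index_building_progress("example_collection"))

# Which query node serves each sealed segment after load; field names such as
# nodeIds come from the QuerySegmentInfo proto and may vary by version.
for seg in utility.get_query_segment_info("example_collection"):
    print(seg.segmentID, seg.nodeIds, seg.num_rows)
```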

xiaofan-luan avatar May 05 '24 14:05 xiaofan-luan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Jun 05 '24 01:06 stale[bot]