
[Bug]: Querynode terminated with log: failed to Deserialize index, cardinal inner error

Open ThreadDao opened this issue 11 months ago • 17 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: cardinal-milvus-io-2.3-ef086dc-20240222
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. Collection laion_stable_4 has 58m+ 768-dim data, and the schema is:
{'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}, {'name': 'int64_pk_5b', 'description': '', 'type': <DataType.INT64: 5>, 'is_partition_key': True}, {'name': 'varchar_caption', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}, {'name': 'varchar_NSFW', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}, {'name': 'float64_similarity', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'int64_width', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_height', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_original_width', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_original_height', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'varchar_md5', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}], 'enable_dynamic_field': True}
  2. Reload the collection (64 segments) -> concurrent requests: insert + delete + search + query

  3. One of the 4 querynodes terminated with exit code 134 and the following error logs (since cardinal is private, please get in touch with me for more detailed querynode termination logs):

E20240226 16:13:52.003870   607 FileIo.cpp:25] [CARDINAL][FileReader][milvus] Failed to open file : /var/lib/milvus/data/querynode/index_files/447990444064979058/1/_mem.index.bin
E20240226 16:13:52.005385   607 cardinal.cc:368] [KNOWHERE][Deserialize][milvus] Cardinal Inner Exception: std::exception
I20240226 16:13:52.005625   607 time_recorder.cc:49] [KNOWHERE][PrintTimeRecord][milvus] Load index: done (2.135270 ms)
 => failed to Deserialize index, cardinal inner error
non-Go function
    pc=0x7f58fbc2003b
non-Go function
    pc=0x7f58fbbff858
non-Go function
    pc=0x7f58fba998d0
non-Go function
    pc=0x7f58fbaa537b
non-Go function
    pc=0x7f58fbaa4358
non-Go function
    pc=0x7f58fbaa4d10
non-Go function
    pc=0x7f58fbde1bfe
runtime.cgocall(0x4749090, 0xc001774cd0)
    /usr/local/go/src/runtime/cgocall.go:157 +0x5c fp=0xc001774ca8 sp=0xc001774c70 pc=0x1a627bc
github.com/milvus-io/milvus/internal/querynodev2/segments._Cfunc_DeleteSegment(0x7f58f68d1700)
    _cgo_gotypes.go:475 +0x45 fp=0xc001774cd0 sp=0xc001774ca8 pc=0x4522a45
github.com/milvus-io/milvus/internal/querynodev2/segments.(*LocalSegment).Release.func1(0xc00175b2d8?)
    /go/src/github.com/milvus-io/milvus/internal/querynodev2/segments/segment.go:1037 +0x3a fp=0xc001774d08 sp=0xc001774cd0 pc=0x453d07a
github.com/milvus-io/milvus/internal/querynodev2/segments.(*LocalSegment).Release(0xc00175b290)
    /go/src/github.com/milvus-io/milvus/internal/querynodev2/segments/segment.go:1037 +0xa6 fp=0xc001774f48 sp=0xc001774d08 pc=0x453c826
github.com/milvus-io/milvus/internal/querynodev2/segments.remove({0x5744620, 0xc00175b290})
    /go/src/github.com/milvus-io/milvus/internal/querynodev2/segments/manager.go:542 +0x42 fp=0xc001775010 sp=0xc001774f48 pc=0x452e5c2
github.com/milvus-io/milvus/internal/querynodev2/segments.(*segmentManager).Remove(0xc001620a80, 0x5158d57?, 0x3)
    /go/src/github.com/milvus-io/milvus/internal/querynodev2/segments/manager.go:447 +0x2d5 fp=0xc0017750b0 sp=0xc001775010 pc=0x452d5b5

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

laion1b-test-2-etcd-0                                             1/1     Running             1 (46d ago)     74d     10.104.25.31    4am-node30   <none>           <none>
laion1b-test-2-etcd-1                                             1/1     Running             0               74d     10.104.30.94    4am-node38   <none>           <none>
laion1b-test-2-etcd-2                                             1/1     Running             0               74d     10.104.34.225   4am-node37   <none>           <none>
laion1b-test-2-milvus-datanode-7b7f99b8d4-g8v8q                   1/1     Running             0               20h     10.104.16.187   4am-node21   <none>           <none>
laion1b-test-2-milvus-datanode-7b7f99b8d4-t7lfp                   1/1     Running             0               20h     10.104.30.131   4am-node38   <none>           <none>
laion1b-test-2-milvus-indexnode-c8c8f4584-2kbqd                   1/1     Running             0               15h     10.104.14.112   4am-node18   <none>           <none>
laion1b-test-2-milvus-indexnode-c8c8f4584-d6q6m                   1/1     Running             0               15h     10.104.9.46     4am-node14   <none>           <none>
laion1b-test-2-milvus-indexnode-c8c8f4584-lg4k7                   1/1     Running             0               15h     10.104.34.47    4am-node37   <none>           <none>
laion1b-test-2-milvus-indexnode-c8c8f4584-q9hcx                   1/1     Running             0               15h     10.104.17.50    4am-node23   <none>           <none>
laion1b-test-2-milvus-indexnode-c8c8f4584-vtvx5                   1/1     Running             0               15h     10.104.29.240   4am-node35   <none>           <none>
laion1b-test-2-milvus-mixcoord-74b896d49d-ljz4l                   1/1     Running             0               20h     10.104.18.222   4am-node25   <none>           <none>
laion1b-test-2-milvus-proxy-5cdb5b7d6b-w5h29                      1/1     Running             0               20h     10.104.19.6     4am-node28   <none>           <none>
laion1b-test-2-milvus-querynode-0-7977c8fdbf-8pfz2                1/1     Running             0               15h     10.104.17.49    4am-node23   <none>           <none>
laion1b-test-2-milvus-querynode-0-7977c8fdbf-cb9mn                1/1     Running             0               15h     10.104.28.101   4am-node33   <none>           <none>
laion1b-test-2-milvus-querynode-0-7977c8fdbf-dd9hn                1/1     Running             0               15h     10.104.33.74    4am-node36   <none>           <none>
laion1b-test-2-milvus-querynode-0-7977c8fdbf-zfhhz                1/1     Running             1 (15h ago)     15h     10.104.32.10    4am-node39   <none>           <none>
laion1b-test-2-pulsar-bookie-0                                    1/1     Running             0               74d     10.104.33.107   4am-node36   <none>           <none>
laion1b-test-2-pulsar-bookie-1                                    1/1     Running             0               50d     10.104.18.240   4am-node25   <none>           <none>
laion1b-test-2-pulsar-bookie-2                                    1/1     Running             0               74d     10.104.25.32    4am-node30   <none>           <none>
laion1b-test-2-pulsar-broker-0                                    1/1     Running             0               68d     10.104.1.69     4am-node10   <none>           <none>
laion1b-test-2-pulsar-proxy-0                                     1/1     Running             0               74d     10.104.4.218    4am-node11   <none>           <none>
laion1b-test-2-pulsar-recovery-0                                  1/1     Running             0               74d     10.104.14.151   4am-node18   <none>           <none>
laion1b-test-2-pulsar-zookeeper-0                                 1/1     Running             0               74d     10.104.29.87    4am-node35   <none>           <none>
laion1b-test-2-pulsar-zookeeper-1                                 1/1     Running             0               74d     10.104.21.124   4am-node24   <none>           <none>
laion1b-test-2-pulsar-zookeeper-2                                 1/1     Running             0               74d     10.104.34.229   4am-node37   <none>           <none>
  • core dump file: /tmp/cores/core-laion1b-test-2-milvus-querynode-0-7977c8fdbf-zfhhz-milvus-8-1708964037 of 4am-node39

Anything else?

No response

ThreadDao avatar Feb 27 '24 08:02 ThreadDao

Perhaps it is because the dataCoord.channel.watchTimeoutInterval configuration was modified and Milvus was restarted. I mean, when the QN restarted, it looks like the tests had not started yet.

ThreadDao avatar Feb 27 '24 08:02 ThreadDao

/assign @liliu-z /unassign

yanliang567 avatar Feb 28 '24 01:02 yanliang567

/assign @foxspy /unassign @liliu-z

foxspy avatar Feb 28 '24 03:02 foxspy

The root cause seems to be a concurrency bug between release and load. The querynode releases a segment while the index engine is concurrently loading its index from file. The index engine throwing an exception is expected, since the file does not exist, and that alone would not cause the querynode to coredump; the actual cause of the coredump is the release operation.

foxspy avatar Feb 28 '24 03:02 foxspy

/assign @yanliang567 /unassign

foxspy avatar Feb 28 '24 03:02 foxspy

/assign @chyezh

chyezh avatar Feb 28 '24 06:02 chyezh

Index 447990444064979058 belongs to Segment 447990444064723266.

Node `` started to release the segment while a new load request was incoming.


[2024/02/26 16:13:34.948 +00:00] [INFO] [querynodev2/services.go:595] ["start to release segments"] [traceID=06cd8e504609e669a530c57535e33631] [collectionID=447902879639453431] [shard=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [segmentIDs="[447990444064723266]"] [currentNodeID=1681]

[2024/02/26 16:13:38.996 +00:00] [INFO] [querynodev2/services.go:433] ["received load segments request"] [traceID=7c9ecfdda5b9a070dc760feb81f2bf64] [collectionID=447902879639453431] [partitionID=447902879639453437] [shard=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [segmentID=447990444064723266] [currentNodeID=1681] [version=1708964018906870595] [needTransfer=false] [loadScope=Full]

Duplicate segment loads are checked via the SegmentManager:

...
		if len(loader.manager.Segment.GetBy(WithType(segmentType), WithID(segment.GetSegmentID()))) == 0 &&
			!loader.loadingSegments.Contain(segment.GetSegmentID()) {
...

Releasing a segment removes it from the SegmentManager first, and only then releases the memory:

	case querypb.DataScope_Historical:
		sealed = mgr.removeSegmentWithType(SegmentTypeSealed, segmentID)
		if sealed != nil {
			removeSealed = 1
		}

	mgr.updateMetric()
	mgr.mu.Unlock()

	if sealed != nil {
		remove(sealed)
	}

Concurrent load and release of the same segment can therefore happen.
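
To make the interleaving concrete, here is a minimal, hypothetical Go sketch of the window described above (simplified stand-in types, not the actual querynode code): release unregisters the segment from the manager first, which lets a concurrent load pass the duplicate check and start reading index files, and only afterwards frees the underlying resources.

package main

import (
	"fmt"
	"sync"
	"time"
)

// manager is a toy stand-in for the querynode SegmentManager.
type manager struct {
	mu       sync.Mutex
	segments map[int64]bool // segmentID -> registered
}

func (m *manager) remove(id int64) {
	m.mu.Lock()
	delete(m.segments, id) // step 1: unregister; the loader's duplicate check now passes
	m.mu.Unlock()
}

func (m *manager) registered(id int64) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.segments[id]
}

func main() {
	id := int64(447990444064723266)
	m := &manager{segments: map[int64]bool{id: true}}

	var wg sync.WaitGroup
	wg.Add(2)

	// Release path: unregister, then (later) free memory and delete index files.
	go func() {
		defer wg.Done()
		m.remove(id)
		time.Sleep(10 * time.Millisecond) // window between step 1 and step 2
		fmt.Println("release: freeing segment and deleting index files")
	}()

	// Load path: the duplicate check no longer sees the segment, so it reloads
	// and reads index files that the release path is about to delete.
	go func() {
		defer wg.Done()
		time.Sleep(5 * time.Millisecond)
		if !m.registered(id) {
			fmt.Println("load: segment not in manager, loading index from file")
		}
	}()

	wg.Wait()
}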

chyezh avatar Feb 28 '24 07:02 chyezh

Short-term fix: implement mutual exclusion between Release and Load on the QN. Long term, it is necessary to implement segment lifecycle controls (Loading, Loaded, Released states) on QueryCoord.
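
As a rough illustration of the short-term fix (names are illustrative, not the actual Milvus implementation), Load and Release could serialize on a per-segment lock so that the two-step release can no longer interleave with a reload:

package main

import "sync"

// segmentGuard serializes Load and Release for the same segment ID.
type segmentGuard struct {
	mu    sync.Mutex
	locks map[int64]*sync.Mutex // segmentID -> per-segment lock
}

func newSegmentGuard() *segmentGuard {
	return &segmentGuard{locks: make(map[int64]*sync.Mutex)}
}

func (g *segmentGuard) lockFor(id int64) *sync.Mutex {
	g.mu.Lock()
	defer g.mu.Unlock()
	if _, ok := g.locks[id]; !ok {
		g.locks[id] = &sync.Mutex{}
	}
	return g.locks[id]
}

// Load and Release take the same per-segment lock, so the two-step release
// (unregister + free) can no longer overlap with a reload of that segment.
func (g *segmentGuard) Load(id int64, load func()) {
	l := g.lockFor(id)
	l.Lock()
	defer l.Unlock()
	load()
}

func (g *segmentGuard) Release(id int64, release func()) {
	l := g.lockFor(id)
	l.Lock()
	defer l.Unlock()
	release()
}

func main() {
	g := newSegmentGuard()
	done := make(chan struct{})
	go func() {
		g.Release(447990444064723266, func() { /* unregister from manager, then free segment */ })
		close(done)
	}()
	g.Load(447990444064723266, func() { /* duplicate check, then read index files */ })
	<-done
}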

chyezh avatar Feb 28 '24 08:02 chyezh

/unassign

yanliang567 avatar Feb 29 '24 01:02 yanliang567

@chyezh A loading segment will not be released in the segment manager. In my opinion, concurrent load & release should not happen for the same segment. Could you please explain in detail how it happened?

congqixia avatar Feb 29 '24 02:02 congqixia

@chyezh A loading segment will not be released in the segment manager. In my opinion, concurrent load & release should not happen for the same segment. Could you please explain in detail how it happened?

Load is triggered while the segment is releasing; it is not that release is triggered while the segment is loading.

The release segment operation is divided into two steps on the query node:

  1. Removing the segment from the segmentManager (after this, the SegmentLoader is allowed to reload this segment),
  2. Releasing the actual segment.

chyezh avatar Feb 29 '24 02:02 chyezh

@chyezh A loading segment will not be released in the segment manager. In my opinion, concurrent load & release should not happen for the same segment. Could you please explain in detail how it happened?

Load is triggered while the segment is releasing; it is not that release is triggered while the segment is loading.

The release segment operation is divided into two steps on the query node:

1. Removing the segment from the segmentManager (after this, the SegmentLoader is allowed to reload this segment),

2. Releasing the actual segment.

@chyezh got it, thanks!

congqixia avatar Feb 29 '24 02:02 congqixia

After some offline discussion, the final solution shall be to separate the disk resources for different segment life-cycles.

One more thing: it looks weird that a segment is released and then loaded back. Maybe the segment was bouncing between querynodes?

congqixia avatar Feb 29 '24 02:02 congqixia

After some offline discussion, the final solution shall be to separate the disk resources for different segment life-cycles.

One more thing: it looks weird that a segment is released and then loaded back. Maybe the segment was bouncing between querynodes?

  • The segment was released on the QN because the collection was released.
  • The segment was reloaded by the segment checker (lack of segment); updated by Distribution?
[2024/02/26 16:13:34.498 +00:00] [INFO] [task/scheduler.go:269] ["task added"] [task="[id=1708948067586] [type=Reduce] [source=segment_checker] [reason=collection released] [collectionID=447902879639453431] [replicaID=-1] [priority=Normal] [actionsCount=1] [actions={[type=Reduce][node=1681][streaming=false]}] [segmentID=447990444064723266]"]

[2024/02/26 16:13:38.501 +00:00] [INFO] [task/scheduler.go:269] ["task added"] [task="[id=1708948067608] [type=Grow] [source=segment_checker] [reason=lacks of segment] [collectionID=447902879639453431] [replicaID=447990457955778562] [priority=Normal] [actionsCount=1] [actions={[type=Grow][node=1681][streaming=false]}] [segmentID=447990444064723266]"]

chyezh avatar Feb 29 '24 06:02 chyezh

Releasing and then loading the collection, or concurrently releasing and loading the collection, can reproduce it.

2024-02-27 00:13:34.487	[2024/02/26 16:13:34.487 +00:00] [INFO] [querycoordv2/services.go:254] ["release collection request received"] [traceID=458948ca161f98b29f6d8118b6001ae5] [collectionID=447902879639453431]
2024-02-27 00:13:34.498	[2024/02/26 16:13:34.498 +00:00] [INFO] [task/scheduler.go:269] ["task added"] [task="[id=1708948067586] [type=Reduce] [source=segment_checker] [reason=collection released] [collectionID=447902879639453431] [replicaID=-1] [priority=Normal] [actionsCount=1] [actions={[type=Reduce][node=1681][streaming=false]}] [segmentID=447990444064723266]"]
2024-02-27 00:13:34.976	[2024/02/26 16:13:34.976 +00:00] [INFO] [task/executor.go:104] ["execute the action of task"] [taskID=1708948067586] [collectionID=447902879639453431] [replicaID=-1] [step=0] [source=segment_checker]
2024-02-27 00:13:34.977	[2024/02/26 16:13:34.976 +00:00] [INFO] [task/executor.go:298] ["release segment..."] [taskID=1708948067586] [collectionID=447902879639453431] [replicaID=-1] [segmentID=447990444064723266] [node=1681] [source=segment_checker]
2024-02-27 00:13:35.469	[2024/02/26 16:13:35.469 +00:00] [INFO] [task/scheduler.go:768] ["task removed"] [taskID=1708948067586] [collectionID=447902879639453431] [replicaID=-1] [status=succeeded] [segmentID=447990444064723266]
2024-02-27 00:13:35.470	[2024/02/26 16:13:35.470 +00:00] [WARN] [task/executor.go:301] ["failed to release segment, it may be a false failure"] [taskID=1708948067586] [collectionID=447902879639453431] [replicaID=-1] [segmentID=447990444064723266] [node=1681] [source=segment_checker] [error="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:87 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:192 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).ReleaseSegments\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/session/cluster.go:164 github.com/milvus-io/milvus/internal/querycoordv2/session.(*QueryCluster).ReleaseSegments.func1\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/session/cluster.go:271 github.com/milvus-io/milvus/internal/querycoordv2/session.(*QueryCluster).send\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/session/cluster.go:161 github.com/milvus-io/milvus/internal/querycoordv2/session.(*QueryCluster).ReleaseSegments\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/task/executor.go:299 github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).releaseSegment\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/task/executor.go:135 github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).executeSegmentAction: attempt #0: rpc error: code = Canceled desc = context canceled: context canceled"] [errorVerbose="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace: attempt #0: rpc error: code = Canceled desc = context canceled: context canceled\n(1) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n  | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550\n  | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n  | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564\n  | github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n  | \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:87\n  | github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).ReleaseSegments\n  | \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:192\n  | github.com/milvus-io/milvus/internal/querycoordv2/session.(*QueryCluster).ReleaseSegments.func1\n  | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/session/cluster.go:164\n  | github.com/milvus-io/milvus/internal/querycoordv2/session.(*QueryCluster).send\n  | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/session/cluster.go:271\n  | github.com/milvus-io/milvus/internal/querycoordv2/session.(*QueryCluster).ReleaseSegments\n  | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/session/cluster.go:161\n  | 
github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).releaseSegment\n  | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/task/executor.go:299\n  | github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).executeSegmentAction\n  | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/task/executor.go:135\n  | github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).Execute.func1\n  | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/task/executor.go:107\n  | runtime.goexit\n  | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n  | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n  | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n  | /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:87 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n  | /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:192 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).ReleaseSegments\n  | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/session/cluster.go:164 github.com/milvus-io/milvus/internal/querycoordv2/session.(*QueryCluster).ReleaseSegments.func1\n  | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/session/cluster.go:271 github.com/milvus-io/milvus/internal/querycoordv2/session.(*QueryCluster).send\n  | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/session/cluster.go:161 github.com/milvus-io/milvus/internal/querycoordv2/session.(*QueryCluster).ReleaseSegments\n  | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/task/executor.go:299 github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).releaseSegment\n  | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/task/executor.go:135 github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).executeSegmentAction\nWraps: (3) attempt #0: rpc error: code = Canceled desc = context canceled\nWraps: (4) context canceled\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) merr.multiErrors (4) *errors.errorString"]
2024-02-27 00:13:35.790	[2024/02/26 16:13:35.790 +00:00] [INFO] [querycoordv2/services.go:197] ["load collection request received"] [traceID=6019913291c1f2be024e5909f5edd21f] [collectionID=447902879639453431] [replicaNumber=1] [resourceGroups="[]"] [refreshMode=false] [schema="name:\"laion_stable_4\" fields:<fieldID:100 name:\"id\" is_primary_key:true data_type:Int64 > fields:<fieldID:101 name:\"float_vector\" data_type:FloatVector type_params:<key:\"dim\" value:\"768\" > > fields:<fieldID:102 name:\"int64_pk_5b\" data_type:Int64 is_partition_key:true > fields:<fieldID:103 name:\"varchar_caption\" data_type:VarChar type_params:<key:\"max_length\" value:\"8192\" > > fields:<fieldID:104 name:\"varchar_NSFW\" data_type:VarChar type_params:<key:\"max_length\" value:\"8192\" > > fields:<fieldID:105 name:\"float64_similarity\" data_type:Float > fields:<fieldID:106 name:\"int64_width\" data_type:Int64 > fields:<fieldID:107 name:\"int64_height\" data_type:Int64 > fields:<fieldID:108 name:\"int64_original_width\" data_type:Int64 > fields:<fieldID:109 name:\"int64_original_height\" data_type:Int64 > fields:<fieldID:110 name:\"varchar_md5\" data_type:VarChar type_params:<key:\"max_length\" value:\"8192\" > > fields:<fieldID:111 name:\"$meta\" description:\"dynamic schema\" data_type:JSON is_dynamic:true > enable_dynamic_field:true "] [fieldIndexes="[447902879639453513,447902879639453519,447902879639453502,447902879639453508]"]
2024-02-27 00:13:38.501	[2024/02/26 16:13:38.501 +00:00] [INFO] [task/scheduler.go:269] ["task added"] [task="[id=1708948067608] [type=Grow] [source=segment_checker] [reason=lacks of segment] [collectionID=447902879639453431] [replicaID=447990457955778562] [priority=Normal] [actionsCount=1] [actions={[type=Grow][node=1681][streaming=false]}] [segmentID=447990444064723266]"]
2024-02-27 00:13:38.608	[2024/02/26 16:13:38.608 +00:00] [INFO] [task/executor.go:104] ["execute the action of task"] [taskID=1708948067608] [collectionID=447902879639453431] [replicaID=447990457955778562] [step=0] [source=segment_checker]
2024-02-27 00:13:38.906	[2024/02/26 16:13:38.906 +00:00] [INFO] [task/executor.go:230] ["load segments..."] [taskID=1708948067608] [collectionID=447902879639453431] [replicaID=447990457955778562] [segmentID=447990444064723266] [node=1681] [source=segment_checker] [shardLeader=1679]
2024-02-27 00:14:02.610	[2024/02/26 16:14:02.609 +00:00] [WARN] [task/executor.go:238] ["failed to load segment"] [taskID=1708948067608] [collectionID=447902879639453431] [replicaID=447990457955778562] [segmentID=447990444064723266] [node=1681] [source=segment_checker] [shardLeader=1679] [error="unrecoverable error"]
2024-02-27 00:14:02.610	[2024/02/26 16:14:02.609 +00:00] [INFO] [task/executor.go:119] ["execute action done, remove it"] [taskID=1708948067608] [step=0] [error="unrecoverable error"]
2024-02-27 00:14:02.623	[2024/02/26 16:14:02.623 +00:00] [WARN] [task/scheduler.go:727] ["task scheduler recordSegmentTaskError"] [taskID=1708948067608] [collectionID=447902879639453431] [replicaID=447990457955778562] [segmentID=447990444064723266] [status=failed] [error="unrecoverable error"]
2024-02-27 00:14:02.623	[2024/02/26 16:14:02.623 +00:00] [INFO] [task/scheduler.go:768] ["task removed"] [taskID=1708948067608] [collectionID=447902879639453431] [replicaID=447990457955778562] [status=failed] [segmentID=447990444064723266]
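
For reference, a minimal reproduction sketch along these lines, assuming the milvus-sdk-go v2 client and its NewGrpcClient / ReleaseCollection / LoadCollection methods (verify against the SDK version you actually use):

package main

import (
	"context"
	"log"

	"github.com/milvus-io/milvus-sdk-go/v2/client"
)

func main() {
	ctx := context.Background()
	// Hypothetical connection parameters; the collection name is the one from this issue.
	c, err := client.NewGrpcClient(ctx, "localhost:19530")
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	const coll = "laion_stable_4"

	// Release and load back-to-back; repeating this widens the window in which
	// QueryCoord schedules a Grow task for a segment the querynode is still releasing.
	for i := 0; i < 10; i++ {
		if err := c.ReleaseCollection(ctx, coll); err != nil {
			log.Printf("release: %v", err)
		}
		if err := c.LoadCollection(ctx, coll, true); err != nil { // async load
			log.Printf("load: %v", err)
		}
	}
}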

chyezh avatar Mar 01 '24 07:03 chyezh

@chyezh

  • image: cardinal-milvus-io-2.3-3c90475-20240311
  • queryNode laion1b-test-2-milvus-querynode-1-86cfff6f5d-7b2lv terminated with exit code 134 at 2024-03-12 16:02:40.814 (UTC)

ThreadDao avatar Mar 14 '24 07:03 ThreadDao

Short-term fix: implement mutual exclusion between Release and Load on the QN.

ThreadDao avatar Apr 25 '24 07:04 ThreadDao

Should be fixed in 2.4.5, please verify it.

chyezh avatar Jun 14 '24 07:06 chyezh

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Sep 11 '24 10:09 stale[bot]