
[Bug]: panic when ack of broadcaster

Open chyezh opened this issue 1 week ago • 3 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Environment

- Milvus version: f8c972a102d82878fdfadbbacf23f2127fb29d20
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior


panic: close of closed channel

goroutine 1180 gp=0xc00337f880 m=8 mp=0xc000a80008 [running]:
panic({0x6f776e0?, 0x81e65a0?})
	/go/pkg/mod/golang.org/[email protected]/src/runtime/panic.go:811 +0x168 fp=0xc002dcf2e0 sp=0xc002dcf230 pc=0x28589e8
runtime.closechan(0xc000f4df10)
	/go/pkg/mod/golang.org/[email protected]/src/runtime/chan.go:422 +0x3cf fp=0xc002dcf338 sp=0xc002dcf2e0 pc=0x27ec48f
github.com/milvus-io/milvus/internal/streamingcoord/server/broadcaster.(*broadcastTask).ack(0xc000fa7080, {0x8254818, 0xc003f81c50}, {0xc0029a54f0?, 0xffffffffffffffff?, 0x100c0029a53e8?})
	/workspace/source/internal/streamingcoord/server/broadcaster/broadcast_task.go:251 +0xf7 fp=0xc002dcf380 sp=0xc002dcf338 pc=0x5d9e157
github.com/milvus-io/milvus/internal/streamingcoord/server/broadcaster.(*broadcastTask).Ack(0xc000feadc0?, {0x8254818?, 0xc003f81c50?}, {0xc0029a54f0?, 0x831f900?, 0xc0029a5458?})
	/workspace/source/internal/streamingcoord/server/broadcaster/broadcast_task.go:228 +0xe5 fp=0xc002dcf3f8 sp=0xc002dcf380 pc=0x5d9df65
github.com/milvus-io/milvus/internal/streamingcoord/server/broadcaster.(*broadcastTaskManager).Ack(0xc000feadc0, {0x8254818, 0xc003f81c50}, {0x82d9280, 0xc003fab5c0})
	/workspace/source/internal/streamingcoord/server/broadcaster/broadcast_manager.go:217 +0x330 fp=0xc002dcf530 sp=0xc002dcf3f8 pc=0x5d9a310
github.com/milvus-io/milvus/internal/streamingcoord/server/service.(*broadcastServceImpl).Ack(0xb51b590?, {0x82547a8, 0xc003fab530}, 0xc0040960a0)
	/workspace/source/internal/streamingcoord/server/service/broadcast.go:67 +0x18a fp=0xc002dcf5c8 sp=0xc002dcf530 pc=0x6184aaa
github.com/milvus-io/milvus/pkg/v2/proto/streamingpb._StreamingCoordBroadcastService_Ack_Handler.func1({0x82547a8?, 0xc003fab530?}, {0x74c3b20?, 0xc0040960a0?})
	/workspace/source/pkg/proto/streamingpb/streaming_grpc.pb.go:216 +0xcb fp=0xc002dcf600 sp=0xc002dcf5c8 pc=0x37b9b8b
github.com/milvus-io/milvus/internal/distributed/mixcoord.(*Server).startGrpcLoop.ServerIDValidationUnaryServerInterceptor.func8({0x82547a8, 0xc003fab530}, {0x74c3b20, 0xc0040960a0}, 0x38b8c54?, 0xc003f9e648)
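
For readers unfamiliar with this class of panic: the trace ends in `runtime.closechan`, which panics when `close` is called on a channel that has already been closed. The sketch below reproduces that panic in isolation and shows one common guard, `sync.Once`. It is a generic illustration only. `ackNotifier`, `markDone`, and `unsafeDoubleClose` are hypothetical names, and this is not the code in `broadcast_task.go` nor necessarily how the issue was fixed.

```go
package main

import (
	"fmt"
	"sync"
)

// ackNotifier is a hypothetical stand-in for a task that signals
// "all vchannels acked" by closing a channel. It is not a Milvus type.
type ackNotifier struct {
	once sync.Once
	done chan struct{}
}

func newAckNotifier() *ackNotifier {
	return &ackNotifier{done: make(chan struct{})}
}

// markDone closes the done channel at most once. It returns true only
// for the call that actually performed the close, so a duplicate ack
// becomes a no-op instead of a panic.
func (n *ackNotifier) markDone() bool {
	closed := false
	n.once.Do(func() {
		close(n.done)
		closed = true
	})
	return closed
}

// unsafeDoubleClose closes the same channel twice and returns the
// recovered panic message, reproducing the failure mode in the trace.
func unsafeDoubleClose() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	ch := make(chan struct{})
	close(ch)
	close(ch) // second close panics
	return ""
}

func main() {
	fmt.Println("unguarded:", unsafeDoubleClose())

	n := newAckNotifier()
	fmt.Println("first ack closed channel:", n.markDone())
	fmt.Println("duplicate ack closed channel:", n.markDone())
}
```

A guard like this only hides the symptom. If a duplicate ack reaches the task at all, the more important question is why it was not filtered earlier, which is what the tombstone discussion in the comments addresses.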


Expected Behavior

No response

Steps To Reproduce


Milvus Log

No response

Anything else?

No response

chyezh avatar Nov 18 '25 02:11 chyezh

/assign @chyezh

chyezh avatar Nov 18 '25 02:11 chyezh

Meanwhile, the downstream keeps too few tombstones. We need to keep more tombstones to avoid acking the same broadcast task twice.

2025-11-18 11:14:49.063	[2025/11/18 03:14:49.063 +00:00] [INFO] [broadcaster/broadcast_task.go:425] ["save broadcast task done"] [module=streamingcoord] [component=broadcaster] [message="{type=CreateCollection,vchannel=cdc-test-downstream-390-rootcoord-dml_12_462274027807775233v0,timetick=462274085208981522,broadcastID=462274027807573005,broadcastVChannels=cdc-test-downstream-390-rootcoord-dml_0_vcchan,cdc-test-downstream-390-rootcoord-dml_12_462274027807775233v0,cdc-test-downstream-390-rootcoord-dml_13_462274027807775233v1,rClusterID=cdc-test-upstream-390,rMessageID=4,rLastConfirmedMessageID=2,rTimeTick=462274085172019219,rVchannel=cdc-test-upstream-390-rootcoord-dml_12_462274027807775233v0,size=1084,collectionID=462274027807775233}"] [state=BROADCAST_TASK_STATE_REPLICATED] [ackedVChannelCount=1]
2025-11-18 11:14:49.069	[2025/11/18 03:14:49.069 +00:00] [INFO] [broadcaster/broadcast_task.go:425] ["save broadcast task done"] [module=streamingcoord] [component=broadcaster] [message="{type=CreateCollection,vchannel=cdc-test-downstream-390-rootcoord-dml_12_462274027807775233v0,timetick=462274085208981522,broadcastID=462274027807573005,broadcastVChannels=cdc-test-downstream-390-rootcoord-dml_0_vcchan,cdc-test-downstream-390-rootcoord-dml_12_462274027807775233v0,cdc-test-downstream-390-rootcoord-dml_13_462274027807775233v1,rClusterID=cdc-test-upstream-390,rMessageID=4,rLastConfirmedMessageID=2,rTimeTick=462274085172019219,rVchannel=cdc-test-upstream-390-rootcoord-dml_12_462274027807775233v0,size=1084,collectionID=462274027807775233}"] [state=BROADCAST_TASK_STATE_REPLICATED] [ackedVChannelCount=2]
2025-11-18 11:14:49.301	[2025/11/18 03:14:49.301 +00:00] [INFO] [broadcaster/broadcast_task.go:425] ["save broadcast task done"] [module=streamingcoord] [component=broadcaster] [message="{type=CreateCollection,vchannel=cdc-test-downstream-390-rootcoord-dml_12_462274027807775233v0,timetick=462274085208981522,broadcastID=462274027807573005,broadcastVChannels=cdc-test-downstream-390-rootcoord-dml_0_vcchan,cdc-test-downstream-390-rootcoord-dml_12_462274027807775233v0,cdc-test-downstream-390-rootcoord-dml_13_462274027807775233v1,rClusterID=cdc-test-upstream-390,rMessageID=4,rLastConfirmedMessageID=2,rTimeTick=462274085172019219,rVchannel=cdc-test-upstream-390-rootcoord-dml_12_462274027807775233v0,size=1084,collectionID=462274027807775233}"] [state=BROADCAST_TASK_STATE_REPLICATED] [ackedVChannelCount=3]
2025-11-18 11:14:49.301	[2025/11/18 03:14:49.301 +00:00] [INFO] [broadcaster/ack_callback_scheduler.go:145] ["start to execute ack callback"] [module=streamingcoord] [component=broadcaster] [broadcastID=462274027807573005]
2025-11-18 11:14:49.301	[2025/11/18 03:14:49.301 +00:00] [DEBUG] [broadcaster/ack_callback_scheduler.go:149] ["all vchannels are acked"] [module=streamingcoord] [component=broadcaster] [broadcastID=462274027807573005]
2025-11-18 11:14:49.329	[2025/11/18 03:14:49.329 +00:00] [DEBUG] [broadcaster/ack_callback_scheduler.go:167] ["ack callback done"] [module=streamingcoord] [component=broadcaster] [broadcastID=462274027807573005]
2025-11-18 11:14:49.331	[2025/11/18 03:14:49.330 +00:00] [INFO] [broadcaster/broadcast_task.go:425] ["save broadcast task done"] [module=streamingcoord] [component=broadcaster] [message="{type=CreateCollection,vchannel=cdc-test-downstream-390-rootcoord-dml_12_462274027807775233v0,timetick=462274085208981522,broadcastID=462274027807573005,broadcastVChannels=cdc-test-downstream-390-rootcoord-dml_0_vcchan,cdc-test-downstream-390-rootcoord-dml_12_462274027807775233v0,cdc-test-downstream-390-rootcoord-dml_13_462274027807775233v1,rClusterID=cdc-test-upstream-390,rMessageID=4,rLastConfirmedMessageID=2,rTimeTick=462274085172019219,rVchannel=cdc-test-upstream-390-rootcoord-dml_12_462274027807775233v0,size=1084,collectionID=462274027807775233}"] [state=BROADCAST_TASK_STATE_TOMBSTONE] [ackedVChannelCount=3]
2025-11-18 11:14:49.331	[2025/11/18 03:14:49.330 +00:00] [INFO] [broadcaster/ack_callback_scheduler.go:140] ["execute ack callback done"] [module=streamingcoord] [component=broadcaster] [broadcastID=462274027807573005]
2025-11-18 11:17:02.551	[2025/11/18 03:17:02.551 +00:00] [DEBUG] [broadcaster/broadcast_manager.go:211] ["task is tombstone, ignored the ack request"] [module=streamingcoord] [component=broadcaster] [broadcastID=462274027807573005] [vchannel=cdc-test-downstream-390-rootcoord-dml_0_vcchan]
2025-11-18 11:21:36.902	[2025/11/18 03:21:36.902 +00:00] [INFO] [broadcaster/broadcast_task.go:425] ["save broadcast task done"] [module=streamingcoord] [component=broadcaster] [message="{type=CreateCollection,vchannel=cdc-test-downstream-390-rootcoord-dml_12_462274027807775233v0,timetick=462274085208981522,broadcastID=462274027807573005,broadcastVChannels=cdc-test-downstream-390-rootcoord-dml_0_vcchan,cdc-test-downstream-390-rootcoord-dml_12_462274027807775233v0,cdc-test-downstream-390-rootcoord-dml_13_462274027807775233v1,rClusterID=cdc-test-upstream-390,rMessageID=4,rLastConfirmedMessageID=2,rTimeTick=462274085172019219,rVchannel=cdc-test-upstream-390-rootcoord-dml_12_462274027807775233v0,size=1084,collectionID=462274027807775233}"] [state=BROADCAST_TASK_STATE_DONE] [ackedVChannelCount=3]
2025-11-18 11:24:46.391	[2025/11/18 03:24:46.391 +00:00] [INFO] [broadcaster/broadcast_task.go:425] ["save broadcast task done"] [module=streamingcoord] [component=broadcaster] [message="{type=CreateCollection,vchannel=cdc-test-downstream-390-rootcoord-dml_0_vcchan,timetick=462274085261410334,broadcastID=462274027807573005,broadcastVChannels=cdc-test-downstream-390-rootcoord-dml_0_vcchan,cdc-test-downstream-390-rootcoord-dml_12_462274027807775233v0,cdc-test-downstream-390-rootcoord-dml_13_462274027807775233v1,rClusterID=cdc-test-upstream-390,rMessageID=67,rLastConfirmedMessageID=65,rTimeTick=462274085172019220,rVchannel=cdc-test-upstream-390-rootcoord-dml_0_vcchan,size=1084,collectionID=462274027807775233}"] [state=BROADCAST_TASK_STATE_REPLICATED] [ackedVChannelCount=1]
2025-11-18 11:24:46.391	[2025/11/18 03:24:46.391 +00:00] [INFO] [broadcaster/ack_callback_scheduler.go:145] ["start to execute ack callback"] [module=streamingcoord] [component=broadcaster] [broadcastID=462274027807573005]
2025-11-18 11:26:35.280	[2025/11/18 03:26:35.280 +00:00] [WARN] [broadcaster/ack_callback_scheduler.go:142] ["execute ack callback failed"] [module=streamingcoord] [component=broadcaster] [broadcastID=462274027807573005] [error="context canceled"]
2025-11-18 11:26:39.551	[2025/11/18 03:26:39.536 +00:00] [INFO] [broadcaster/ack_callback_scheduler.go:145] ["start to execute ack callback"] [module=streamingcoord] [component=broadcaster] [broadcastID=462274027807573005]
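
In the log above, broadcastID 462274027807573005 reaches `BROADCAST_TASK_STATE_TOMBSTONE`, a later ack is correctly ignored, the task then moves to `BROADCAST_TASK_STATE_DONE`, and a subsequent message with the same broadcastID is processed as if it were new. The sketch below illustrates why the number of retained tombstones matters, using a capacity-bounded set with oldest-first eviction. The type `tombstoneSet`, its methods, and the eviction policy are all assumptions made for illustration; the issue does not show how Milvus actually retains or expires tombstones.

```go
package main

import "fmt"

// tombstoneSet is a hypothetical bounded record of finished broadcast IDs.
// When capacity is exceeded, the oldest ID is forgotten (FIFO eviction).
type tombstoneSet struct {
	capacity int
	order    []uint64
	seen     map[uint64]struct{}
}

func newTombstoneSet(capacity int) *tombstoneSet {
	return &tombstoneSet{capacity: capacity, seen: make(map[uint64]struct{})}
}

// Add records a finished broadcast ID and evicts the oldest entry
// if the set has grown past its capacity.
func (s *tombstoneSet) Add(id uint64) {
	if _, ok := s.seen[id]; ok {
		return
	}
	s.order = append(s.order, id)
	s.seen[id] = struct{}{}
	if len(s.order) > s.capacity {
		oldest := s.order[0]
		s.order = s.order[1:]
		delete(s.seen, oldest)
	}
}

// Contains reports whether a late ack for id can still be recognized
// as belonging to an already finished task.
func (s *tombstoneSet) Contains(id uint64) bool {
	_, ok := s.seen[id]
	return ok
}

func main() {
	// Too few tombstones: finishing task 2 evicts task 1, so a late
	// ack for task 1 would no longer be recognized as a duplicate.
	small := newTombstoneSet(1)
	small.Add(1)
	small.Add(2)
	fmt.Println("small set remembers task 1:", small.Contains(1))

	// More tombstones: the late ack for task 1 is still filtered.
	large := newTombstoneSet(8)
	large.Add(1)
	large.Add(2)
	fmt.Println("large set remembers task 1:", large.Contains(1))
}
```

Whatever the real retention rule is, the trade-off is the same: tombstones must outlive the longest delay after which a replicated message for the same broadcastID can still arrive.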

chyezh avatar Nov 18 '25 06:11 chyezh

verification passed.

https://qa-jenkins.milvus.io/job/milvus_cdc_chaos_test/392/

chyezh avatar Nov 18 '25 08:11 chyezh

should be fixed.

chyezh avatar Nov 19 '25 07:11 chyezh