milvus icon indicating copy to clipboard operation
milvus copied to clipboard

[Bug]: Data node crash loop on missing data

Open ThomasAlxDmy opened this issue 2 years ago • 16 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.2.2
- Deployment mode(standalone or cluster): cluster (ansible)
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.1.0.dev98
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Data node keep crashing, not being able to back up data to MinIO, same errors happen after each restart. Is there a way I can mitigate this?

Not sure what this error means? it's missing but somehow referenced?

Expected Behavior

Restart docker container and error would go away? Any Settings we can use skip missing data?

Steps To Reproduce

No response

Milvus Log

[2023/01/05 02:32:49.057 +00:00] [DEBUG] [datanode/flow_graph_dd_node.go:329] ["DDNode sent delta timeTick"] [collectionID=438376139153874054] [ts=438537019444953089] [ts_p=2023/01/05 02:32:48.952 +00:00] [channel=by-dev-rootcoord-dml_40_438376139153874054v0]
[2023/01/05 02:32:49.075 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028903]
[2023/01/05 02:32:49.075 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028903] [error="failed to get segment 438376139150028903"]
[2023/01/05 02:32:49.075 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028846]
[2023/01/05 02:32:49.075 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028846] [error="failed to get segment 438376139150028846"]
[2023/01/05 02:32:49.094 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028968]
[2023/01/05 02:32:49.094 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028968] [error="failed to get segment 438376139150028968"]
[2023/01/05 02:32:49.094 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028791]
[2023/01/05 02:32:49.094 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028791] [error="failed to get segment 438376139150028791"]
[2023/01/05 02:32:49.129 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028995]
[2023/01/05 02:32:49.129 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028995] [error="failed to get segment 438376139150028995"]
[2023/01/05 02:32:49.151 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028972]
[2023/01/05 02:32:49.151 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028972] [error="failed to get segment 438376139150028972"]
[2023/01/05 02:32:49.151 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028896]
[2023/01/05 02:32:49.151 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028896] [error="failed to get segment 438376139150028896"]
[2023/01/05 02:32:49.174 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150029035]
[2023/01/05 02:32:49.174 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150029035] [error="failed to get segment 438376139150029035"]
[2023/01/05 02:32:49.174 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028998]
[2023/01/05 02:32:49.174 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028998] [error="failed to get segment 438376139150028998"]
[2023/01/05 02:32:49.185 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028950]
[2023/01/05 02:32:49.185 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028950] [error="failed to get segment 438376139150028950"]
[2023/01/05 02:32:49.211 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028900]
[2023/01/05 02:32:49.211 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028900] [error="failed to get segment 438376139150028900"]
[2023/01/05 02:32:49.341 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028922]
[2023/01/05 02:32:49.341 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028922] [error="failed to get segment 438376139150028922"]
[2023/01/05 02:32:51.771 +00:00] [WARN] [datanode/flush_manager.go:849] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028774] [error="All attempts results:\nattempt #1:segment 438376139150028774 not found\nattempt #2:segment 438376139150028774 not found\nattempt #3:segment 438376139150028774 not found\nattempt #4:segment 438376139150028774 not found\nattempt #5:segment 438376139150028774 not found\nattempt #6:segment 438376139150028774 not found\nattempt #7:segment 438376139150028774 not found\nattempt #8:segment 438376139150028774 not found\nattempt #9:segment 438376139150028774 not found\nattempt #10:segment 438376139150028774 not found\nattempt #11:segment 438376139150028774 not found\n"]
panic: All attempts results:
attempt #1:segment 438376139150028774 not found
attempt #2:segment 438376139150028774 not found
attempt #3:segment 438376139150028774 not found
attempt #4:segment 438376139150028774 not found
attempt #5:segment 438376139150028774 not found
attempt #6:segment 438376139150028774 not found
attempt #7:segment 438376139150028774 not found
attempt #8:segment 438376139150028774 not found
attempt #9:segment 438376139150028774 not found
attempt #10:segment 438376139150028774 not found
attempt #11:segment 438376139150028774 not found


goroutine 18651 [running]:
github.com/milvus-io/milvus/internal/datanode.flushNotifyFunc.func1(0xc004230b90)
	/go/src/github.com/milvus-io/milvus/internal/datanode/flush_manager.go:853 +0x1296
github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish(0xc000c36180, 0xc0012d4840, 0xc0044acd30)
	/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:204 +0xdb
created by github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).init.func1
	/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:121 +0xf1

Anything else?

No response

ThomasAlxDmy avatar Jan 05 '23 02:01 ThomasAlxDmy

https://github.com/milvus-io/milvus/blob/master/internal/datanode/flush_manager.go#L851-L852

@congqixia you might know about this, any idea of what the problem is and how it can be mitigated?

ThomasAlxDmy avatar Jan 05 '23 02:01 ThomasAlxDmy

Hi @ThomasAlxDmy Could you offer logs about your datacoord and datanode? My guess is those segments are from a dropped collection, if for that case we can simply ignore the error than abort the service.

xiaofan-luan avatar Jan 05 '23 03:01 xiaofan-luan

/assign @xiaofan-luan

xiaofan-luan avatar Jan 05 '23 03:01 xiaofan-luan

	log.Info("SaveBinlogPath",
		zap.Int64("SegmentID", pack.segmentID),
		zap.Int64("CollectionID", dsService.collectionID),
		zap.Any("startPos", startPos),
		zap.Any("checkPoints", checkPoints),
		zap.Int("Length of Field2BinlogPaths", len(fieldInsert)),
		zap.Int("Length of Field2Stats", len(fieldStats)),
		zap.Int("Length of Field2Deltalogs", len(deltaInfos[0].GetBinlogs())),
		zap.String("vChannelName", dsService.vchannelName),
	)

xiaofan-luan avatar Jan 05 '23 03:01 xiaofan-luan

from this log you will know the collectionID of these segments, and datacoord log will show if the collection is dropped. from the log your offered it only shows warn message, is there a way to show more?

If it's from a drop collection, One quick fix to let it work is to use birdwatcher tool to push forward checkpoint.(May loose some streaming data if you have other collection data in the stream)

xiaofan-luan avatar Jan 05 '23 03:01 xiaofan-luan

sounds about right:

Here's the logs from data-coordinator node:

[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143574409]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139142948494]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143363880]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143782385]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143363706]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143984223]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143988047]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143155733]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143369471]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143783160]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143575036]
[2023/01/05 03:33:34.391 +00:00] [WARN] [datacoord/handler.go:310] ["failed to load collection from rootcoord"] [collectionID=438376139142926320] [error="code: CollectionNotExists, reason: can't find collection: 438376139142926320"]
[2023/01/05 03:33:34.391 +00:00] [WARN] [datacoord/compaction_trigger.go:349] ["get compact time failed, skip to handle compaction"] [collectionID=438376139142926320] [partitionID=438376139142926321] [channel=by-dev-rootcoord-dml_19_438376139142926320v1]
[2023/01/05 03:33:34.558 +00:00] [DEBUG] [datacoord/services.go:877] [DataCoord.GetMetrics] [node_id=69] [req="{\"metric_type\":\"system_info\"}"] [metric_type=system_info] [metrics="status:<> response:\"{\\\"cluster\\\":{\\\"self\\\":{\\\"has_error\\\":false,\\\"error_reason\\\":\\\"\\\",\\\"name\\\":\\\"datacoord69\\\",\\\"hardware_infos\\\":{\\\"ip\\\":\\\"10.0.0.249:13333\\\",\\\"cpu_core_count\\\":4,\\\"cpu_core_usage\\\":4.865771812735999,\\\"memory\\\":16818016256,\\\"memory_usage\\\":105013248,\\\"disk\\\":104857600,\\\"disk_usage\\\":2097152},\\\"system_info\\\":{\\\"system_version\\\":\\\"0fdc1a04\\\",\\\"deploy_mode\\\":\\\"DISTRIBUTED\\\",\\\"build_version\\\":\\\"v2.2.2\\\",\\\"build_time\\\":\\\"Thu Dec 22 09:08:37 UTC 2022\\\",\\\"used_go_version\\\":\\\"go version go1.18.3 linux/amd64\\\"},\\\"created_time\\\":\\\"2023-01-05 01:54:34.339885567 +0000 UTC m=+23.456467038\\\",\\\"updated_time\\\":\\\"2023-01-05 01:54:34.339887409 +0000 UTC m=+23.456468816\\\",\\\"type\\\":\\\"datacoord\\\",\\\"id\\\":69,\\\"system_configurations\\\":{\\\"segment_max_size\\\":512},\\\"quota_metrics\\\":{\\\"TotalBinlogSize\\\":7543221972}},\\\"connected_nodes\\\":[{\\\"has_error\\\":false,\\\"error_reason\\\":\\\"\\\",\\\"name\\\":\\\"datanode116\\\",\\\"hardware_infos\\\":{\\\"ip\\\":\\\"10.0.0.72:21124\\\",\\\"cpu_core_count\\\":8,\\\"cpu_core_usage\\\":0.8760951187187002,\\\"memory\\\":16332128256,\\\"memory_usage\\\":1879347200,\\\"disk\\\":104857600,\\\"disk_usage\\\":2097152},\\\"system_info\\\":{\\\"system_version\\\":\\\"0fdc1a04\\\",\\\"deploy_mode\\\":\\\"DISTRIBUTED\\\",\\\"build_version\\\":\\\"v2.2.2\\\",\\\"build_time\\\":\\\"Thu Dec 22 09:08:37 UTC 2022\\\",\\\"used_go_version\\\":\\\"go version go1.18.3 linux/amd64\\\"},\\\"created_time\\\":\\\"2023-01-05 03:32:29.617158198 +0000 UTC m=+0.166727733\\\",\\\"updated_time\\\":\\\"2023-01-05 03:32:29.617158286 +0000 UTC m=+0.166727822\\\",\\\"type\\\":\\\"datanode\\\",\\\"id\\\":116,\\\"system_configurations\\\":{\\\"flush_insert_buffer_size\\\":16777216},\\\"quota_metrics\\\":{\\\"Hms\\\":{\\\"ip\\\":\\\"10.0.0.72:21124\\\",\\\"cpu_core_count\\\":8,\\\"cpu_core_usage\\\":0.8760951187187002,\\\"memory\\\":16332128256,\\\"memory_usage\\\":1879347200,\\\"disk\\\":104857600,\\\"disk_usage\\\":2097152},\\\"Rms\\\":[{\\\"Label\\\":\\\"InsertConsumeThroughput\\\",\\\"Rate\\\":0},{\\\"Label\\\":\\\"DeleteConsumeThroughput\\\",\\\"Rate\\\":0}],\\\"Fgm\\\":{\\\"MinFlowGraphChannel\\\":\\\"by-dev-rootcoord-dml_104_438376139208359401v15\\\",\\\"MinFlowGraphTt\\\":438537975051321348,\\\"NumFlowGraph\\\":72}}}]},\\\"connections\\\":{\\\"name\\\":\\\"datacoord69\\\",\\\"connected_components\\\":[]}}\" component_name:\"datacoord69\" "] []
[2023/01/05 03:33:34.862 +00:00] [DEBUG] [datacoord/meta.go:1196] ["UpdateChannelCheckpoint done"] [vChannel=by-dev-rootcoord-dml_108_438376139208359401v19] [time=2023/01/05 03:31:57.751 +00:00]
[2023/01/05 03:33:35.087 +00:00] [INFO] [datacoord/services.go:397] ["receive SaveBinlogPaths request"] [nodeID=116] [collectionID=438376139145806781] [segmentID=438376139150028783] [isFlush=false] [isDropped=false] [startPositions="[{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139149828714},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139150028775},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREIcQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-31-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381357996703747},\"segmentID\":438376139150028774},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139150028783},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139150028788}]"] [checkpoints="[{\"segmentID\":438376139150028783,\"position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJsQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358101823495},\"num_of_rows\":267}]"]
[2023/01/05 03:33:35.087 +00:00] [ERROR] [datacoord/services.go:412] ["failed to get segment"] [segmentID=438376139150028783] [stack="github.com/milvus-io/milvus/internal/datacoord.(*Server).SaveBinlogPaths\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:412\ngithub.com/milvus-io/milvus/internal/distributed/datacoord.(*Server).SaveBinlogPaths\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/service.go:309\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_SaveBinlogPaths_Handler.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5980\ngithub.com/milvus-io/milvus/internal/util/logutil.UnaryTraceLoggerInterceptor\n\t/go/src/github.com/milvus-io/milvus/internal/util/logutil/grpc_interceptor.go:22\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/tracing/opentracing/server_interceptors.go:38\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:34\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_SaveBinlogPaths_Handler\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5982\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:922"]
[2023/01/05 03:33:35.113 +00:00] [INFO] [datacoord/services.go:397] ["receive SaveBinlogPaths request"] [nodeID=116] [collectionID=438376139145806781] [segmentID=438376139150028788] [isFlush=false] [isDropped=false] [startPositions="[{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139150028788},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139149828714},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139150028775},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREIcQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-31-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381357996703747},\"segmentID\":438376139150028774},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139150028783}]"] [checkpoints="[{\"segmentID\":438376139150028788,\"position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJsQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358101823495},\"num_of_rows\":525}]"]
[2023/01/05 03:33:35.113 +00:00] [INFO] [datacoord/services.go:397] ["receive SaveBinlogPaths request"] [nodeID=116] [collectionID=438376139145806781] [segmentID=438376139150028774] [isFlush=false] [isDropped=false] [startPositions="[{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139150028775},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREIcQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-31-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381357996703747},\"segmentID\":438376139150028774},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139150028783},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139150028788},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139149828714}]"] [checkpoints="[{\"segmentID\":438376139150028774,\"position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"num_of_rows\":260}]"]
[2023/01/05 03:33:35.113 +00:00] [ERROR] [datacoord/services.go:412] ["failed to get segment"] [segmentID=438376139150028788] [stack="github.com/milvus-io/milvus/internal/datacoord.(*Server).SaveBinlogPaths\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:412\ngithub.com/milvus-io/milvus/internal/distributed/datacoord.(*Server).SaveBinlogPaths\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/service.go:309\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_SaveBinlogPaths_Handler.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5980\ngithub.com/milvus-io/milvus/internal/util/logutil.UnaryTraceLoggerInterceptor\n\t/go/src/github.com/milvus-io/milvus/internal/util/logutil/grpc_interceptor.go:22\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/tracing/opentracing/server_interceptors.go:38\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:34\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_SaveBinlogPaths_Handler\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5982\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:922"]
[2023/01/05 03:33:35.113 +00:00] [ERROR] [datacoord/services.go:412] ["failed to get segment"] [segmentID=438376139150028774] [stack="github.com/milvus-io/milvus/internal/datacoord.(*Server).SaveBinlogPaths\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:412\ngithub.com/milvus-io/milvus/internal/distributed/datacoord.(*Server).SaveBinlogPaths\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/service.go:309\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_SaveBinlogPaths_Handler.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5980\ngithub.com/milvus-io/milvus/internal/util/logutil.UnaryTraceLoggerInterceptor\n\t/go/src/github.com/milvus-io/milvus/internal/util/logutil/grpc_interceptor.go:22\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/tracing/opentracing/server_interceptors.go:38\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:34\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_SaveBinlogPaths_Handler\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5982\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:922"]

ThomasAlxDmy avatar Jan 05 '23 03:01 ThomasAlxDmy

Yep it seems 438376139142926320 is dropped. but for some reason we are still handling data in datanode memory.

Use https://github.com/milvus-io/birdwatcher to repair checkpoint to the latest might help you to fix the problem, I will take a look into why this is happening

xiaofan-luan avatar Jan 05 '23 03:01 xiaofan-luan

/assign @wayblink I'm pretty sure this is due to collection dropped but checkpoint is not pushed forward. Could you take a look into it?

xiaofan-luan avatar Jan 05 '23 03:01 xiaofan-luan

Following up this thread, as we are facing the same issue.

jeet129 avatar Jan 17 '23 05:01 jeet129

Following up this thread, as we are facing the same issue.

Hi, could you offer your logs? We will look into it.

Following up this thread, as we are facing the same issue.

Get it from colleague, working on it.

wayblink avatar Jan 17 '23 06:01 wayblink

Use the birdwatcher to fix the checkpoint of a collection:

  1. install birdwatcher
git clone https://github.com/milvus-io/birdwatcher
cd birdwatcher
go build -o birdwatcher main.go
  1. start birdwatcher cmd program and connect to etcd ./birdwatcher connect --etcd localhost:2379 --rootPath by-dev 0_connect

  2. show all collections show collections 1_show_collections

  3. use the dropped collection's id to show checkpoint show checkpoint --collection 438812796177350661 1_show_cp0

  4. fix checkpoint of the channels of the dropped collection repair checkpoint --collection 438812796177350661 --vchannel by-dev-rootcoord-dml_2_438812796177350661v0 --mq_type pulsar --address pulsar://localhost:6650 --set_to latest-msgid 2_repair_cp1 repair checkpoint --collection 438812796177350661 --vchannel by-dev-rootcoord-dml_3_438812796177350661v1 --mq_type pulsar --address pulsar://localhost:6650 --set_to latest-msgid 2_repair_cp2

  5. show checkpoint of the dropped collection to make sure it has been changed show checkpoint --collection 438812796177350661 3_show_cp1

yhmo avatar Jan 17 '23 07:01 yhmo

bw_etcd_ALL.230117-160040.bak.gz

For Walmart issue, the collection is not dropped and the segment meta is lost. Datanode panic when trying to SaveBinlogPath. Note that it is a new segment, not compacted one. Former datacoord log is cleaned so we don't know why the segment meta disappear...

wayblink avatar Jan 18 '23 08:01 wayblink

this problem should be fixed by @longjiquan

The reason is data GCed segment after rootcoord drop segment.

xiaofan-luan avatar Jan 18 '23 11:01 xiaofan-luan

The fix pr is: Remove collection after GC finished

yhmo avatar Jan 19 '23 02:01 yhmo

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Mar 17 '23 04:03 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Aug 02 '23 14:08 stale[bot]

i believe this was already fixed in 2.2.12

yanliang567 avatar Aug 04 '23 07:08 yanliang567

Confirmed it's fixed!

ThomasAlxDmy avatar Sep 05 '23 19:09 ThomasAlxDmy