milvus
milvus copied to clipboard
[Bug]: Data node crash loop on missing data
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: 2.2.2
- Deployment mode(standalone or cluster): cluster (ansible)
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.1.0.dev98
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory:
- GPU:
- Others:
Current Behavior
Data node keep crashing, not being able to back up data to MinIO, same errors happen after each restart. Is there a way I can mitigate this?
Not sure what this error means? it's missing but somehow referenced?
Expected Behavior
Restart docker container and error would go away? Any Settings we can use skip missing data?
Steps To Reproduce
No response
Milvus Log
[2023/01/05 02:32:49.057 +00:00] [DEBUG] [datanode/flow_graph_dd_node.go:329] ["DDNode sent delta timeTick"] [collectionID=438376139153874054] [ts=438537019444953089] [ts_p=2023/01/05 02:32:48.952 +00:00] [channel=by-dev-rootcoord-dml_40_438376139153874054v0]
[2023/01/05 02:32:49.075 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028903]
[2023/01/05 02:32:49.075 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028903] [error="failed to get segment 438376139150028903"]
[2023/01/05 02:32:49.075 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028846]
[2023/01/05 02:32:49.075 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028846] [error="failed to get segment 438376139150028846"]
[2023/01/05 02:32:49.094 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028968]
[2023/01/05 02:32:49.094 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028968] [error="failed to get segment 438376139150028968"]
[2023/01/05 02:32:49.094 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028791]
[2023/01/05 02:32:49.094 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028791] [error="failed to get segment 438376139150028791"]
[2023/01/05 02:32:49.129 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028995]
[2023/01/05 02:32:49.129 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028995] [error="failed to get segment 438376139150028995"]
[2023/01/05 02:32:49.151 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028972]
[2023/01/05 02:32:49.151 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028972] [error="failed to get segment 438376139150028972"]
[2023/01/05 02:32:49.151 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028896]
[2023/01/05 02:32:49.151 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028896] [error="failed to get segment 438376139150028896"]
[2023/01/05 02:32:49.174 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150029035]
[2023/01/05 02:32:49.174 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150029035] [error="failed to get segment 438376139150029035"]
[2023/01/05 02:32:49.174 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028998]
[2023/01/05 02:32:49.174 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028998] [error="failed to get segment 438376139150028998"]
[2023/01/05 02:32:49.185 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028950]
[2023/01/05 02:32:49.185 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028950] [error="failed to get segment 438376139150028950"]
[2023/01/05 02:32:49.211 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028900]
[2023/01/05 02:32:49.211 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028900] [error="failed to get segment 438376139150028900"]
[2023/01/05 02:32:49.341 +00:00] [WARN] [datanode/flush_manager.go:826] ["stale segment not found, could be compacted"] ["segment ID"=438376139150028922]
[2023/01/05 02:32:49.341 +00:00] [WARN] [datanode/flush_manager.go:828] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028922] [error="failed to get segment 438376139150028922"]
[2023/01/05 02:32:51.771 +00:00] [WARN] [datanode/flush_manager.go:849] ["failed to SaveBinlogPaths"] ["segment ID"=438376139150028774] [error="All attempts results:\nattempt #1:segment 438376139150028774 not found\nattempt #2:segment 438376139150028774 not found\nattempt #3:segment 438376139150028774 not found\nattempt #4:segment 438376139150028774 not found\nattempt #5:segment 438376139150028774 not found\nattempt #6:segment 438376139150028774 not found\nattempt #7:segment 438376139150028774 not found\nattempt #8:segment 438376139150028774 not found\nattempt #9:segment 438376139150028774 not found\nattempt #10:segment 438376139150028774 not found\nattempt #11:segment 438376139150028774 not found\n"]
panic: All attempts results:
attempt #1:segment 438376139150028774 not found
attempt #2:segment 438376139150028774 not found
attempt #3:segment 438376139150028774 not found
attempt #4:segment 438376139150028774 not found
attempt #5:segment 438376139150028774 not found
attempt #6:segment 438376139150028774 not found
attempt #7:segment 438376139150028774 not found
attempt #8:segment 438376139150028774 not found
attempt #9:segment 438376139150028774 not found
attempt #10:segment 438376139150028774 not found
attempt #11:segment 438376139150028774 not found
goroutine 18651 [running]:
github.com/milvus-io/milvus/internal/datanode.flushNotifyFunc.func1(0xc004230b90)
/go/src/github.com/milvus-io/milvus/internal/datanode/flush_manager.go:853 +0x1296
github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish(0xc000c36180, 0xc0012d4840, 0xc0044acd30)
/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:204 +0xdb
created by github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).init.func1
/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:121 +0xf1
Anything else?
No response
https://github.com/milvus-io/milvus/blob/master/internal/datanode/flush_manager.go#L851-L852
@congqixia you might know about this, any idea of what the problem is and how it can be mitigated?
Hi @ThomasAlxDmy Could you offer logs about your datacoord and datanode? My guess is those segments are from a dropped collection, if for that case we can simply ignore the error than abort the service.
/assign @xiaofan-luan
log.Info("SaveBinlogPath",
zap.Int64("SegmentID", pack.segmentID),
zap.Int64("CollectionID", dsService.collectionID),
zap.Any("startPos", startPos),
zap.Any("checkPoints", checkPoints),
zap.Int("Length of Field2BinlogPaths", len(fieldInsert)),
zap.Int("Length of Field2Stats", len(fieldStats)),
zap.Int("Length of Field2Deltalogs", len(deltaInfos[0].GetBinlogs())),
zap.String("vChannelName", dsService.vchannelName),
)
from this log you will know the collectionID of these segments, and datacoord log will show if the collection is dropped. from the log your offered it only shows warn message, is there a way to show more?
If it's from a drop collection, One quick fix to let it work is to use birdwatcher tool to push forward checkpoint.(May loose some streaming data if you have other collection data in the stream)
sounds about right:
Here's the logs from data-coordinator node:
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143574409]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139142948494]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143363880]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143782385]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143363706]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143984223]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143988047]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143155733]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143369471]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143783160]
[2023/01/05 03:33:34.390 +00:00] [INFO] [datacoord/util.go:158] ["no vector index for the segment"] [collectionID=438376139142926320] [segmentID=438376139143575036]
[2023/01/05 03:33:34.391 +00:00] [WARN] [datacoord/handler.go:310] ["failed to load collection from rootcoord"] [collectionID=438376139142926320] [error="code: CollectionNotExists, reason: can't find collection: 438376139142926320"]
[2023/01/05 03:33:34.391 +00:00] [WARN] [datacoord/compaction_trigger.go:349] ["get compact time failed, skip to handle compaction"] [collectionID=438376139142926320] [partitionID=438376139142926321] [channel=by-dev-rootcoord-dml_19_438376139142926320v1]
[2023/01/05 03:33:34.558 +00:00] [DEBUG] [datacoord/services.go:877] [DataCoord.GetMetrics] [node_id=69] [req="{\"metric_type\":\"system_info\"}"] [metric_type=system_info] [metrics="status:<> response:\"{\\\"cluster\\\":{\\\"self\\\":{\\\"has_error\\\":false,\\\"error_reason\\\":\\\"\\\",\\\"name\\\":\\\"datacoord69\\\",\\\"hardware_infos\\\":{\\\"ip\\\":\\\"10.0.0.249:13333\\\",\\\"cpu_core_count\\\":4,\\\"cpu_core_usage\\\":4.865771812735999,\\\"memory\\\":16818016256,\\\"memory_usage\\\":105013248,\\\"disk\\\":104857600,\\\"disk_usage\\\":2097152},\\\"system_info\\\":{\\\"system_version\\\":\\\"0fdc1a04\\\",\\\"deploy_mode\\\":\\\"DISTRIBUTED\\\",\\\"build_version\\\":\\\"v2.2.2\\\",\\\"build_time\\\":\\\"Thu Dec 22 09:08:37 UTC 2022\\\",\\\"used_go_version\\\":\\\"go version go1.18.3 linux/amd64\\\"},\\\"created_time\\\":\\\"2023-01-05 01:54:34.339885567 +0000 UTC m=+23.456467038\\\",\\\"updated_time\\\":\\\"2023-01-05 01:54:34.339887409 +0000 UTC m=+23.456468816\\\",\\\"type\\\":\\\"datacoord\\\",\\\"id\\\":69,\\\"system_configurations\\\":{\\\"segment_max_size\\\":512},\\\"quota_metrics\\\":{\\\"TotalBinlogSize\\\":7543221972}},\\\"connected_nodes\\\":[{\\\"has_error\\\":false,\\\"error_reason\\\":\\\"\\\",\\\"name\\\":\\\"datanode116\\\",\\\"hardware_infos\\\":{\\\"ip\\\":\\\"10.0.0.72:21124\\\",\\\"cpu_core_count\\\":8,\\\"cpu_core_usage\\\":0.8760951187187002,\\\"memory\\\":16332128256,\\\"memory_usage\\\":1879347200,\\\"disk\\\":104857600,\\\"disk_usage\\\":2097152},\\\"system_info\\\":{\\\"system_version\\\":\\\"0fdc1a04\\\",\\\"deploy_mode\\\":\\\"DISTRIBUTED\\\",\\\"build_version\\\":\\\"v2.2.2\\\",\\\"build_time\\\":\\\"Thu Dec 22 09:08:37 UTC 2022\\\",\\\"used_go_version\\\":\\\"go version go1.18.3 linux/amd64\\\"},\\\"created_time\\\":\\\"2023-01-05 03:32:29.617158198 +0000 UTC m=+0.166727733\\\",\\\"updated_time\\\":\\\"2023-01-05 03:32:29.617158286 +0000 UTC m=+0.166727822\\\",\\\"type\\\":\\\"datanode\\\",\\\"id\\\":116,\\\"system_configurations\\\":{\\\"flush_insert_buffer_size\\\":16777216},\\\"quota_metrics\\\":{\\\"Hms\\\":{\\\"ip\\\":\\\"10.0.0.72:21124\\\",\\\"cpu_core_count\\\":8,\\\"cpu_core_usage\\\":0.8760951187187002,\\\"memory\\\":16332128256,\\\"memory_usage\\\":1879347200,\\\"disk\\\":104857600,\\\"disk_usage\\\":2097152},\\\"Rms\\\":[{\\\"Label\\\":\\\"InsertConsumeThroughput\\\",\\\"Rate\\\":0},{\\\"Label\\\":\\\"DeleteConsumeThroughput\\\",\\\"Rate\\\":0}],\\\"Fgm\\\":{\\\"MinFlowGraphChannel\\\":\\\"by-dev-rootcoord-dml_104_438376139208359401v15\\\",\\\"MinFlowGraphTt\\\":438537975051321348,\\\"NumFlowGraph\\\":72}}}]},\\\"connections\\\":{\\\"name\\\":\\\"datacoord69\\\",\\\"connected_components\\\":[]}}\" component_name:\"datacoord69\" "] []
[2023/01/05 03:33:34.862 +00:00] [DEBUG] [datacoord/meta.go:1196] ["UpdateChannelCheckpoint done"] [vChannel=by-dev-rootcoord-dml_108_438376139208359401v19] [time=2023/01/05 03:31:57.751 +00:00]
[2023/01/05 03:33:35.087 +00:00] [INFO] [datacoord/services.go:397] ["receive SaveBinlogPaths request"] [nodeID=116] [collectionID=438376139145806781] [segmentID=438376139150028783] [isFlush=false] [isDropped=false] [startPositions="[{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139149828714},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139150028775},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREIcQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-31-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381357996703747},\"segmentID\":438376139150028774},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139150028783},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139150028788}]"] [checkpoints="[{\"segmentID\":438376139150028783,\"position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJsQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358101823495},\"num_of_rows\":267}]"]
[2023/01/05 03:33:35.087 +00:00] [ERROR] [datacoord/services.go:412] ["failed to get segment"] [segmentID=438376139150028783] [stack="github.com/milvus-io/milvus/internal/datacoord.(*Server).SaveBinlogPaths\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:412\ngithub.com/milvus-io/milvus/internal/distributed/datacoord.(*Server).SaveBinlogPaths\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/service.go:309\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_SaveBinlogPaths_Handler.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5980\ngithub.com/milvus-io/milvus/internal/util/logutil.UnaryTraceLoggerInterceptor\n\t/go/src/github.com/milvus-io/milvus/internal/util/logutil/grpc_interceptor.go:22\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/tracing/opentracing/server_interceptors.go:38\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:34\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_SaveBinlogPaths_Handler\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5982\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:922"]
[2023/01/05 03:33:35.113 +00:00] [INFO] [datacoord/services.go:397] ["receive SaveBinlogPaths request"] [nodeID=116] [collectionID=438376139145806781] [segmentID=438376139150028788] [isFlush=false] [isDropped=false] [startPositions="[{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139150028788},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139149828714},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139150028775},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREIcQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-31-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381357996703747},\"segmentID\":438376139150028774},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139150028783}]"] [checkpoints="[{\"segmentID\":438376139150028788,\"position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJsQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358101823495},\"num_of_rows\":525}]"]
[2023/01/05 03:33:35.113 +00:00] [INFO] [datacoord/services.go:397] ["receive SaveBinlogPaths request"] [nodeID=116] [collectionID=438376139145806781] [segmentID=438376139150028774] [isFlush=false] [isDropped=false] [startPositions="[{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139150028775},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREIcQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-31-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381357996703747},\"segmentID\":438376139150028774},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139150028783},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139150028788},{\"start_position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"segmentID\":438376139149828714}]"] [checkpoints="[{\"segmentID\":438376139150028774,\"position\":{\"channel_name\":\"by-dev-rootcoord-dml_26_438376139145806781v2\",\"msgID\":\"CKcREJEQGAAgAA==\",\"msgGroup\":\"by-dev-dataNode-116-by-dev-rootcoord-dml_26_438376139145806781v2\",\"timestamp\":438381358049394691},\"num_of_rows\":260}]"]
[2023/01/05 03:33:35.113 +00:00] [ERROR] [datacoord/services.go:412] ["failed to get segment"] [segmentID=438376139150028788] [stack="github.com/milvus-io/milvus/internal/datacoord.(*Server).SaveBinlogPaths\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:412\ngithub.com/milvus-io/milvus/internal/distributed/datacoord.(*Server).SaveBinlogPaths\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/service.go:309\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_SaveBinlogPaths_Handler.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5980\ngithub.com/milvus-io/milvus/internal/util/logutil.UnaryTraceLoggerInterceptor\n\t/go/src/github.com/milvus-io/milvus/internal/util/logutil/grpc_interceptor.go:22\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/tracing/opentracing/server_interceptors.go:38\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:34\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_SaveBinlogPaths_Handler\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5982\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:922"]
[2023/01/05 03:33:35.113 +00:00] [ERROR] [datacoord/services.go:412] ["failed to get segment"] [segmentID=438376139150028774] [stack="github.com/milvus-io/milvus/internal/datacoord.(*Server).SaveBinlogPaths\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:412\ngithub.com/milvus-io/milvus/internal/distributed/datacoord.(*Server).SaveBinlogPaths\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/service.go:309\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_SaveBinlogPaths_Handler.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5980\ngithub.com/milvus-io/milvus/internal/util/logutil.UnaryTraceLoggerInterceptor\n\t/go/src/github.com/milvus-io/milvus/internal/util/logutil/grpc_interceptor.go:22\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/tracing/opentracing/server_interceptors.go:38\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:34\ngithub.com/milvus-io/milvus/internal/proto/datapb._DataCoord_SaveBinlogPaths_Handler\n\t/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:5982\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:922"]
Yep it seems 438376139142926320 is dropped. but for some reason we are still handling data in datanode memory.
Use https://github.com/milvus-io/birdwatcher to repair checkpoint to the latest might help you to fix the problem, I will take a look into why this is happening
/assign @wayblink I'm pretty sure this is due to collection dropped but checkpoint is not pushed forward. Could you take a look into it?
Following up this thread, as we are facing the same issue.
Following up this thread, as we are facing the same issue.
Hi, could you offer your logs? We will look into it.
Following up this thread, as we are facing the same issue.
Get it from colleague, working on it.
Use the birdwatcher to fix the checkpoint of a collection:
- install birdwatcher
git clone https://github.com/milvus-io/birdwatcher
cd birdwatcher
go build -o birdwatcher main.go
-
start birdwatcher cmd program and connect to etcd
./birdwatcher
connect --etcd localhost:2379 --rootPath by-dev
-
show all collections
show collections
-
use the dropped collection's id to show checkpoint
show checkpoint --collection 438812796177350661
-
fix checkpoint of the channels of the dropped collection
repair checkpoint --collection 438812796177350661 --vchannel by-dev-rootcoord-dml_2_438812796177350661v0 --mq_type pulsar --address pulsar://localhost:6650 --set_to latest-msgid
repair checkpoint --collection 438812796177350661 --vchannel by-dev-rootcoord-dml_3_438812796177350661v1 --mq_type pulsar --address pulsar://localhost:6650 --set_to latest-msgid
-
show checkpoint of the dropped collection to make sure it has been changed
show checkpoint --collection 438812796177350661
For Walmart issue, the collection is not dropped and the segment meta is lost. Datanode panic when trying to SaveBinlogPath. Note that it is a new segment, not compacted one. Former datacoord log is cleaned so we don't know why the segment meta disappear...
this problem should be fixed by @longjiquan
The reason is data GCed segment after rootcoord drop segment.
The fix pr is: Remove collection after GC finished
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen
.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen
.
i believe this was already fixed in 2.2.12
Confirmed it's fixed!