
[Bug]: service exited with a WARN [datanode/flush_manager.go:853] ["failed to SaveBinlogPaths"]

Open fall-nebula opened this issue 1 year ago • 5 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.2.5
- Deployment mode(standalone or cluster): standalone 
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

I found that the milvus-standalone container had exited in Docker Compose.

# docker compose ps
NAME                COMMAND                  SERVICE             STATUS              PORTS
milvus-etcd         "etcd -advertise-cli…"   etcd                running             2379-2380/tcp
milvus-minio        "/usr/bin/docker-ent…"   minio               running (healthy)   0.0.0.0:9000->9000/tcp, :::9000->9000/tcp
milvus-standalone   "/tini -- milvus run…"   standalone          exited (2)

It worked for nearly three days from startup until it exited. I don't think I did anything special to the program other than manually compacting data via the SDK. I just followed the official documentation and used milvus-sdk-java 2.2.4 to operate Milvus, for example to manage collections, partitions, and data. I dropped a partition with 2000 rows of data 3 days ago. Before the program exited, I bulk-inserted several batches of data from MinIO and used the SDK to delete some of them by primary key, leaving about 2000 rows. Then I tried the manual compaction method provided by the SDK to see its effect.

        final ManualCompactParam param = ManualCompactParam.newBuilder()
                .withCollectionName(collectionName)
                .build();
        final R<ManualCompactionResponse> responseR = milvusClient.manualCompact(param);
        return responseR.getData().getCompactionID();

Then I found that the service had exited from Docker Compose, and I'm not sure whether that is what caused it.
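Roughly, the partition drop and the delete-by-primary-key mentioned above look like this with milvus-sdk-java 2.2.x. This is only a sketch; the partition name and the primary-key expression below are placeholders, not the real values from my application.

    // Sketch only; names and expressions are placeholders.
    // import io.milvus.grpc.MutationResult;
    // import io.milvus.param.R;
    // import io.milvus.param.RpcStatus;
    // import io.milvus.param.dml.DeleteParam;
    // import io.milvus.param.partition.DropPartitionParam;

    // Drop the partition that held about 2000 rows (partition name is a placeholder).
    final R<RpcStatus> dropResp = milvusClient.dropPartition(
            DropPartitionParam.newBuilder()
                    .withCollectionName(collectionName)
                    .withPartitionName("old_partition")
                    .build());

    // Delete some of the bulk-inserted rows by primary key (expression is a placeholder).
    final R<MutationResult> deleteResp = milvusClient.delete(
            DeleteParam.newBuilder()
                    .withCollectionName(collectionName)
                    .withExpr("id in [1, 2, 3]")
                    .build());

As with the compaction snippet above, it is usually worth checking the returned status (responseR.getStatus() against R.Status.Success.getCode()) before calling getData(), since getData() can be null when a call fails.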

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

[2023/04/06 03:22:21.589 +00:00] [DEBUG] [rootcoord/task.go:68] ["tr/ddl request"] [msg="enqueue done"] [duration=6.672µs]
[2023/04/06 03:22:21.589 +00:00] [DEBUG] [rootcoord/task.go:70] ["tr/ddl request"] [msg="start to process"] [duration=92.53µs]
[2023/04/06 03:22:21.590 +00:00] [DEBUG] [etcd/etcd_kv.go:769] [tr/getEtcdMeta] [msg="getEtcdMeta done"] [duration=1.336416ms]
[2023/04/06 03:22:21.593 +00:00] [DEBUG] [proxy/impl.go:3918] [Proxy.GetProxyMetrics] [traceID=7345293fd9c6c8d2] [nodeID=8] [req="{"metric_type":"system_info"}"] [metricType=system_info]
[2023/04/06 03:22:21.598 +00:00] [INFO] [proxy/impl.go:4748] ["current rates in proxy"] [proxyNodeID=8] [rates="[{"rt":7,"r":1.7976931348623157e+308},{"rt":9,"r":1.7976931348623157e+308},{"rt":5,"r":1.7976931348623157e+308},{"rt":6,"r":1.7976931348623157e+308},{"rt":8,"r":1.7976931348623157e+308}]"]
[2023/04/06 03:22:21.645 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=174.646µs]
[2023/04/06 03:22:21.645 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=261.492µs]
[2023/04/06 03:22:21.645 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=226.429µs]
[2023/04/06 03:22:21.645 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=413.232µs]
[2023/04/06 03:22:21.645 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=136.745µs]
[2023/04/06 03:22:21.646 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=304.659µs]
[2023/04/06 03:22:21.646 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=729.799µs]
[2023/04/06 03:22:21.646 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=682.875µs]
[2023/04/06 03:22:21.646 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=107.259µs]
[2023/04/06 03:22:21.646 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=67.294µs]
[2023/04/06 03:22:21.646 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=133.503µs]
[2023/04/06 03:22:21.646 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=1.418087ms]
[2023/04/06 03:22:21.647 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=411.959µs]
[2023/04/06 03:22:21.647 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=564.66µs]
[2023/04/06 03:22:21.647 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=435.996µs]
[2023/04/06 03:22:21.647 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=873.59µs]
[2023/04/06 03:22:21.745 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=234.509µs]
[2023/04/06 03:22:21.745 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=379.266µs]
[2023/04/06 03:22:21.788 +00:00] [DEBUG] [rootcoord/task.go:68] ["tr/ddl request"] [msg="enqueue done"] [duration=3.573µs]
[2023/04/06 03:22:21.788 +00:00] [DEBUG] [rootcoord/task.go:70] ["tr/ddl request"] [msg="start to process"] [duration=81.48µs]
[2023/04/06 03:22:21.831 +00:00] [WARN] [datanode/flush_manager.go:853] ["failed to SaveBinlogPaths"] ["segment ID"=440556259765922844] [error="All attempts results:\nattempt #1:segment 440556259765922844 not found\nattempt #2:segment 440556259765922844 not found\nattempt #3:segment 440556259765922844 not found\nattempt #4:segment 440556259765922844 not found\nattempt #5:segment 440556259765922844 not found\nattempt #6:segment 440556259765922844 not found\nattempt #7:segment 440556259765922844 not found\nattempt #8:segment 440556259765922844 not found\nattempt #9:segment 440556259765922844 not found\nattempt #10:segment 440556259765922844 not found\nattempt #11:segment 440556259765922844 not found\n"]
milvus-standalone | panic: All attempts results:
milvus-standalone | attempt #1:segment 440556259765922844 not found
milvus-standalone | attempt #2:segment 440556259765922844 not found
milvus-standalone | attempt #3:segment 440556259765922844 not found
milvus-standalone | attempt #4:segment 440556259765922844 not found
milvus-standalone | attempt #5:segment 440556259765922844 not found
milvus-standalone | attempt #6:segment 440556259765922844 not found
milvus-standalone | attempt #7:segment 440556259765922844 not found
milvus-standalone | attempt #8:segment 440556259765922844 not found
milvus-standalone | attempt #9:segment 440556259765922844 not found
milvus-standalone | attempt #10:segment 440556259765922844 not found
milvus-standalone | attempt #11:segment 440556259765922844 not found
milvus-standalone |
milvus-standalone |
milvus-standalone | goroutine 19679713 [running]:
milvus-standalone | github.com/milvus-io/milvus/internal/datanode.flushNotifyFunc.func1(0xc005c259f0)
milvus-standalone | /go/src/github.com/milvus-io/milvus/internal/datanode/flush_manager.go:857 +0x13a8
milvus-standalone | github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish(0xc008b07200, 0xc006fe5e60, 0xc005a40eb0)
milvus-standalone | /go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:204 +0xdb
milvus-standalone | created by github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).init.func1
milvus-standalone | /go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:121 +0xf1

Anything else?

No response

fall-nebula avatar Apr 06 '23 06:04 fall-nebula

Dropping a partition only drops its segments on DataCoord, but the DataNode still holds these segments. A later delete or insert may trigger a flush of such a dropped segment, which then panics.
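In other words, a client sequence roughly like the one below can end up in this state on 2.2.x. This is only an illustrative sketch with made-up names; whether a later flush actually targets the dropped segment depends on timing.

    // Illustrative sketch (milvus-sdk-java); collection/partition names, field data,
    // and the delete expression are all placeholders.

    // 1) Insert rows into a partition, so the DataNode has growing segments for it.
    milvusClient.insert(InsertParam.newBuilder()
            .withCollectionName("my_collection")
            .withPartitionName("p1")
            .withFields(fields)          // placeholder List<InsertParam.Field>
            .build());

    // 2) Drop the partition: DataCoord drops the segment metadata,
    //    but the DataNode still holds the segment.
    milvusClient.dropPartition(DropPartitionParam.newBuilder()
            .withCollectionName("my_collection")
            .withPartitionName("p1")
            .build());

    // 3) Later inserts/deletes can trigger a flush that includes the dropped segment;
    //    SaveBinlogPaths then fails with "segment not found" and the DataNode panics
    //    after exhausting its retries (the panic shown in the log above).
    milvusClient.delete(DeleteParam.newBuilder()
            .withCollectionName("my_collection")
            .withExpr("id in [1, 2, 3]")
            .build());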

aoiasd avatar Apr 06 '23 09:04 aoiasd

/assign @aoiasd /unassign

yanliang567 avatar Apr 06 '23 09:04 yanliang567

Dropping a partition only drops its segments on DataCoord, but the DataNode still holds these segments. A later delete or insert may trigger a flush of such a dropped segment, which then panics.

Dropping a partition doesn't require releasing the collection? WTF

xiaofan-luan avatar Apr 10 '23 17:04 xiaofan-luan

@bigsheeper will this be solved in 2.3?

xiaofan-luan avatar Apr 10 '23 17:04 xiaofan-luan

@bigsheeper will this be solved in 2.3?

2.3 will not panic, because SaveBinlogPaths will return success if the segment was dropped, but it will still use more resources because the DataNode still keeps these dropped segments.

aoiasd avatar Apr 12 '23 10:04 aoiasd

Fixed.

aoiasd avatar May 22 '23 02:05 aoiasd

Has this issue been resolved? I still see this error in Milvus version 2.2.10. Can you help me resolve it?

xiaozhuang-a avatar Jul 29 '23 15:07 xiaozhuang-a

dataNode:

[2023/07/29 15:43:27.374 +00:00] [INFO] [datanode/flow_graph_insert_buffer_node.go:469] ["insertBufferNode syncing BufferData"] [segmentID=442776935130016392] [flushed=false] [dropped=false] [auto=false] [position="channel_name:\"by-dev-rootcoord-dml_444_442776935130016343v0\" msgID:\"\\010\\225\\227\\202\\001\\020\\214\\230\\001\\030\\000 \\000\" msgGroup:\"by-dev-dataNode-55-by-dev-rootcoord-dml_444_442776935130016343v0\" timestamp:443192549629755393 "] [channel=by-dev-rootcoord-dml_444_442776935130016343v0]
[2023/07/29 15:43:27.374 +00:00] [INFO] [datanode/flush_manager.go:287] ["handling insert task"] ["segment ID"=442776935130016392] [flushed=false] [dropped=false] [position="channel_name:\"by-dev-rootcoord-dml_444_442776935130016343v0\" msgID:\"\\010\\225\\227\\202\\001\\020\\214\\230\\001\\030\\000 \\000\" msgGroup:\"by-dev-dataNode-55-by-dev-rootcoord-dml_444_442776935130016343v0\" timestamp:443192549629755393 "]
[2023/07/29 15:43:27.374 +00:00] [INFO] [datanode/flush_manager.go:139] ["new flush task runner created and initialized"] ["segment ID"=442776935130016392] ["pos message ID"="\u0008\ufffd\ufffd\ufffd\u0001\u0010\ufffd\ufffd\u0001\u0018\u0000 \u0000"]
[2023/07/29 15:43:27.374 +00:00] [INFO] [datanode/flush_task.go:134] ["running flush insert task"] ["segment ID"=442776935130016392] [flushed=false] [dropped=false] [position="channel_name:\"by-dev-rootcoord-dml_444_442776935130016343v0\" msgID:\"\\010\\225\\227\\202\\001\\020\\214\\230\\001\\030\\000 \\000\" msgGroup:\"by-dev-dataNode-55-by-dev-rootcoord-dml_444_442776935130016343v0\" timestamp:443192549629755393 "] [PosTime=2023/07/29 15:43:27.218 +00:00]
[2023/07/29 15:43:27.374 +00:00] [INFO] [datanode/channel_meta.go:453] ["roll pk stats"] ["segment id"=442776935130016392]
[2023/07/29 15:43:27.374 +00:00] [INFO] [datanode/flow_graph_delete_node.go:118] ["DeleteNode receives flush message"] [segIDs="[442776935130016392]"] [vChannelName=by-dev-rootcoord-dml_444_442776935130016343v0] [posTime=2023/07/29 15:43:27.218 +00:00]
[2023/07/29 15:43:27.374 +00:00] [INFO] [datanode/flush_manager.go:314] ["handling delete task"] ["segment ID"=442776935130016392]
[2023/07/29 15:43:27.451 +00:00] [WARN] [datanode/flush_manager.go:853] ["failed to SaveBinlogPaths"] ["segment ID"=442776935130016392] [error="All attempts results:\nattempt #1:segment 442776935130016392 not found\nattempt #2:segment 442776935130016392 not found\nattempt #3:segment 442776935130016392 not found\nattempt #4:segment 442776935130016392 not found\nattempt #5:segment 442776935130016392 not found\nattempt #6:segment 442776935130016392 not found\nattempt #7:segment 442776935130016392 not found\nattempt #8:segment 442776935130016392 not found\nattempt #9:segment 442776935130016392 not found\nattempt #10:segment 442776935130016392 not found\nattempt #11:segment 442776935130016392 not found\n"]
panic: All attempts results:
attempt #1:segment 442776935130016392 not found
attempt #2:segment 442776935130016392 not found
attempt #3:segment 442776935130016392 not found
attempt #4:segment 442776935130016392 not found
attempt #5:segment 442776935130016392 not found
attempt #6:segment 442776935130016392 not found
attempt #7:segment 442776935130016392 not found
attempt #8:segment 442776935130016392 not found
attempt #9:segment 442776935130016392 not found
attempt #10:segment 442776935130016392 not found
attempt #11:segment 442776935130016392 not found


goroutine 6092910 [running]:
github.com/milvus-io/milvus/internal/datanode.flushNotifyFunc.func1(0xc09a773ae0)
	/go/src/github.com/milvus-io/milvus/internal/datanode/flush_manager.go:857 +0x13f6
github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish(0xc0c102e180, 0xc011be0a80, 0xc0a69764d0)
	/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:204 +0xdb
created by github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).init.func1
	/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:121 +0xf1

xiaozhuang-a avatar Jul 29 '23 15:07 xiaozhuang-a

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Sep 04 '23 05:09 stale[bot]