[Bug]: panic: failed to flush delete data
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: v2.1.0-hotfix-dcd6c9e
- Deployment mode(standalone or cluster): standalone
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus==2.0.2
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory:
Limits:
cpu: 32
memory: 16Gi
Requests:
cpu: 4
memory: 16Gi
- GPU:
00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
- Others:
Deployment is running on EKS
Current Behavior
When I deploy a fresh Milvus standalone setup via the Helm chart and we start inserting and deleting data, the setup panics after a short time (sometimes minutes, sometimes hours). The worst part is that once it has panicked, it is not possible to get it running again. Only by stopping everything, deleting the PVCs, and recreating an empty cluster from scratch without any data can we get a running system again; the system with the inserted data cannot be revived.
We consume from Kafka with a Python Kafka consumer and use the Milvus Python SDK to frequently insert and delete vectors with 100-300 dimensions (we experiment with different setups).
We were running on Milvus 1.1.1 and, because of some performance problems, recently upgraded to Milvus 2. We tried Milvus 2.0.2, 2.1.0, and v2.1.0-hotfix-dcd6c9e, first in the clustered version and then, after that failed, in standalone mode. Our use case is that we consume messages from Kafka, vectorize them, and then put them into Milvus via the Python SDK; we also have a webservice which queries Milvus. The inserts arrive every couple of seconds, so about 20-60 inserts into Milvus per minute. After some minutes or hours the clustered setup failed, with querynodes going offline and not being able to come online again (the logs do not show any meaningful message). When I try to reproduce the issues locally with some dummy generators/searches and docker-compose, it works quite well without errors, even with 500,000 vectors. But in the setup deployed on Kubernetes we get issues of all kinds (dying queryNode -> queryCoord, failed to find shard leader, etc.). The standalone deployment on Kubernetes also does not run stably with very little load and few vectors (20-60 inserts per minute, about 1-10k vectors).
open pid file: /run/milvus/standalone.pid
lock pid file: /run/milvus/standalone.pid
---Milvus Proxy successfully initialized and ready to serve!---
panic: failed to flush delete data, err = All attempts results:
attempt #1:cannot find segment, id = 435148680510046209
attempt #2:cannot find segment, id = 435148680510046209
attempt #3:cannot find segment, id = 435148680510046209
attempt #4:cannot find segment, id = 435148680510046209
attempt #5:cannot find segment, id = 435148680510046209
goroutine 1551 [running]:
github.com/milvus-io/milvus/internal/datanode.(*deleteNode).Operate(0xc000f0db90, 0xc0079068c0, 0x1, 0x1, 0x0, 0x0, 0x0)
/go/src/github.com/milvus-io/milvus/internal/datanode/flow_graph_delete_node.go:249 +0x10c5
github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).work(0xc002408680)
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:102 +0x23b
created by github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).Start
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:70 +0x70
Other logs:
__ _________ _ ____ ______
/ |/ / _/ /| | / / / / / __/
/ /|_/ // // /_| |/ / /_/ /\ \
/_/ /_/___/____/___/\____/___/
Welcome to use Milvus!
Version: v2.1.0-hotfix-dcd6c9e
Built: Mon Aug 1 08:54:24 UTC 2022
GitCommit: dcd6c9e5
GoVersion: go version go1.16.9 linux/amd64
open pid file: /run/milvus/standalone.pid
lock pid file: /run/milvus/standalone.pid
---Milvus Proxy successfully initialized and ready to serve!---
panic: insertNode processDeleteMessages failed, collectionID = 435147331058208065, err = partition 435148989329309697 hasn't been loaded or has been released, channel: by-dev-rootcoord-dml_1_435147331058208065v1
goroutine 191864 [running]:
github.com/milvus-io/milvus/internal/querynode.(*insertNode).Operate(0xc00b047170, 0xc00caef170, 0x1, 0x1, 0x0, 0x0, 0x0)
/go/src/github.com/milvus-io/milvus/internal/querynode/flow_graph_insert_node.go:234 +0x3c45
github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).work(0xc005772780)
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:102 +0x23b
created by github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).Start
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:70 +0x70
milvus2-etcd-0 1/1 Running 0 19h
milvus2-minio-7f7556f4b7-xp7r8 1/1 Running 0 18h
milvus2-standalone-c7bd4b9d9-gf8pj 0/1 Running 51 19h
Expected Behavior
- Milvus should run stably and not shut down because of (recoverable?) errors
- Milvus should be able to recover itself if it finds an inconsistent state
- Milvus should never end up in a state where it can't heal itself
Steps To Reproduce
1. Deploy Milvus standalone on an EKS cluster
2. Insert and delete vectors in parallel (the same id can be inserted and deleted multiple times in relatively short succession, i.e. within several seconds) at a rate of about 20-60 operations per minute (see the sketch below these steps)
3. After a short time some race conditions (?) appear which bring down Milvus
4. Milvus is not able to recover from this and restarts endlessly
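For reference, a minimal sketch of the kind of insert/delete loop we run (this is illustrative only: the collection name, schema, dimension, endpoint, and rate are placeholders, not our actual Kafka consumer):
import random
import time

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections, utility,
)

connections.connect(host="localhost", port="19530")  # placeholder endpoint

DIM = 128  # we use 100-300 dimensions depending on the experiment
schema = CollectionSchema([
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=DIM),
])
if not utility.has_collection("repro"):
    Collection("repro", schema)
collection = Collection("repro")

while True:
    # Insert a vector, then delete and re-insert the same id shortly afterwards,
    # at roughly 20-60 operations per minute overall.
    doc_id = random.randint(0, 10_000)
    vector = [random.random() for _ in range(DIM)]
    collection.insert([[doc_id], [vector]])
    time.sleep(random.uniform(1, 3))
    collection.delete(expr=f"id in [{doc_id}]")
    collection.insert([[doc_id], [vector]])
    time.sleep(random.uniform(1, 3))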
Milvus Log
I have 60MB of logs (gzipped 2.3 MB) where can I drop them?
Some excerpt:
[2022/08/09 04:37:59.035 +00:00] [ERROR] [datacoord/garbage_collector.go:148] ["parse segment id error"] [infoKey=file/delta_log/435147331058208065/435147415318167553/435148894957731843/435148894957731849] [error="file/delta_log/435147331058208065/435147415318167553/435148894957731843/435148894957731849 is not a valid binlog path"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*garbageCollector).scan\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/garbage_collector.go:148\ngithub.com/milvus-io/milvus/internal/datacoord.(*garbageCollector).work\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/garbage_collector.go:98"]
[2022/08/09 04:37:59.035 +00:00] [ERROR] [datacoord/garbage_collector.go:148] ["parse segment id error"] [infoKey=file/delta_log/435147331058208065/435147415318167553/435149549579796488/435149549592903686] [error="file/delta_log/435147331058208065/435147415318167553/435149549579796488/435149549592903686 is not a valid binlog path"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*garbageCollector).scan\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/garbage_collector.go:148\ngithub.com/milvus-io/milvus/
...
[2022/08/09 04:38:00.026 +00:00] [DEBUG] [client/client.go:91] ["QueryCoordClient msess key not existed"] [key=querycoord]
[2022/08/09 04:38:00.026 +00:00] [ERROR] [grpcclient/client.go:140] ["failed to get client address"] [error="find no available querycoord, check querycoord state"] [stack="github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).connect\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:140\ngithub.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).GetGrpcClient\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:112\ngithub.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).callOnce\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:202\ngithub.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).ReCall\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:257\ngithub.com/milvus-io/milvus/internal/distributed/querycoord/client.(*Client).GetComponentStates\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/client/client.go:118\ngithub.com/milvus-io/milvus/internal/util/funcutil.WaitForComponentStates.func1\n\t/go/src/github.com/milvus-io/milvus/internal/util/funcutil/func.go:65\ngithub.com/milvus-io/milvus/internal/util/retry.Do\n\t/go/src/github.com/milvus-io/milvus/internal/util/retry/retry.go:37\ngithub.com/milvus-io/milvus/internal/util/funcutil.WaitForComponentStates\n\t/go/src/github.com/milvus-io/milvus/internal/util/funcutil/func.go:89\ngithub.com/milvus-io/milvus/internal/util/funcutil.WaitForComponentHealthy\n\t/go/src/github.com/milvus-io/milvus/internal/util/funcutil/func.go:104\ngithub.com/milvus-io/milvus/internal/distributed/proxy.(*Server).init\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/proxy/service.go:461\ngithub.com/milvus-io/milvus/internal/distributed/proxy.(*Server).Run\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/proxy/service.go:288\ngithub.com/milvus-io/milvus/cmd/components.(*Proxy).Run\n\t/go/src/github.com/milvus-io/milvus/cmd/components/proxy.go:50\ngithub.com/milvus-io/milvus/cmd/roles.(*MilvusRoles).runProxy.func1\n\t/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:155"]
@xiaofan-luan could you please take a look at this issue? /assign @xiaofan-luan /unassign
/assign @XuanYang-cn pls help on it
@xiaofan-luan: GitHub didn't allow me to assign the following users: pls, help, on, it.
Note that only milvus-io members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide
Milvus Log
I have 60MB of logs (gzipped 2.3 MB) where can I drop them?
@datenhahn I would like to examine the logs, please email them to me at [email protected]
[2022/08/09 04:37:59.035 +00:00] [ERROR] [datacoord/garbage_collector.go:148] ["parse segment id error"] [infoKey=file/delta_log/435147331058208065/435147415318167553/435148894957731843/435148894957731849] [error="file/delta_log/435147331058208065/435147415318167553/435148894957731843/435148894957731849 is not a valid binlog path"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*garbageCollector).scan\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/garbage_collector.go:148\ngithub.com/milvus-io/milvus/internal/datacoord.(*garbageCollector).work\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/garbage_collector.go:98"]
[2022/08/09 04:37:59.035 +00:00] [ERROR] [datacoord/garbage_collector.go:148] ["parse segment id error"] [infoKey=file/delta_log/435147331058208065/435147415318167553/435149549579796488/435149549592903686] [error="file/delta_log/435147331058208065/435147415318167553/435149549579796488/435149549592903686 is not a valid binlog path"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*garbageCollector).scan\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/garbage_collector.go:148\ngithub.com/milvus-io/milvus/
This is a bug that needs to be fixed, but it doesn't influence the recovery process and doesn't cause the panic, so I'll track it in issue #18585 and keep investigating.
open pid file: /run/milvus/standalone.pid
lock pid file: /run/milvus/standalone.pid
---Milvus Proxy successfully initialized and ready to serve!---
panic: failed to flush delete data, err = All attempts results:
attempt #1:cannot find segment, id = 435148680510046209
attempt #2:cannot find segment, id = 435148680510046209
attempt #3:cannot find segment, id = 435148680510046209
attempt #4:cannot find segment, id = 435148680510046209
attempt #5:cannot find segment, id = 435148680510046209
goroutine 1551 [running]:
github.com/milvus-io/milvus/internal/datanode.(*deleteNode).Operate(0xc000f0db90, 0xc0079068c0, 0x1, 0x1, 0x0, 0x0, 0x0)
/go/src/github.com/milvus-io/milvus/internal/datanode/flow_graph_delete_node.go:249 +0x10c5
github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).work(0xc002408680)
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:102 +0x23b
created by github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).Start
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:70 +0x70
@datenhahn could you email me the logs containing this part?
@XuanYang-cn I have mailed you the other logs. I didn't realize the logs had been rotated, so the new file is 300MB.
Maybe some more information about our setup which might be special: we use lots of partitions to speed up searches, and we add every vector to multiple partitions:
- Year (e.g. 2022)
- Month (e.g. 2022-08)
- Week (e.g. 2022-08-KW32)
- Day (e.g. 2022-08-10)
So we have lots of partitions with (initially) few vectors; as we add more vectors, the partitions fill up with more data.
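For illustration, a rough sketch of how those partitions are derived and written to (the helper names and the calendar-week formatting are simplified assumptions, not our exact code):
from datetime import date

from pymilvus import Collection

def time_partitions(day: date):
    """Return the year/month/week/day partition names for a given date."""
    _, week, _ = day.isocalendar()
    return [
        f"{day.year}",                                # e.g. "2022"
        f"{day.year}-{day.month:02d}",                # e.g. "2022-08"
        f"{day.year}-{day.month:02d}-KW{week:02d}",   # e.g. "2022-08-KW32"
        day.isoformat(),                              # e.g. "2022-08-10"
    ]

def insert_into_time_partitions(collection: Collection, doc_id: int, embedding, day: date):
    # Every vector is inserted into all four time-bucket partitions.
    for name in time_partitions(day):
        if not collection.has_partition(name):
            collection.create_partition(name)
        collection.insert([[doc_id], [embedding]], partition_name=name)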
@datenhahn I didn't receive the 300MB log, could you please resend me one? I'll try to reproduce this panic locally following your usage information while waiting for the logs.
I have some questions about usage details, looking forward to your reply:
- Did you load the whole collection, or only some of the partitions?
- Did you call num_entities every time there's an insert/delete?
- How did you query or search after inserting? Was it a query by ID or a search with limits?
Now I'm trying to figure out these problems:
- What caused the datanode panic (lack of information)
- What caused the querynode panic (lack of information, too)
- What made querycoord unable to recover (maybe related to the cluster issue #18595)
@XuanYang-cn I have uploaded the logs here now, maybe that works: https://drive.google.com/file/d/1TJYGAx_H2emOcF0XQruhlozDzGI9008I/view?usp=sharing
- We loaded the collection (I was not aware it is possible or required to load partitions separately), and the collection was shown in Attu as "loaded"
- Yes, I just saw that we used it as debugging information:
logger.info(f"Number of all inserted vectors: {inserted_vectors.insert_count}.")
logger.info(
f"Vectors {inserted_vectors.primary_keys} successfully added to partition {partition} of collection {collection.name}")
logger.info(
f"New number of entities in collection {collection.name}:\n{collection.num_entities}")
- The delete uses a "id in ..." expression, the search uses vectors with limit. The delete use the following expression:
check_if_id_exists = f"id in {[id]}"
deleted_entity = collection.delete(expr=check_if_id_exists, partition_name=partition)
The queries use vector search:
@run_time
def query_milvus(query_vectors, request, search_partitions, path=""):
    """Return the vectors of the documents that are closest to the query vector"""
    list_of_query_vectors = [dict_elem["embeddings"] if isinstance(dict_elem, dict) else dict_elem for dict_elem in query_vectors]
    query_results = collection.search(
        data=list_of_query_vectors,
        anns_field="embeddings",
        param=search_params,
        partition_names=search_partitions,
        limit=request.num_similar_documents
    )
    logger.info(f"Milvus search finished. Results: {query_results}")
    return query_results
Regarding the problems:
- The DataNode might have been killed by the Kubernetes cluster (e.g. we are using AWS Spot Instances)
Regarding the rest, maybe you'll find something in the log files.
Here are even more logs in the other issue (https://github.com/milvus-io/milvus/issues/18595#issuecomment-1210780825); it is the same setup. I tried Milvus Cluster and Milvus Standalone on Kubernetes, but both run into issues.
For me I see the following problems:
- If Milvus is running into resource issues it should block/reject further insert/delete requests, but it should not completely shut down
- If Milvus enters an invalid cluster state (e.g. important services like the datanode are missing) it should not panic, but keep logging/retrying until the resource is available again
- If Milvus dies for whatever reason, from panics or because AWS or Kubernetes ripped some cluster nodes away, it should be able to recover itself into some workable/debuggable state
- If the damage done to the cluster is too large to heal (e.g. permanent data loss), it should either shut down completely with meaningful messages or recover with the data that is still available (with the danger of not noticing that there was a data loss)
- One note: liveness probes might prevent the cluster from reaching a valid state if recovery takes longer than the probe allows; Kubernetes then marks the pod as dead, terminates and recreates it, introducing an endless recovery/restart loop. Such a restart loop does seem to happen here, but I don't think time is the issue, as queryCoord never recovers (to test whether it just needs more time, I increased the liveness probe retries to a very high number).
Currently I have the feeling that:
- Once resource limits are reached (memory, queue sizes, whatever), Milvus will just panic
- Once panicked, it cannot pull itself together anymore due to inconsistent state
- The required sizing for many vectors/partitions and the required config settings (e.g. queue sizes, memory sizes) are not clear; maybe we need to configure something differently?
Hi @datenhahn, thanks for the very detailed feedback. The problems you point out are exactly what we are trying to address in Milvus 2.2:
- Limits on data ingestion rate and data size to avoid system crashes
- Instead of panicking, processing some of the requests asynchronously and retrying until the service recovers
- Protection and better scheduling of search requests
- Refactoring meta to avoid inconsistency (especially when failures happen)
Stability will be the first priority of the community~ If you'd like to help, that would be great, because we love your suggestions.
@datenhahn feel free to reach out to me at [email protected]~ I'd like to offer more help.
Why the DataNode panics at delete_buffer_node and how we're going to fix it:
- insert_buffer_node processes flush messages at one per 200ms tick (tt) per segment, so when a large number of flushed segments arrives, it takes a very long time to pass all the flush messages through.
- There are lots of small segments (tens of thousands of segments with num_rows==1), which generates many compactions.
- When a compaction happens after a flush but the compaction finishes faster than the flush, delete_node will panic.
// Manual Flush
select {
case fmsg := <-ibNode.flushChan:
    log.Info("(Manual Flush) receiving flush message",
        zap.Int64("collectionID", fmsg.collectionID),
...
for i, msg := range fgMsg.deleteMessages {
    tmpSegIDs, err := dn.bufferDeleteMsg(msg, fgMsg.timeRange)
}
...
func (dn *deleteNode) bufferDeleteMsg(msg *msgstream.DeleteMsg, tr TimeRange) ([]UniqueID, error) {
    ...
    // Update delBuf for merged segments
    compactedTo2From := dn.replica.listCompactedSegmentIDs()
    for compactedTo, compactedFrom := range compactedTo2From {
        compactToDelBuff := newDelDataBuf()
        for _, segID := range compactedFrom {
            value, loaded := dn.delBuf.LoadAndDelete(segID)
            if loaded {
                compactToDelBuff.updateFromBuf(value.(*DelDataBuf))
            }
            ...
How to fix:
- Increase the number of tasks received by manual flush in each tick.
- Check compacted segments even when there are no delete messages.
Bits of advice on usage:
- Don't generate small segments. Small segments cannot build indexes.
- We currently seal a segment whenever num_entities is called. With this in mind, I suggest not calling num_entities after every insert/delete. To get the number of rows, you can use get_query_segment_info instead.
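For example, a minimal sketch of reading the row count that way with pymilvus (the connection parameters and the collection name "my_collection" are placeholders):
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")  # placeholder endpoint

# Sum the per-segment row counts reported by the query nodes instead of
# calling collection.num_entities (which seals the growing segment).
segment_infos = utility.get_query_segment_info("my_collection")
total_rows = sum(info.num_rows for info in segment_infos)
print(f"Rows currently served by the query nodes: {total_rows}")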
/assign @aoiasd
I could repro this quite reliably on v2.1.1.
I just upgraded to milvusdb/milvus:master-20220914-16bfa63 and I can't repro any more.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.