[Bug]: panic: failed to flush delete data
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: v2.1.0-hotfix-dcd6c9e
- Deployment mode(standalone or cluster): standalone
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus==2.0.2
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory:
Limits:
cpu: 32
memory: 16Gi
Requests:
cpu: 4
memory: 16Gi
- GPU:
00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
- Others:
Deployment is running on EKS
Current Behavior
When I deploy a fresh Milvus standalone setup via the Helm chart and we start inserting and deleting data, the setup panics after a short time (sometimes minutes, sometimes hours). The worst part is that once it has panicked, it is not possible to get it running again. Only by stopping everything, deleting the PVCs, and recreating an empty cluster from scratch without any data can we get a running system again; the system with the inserted data cannot be revived.
We consume from Kafka with a Python Kafka consumer and use the Milvus Python SDK to frequently insert and delete vectors with 100-300 dimensions (we experiment with different setups).
We were running on Milvus 1.1.1 and, because of some performance problems, recently upgraded to Milvus 2. We tried Milvus 2.0.2, 2.1.0, and v2.1.0-hotfix-dcd6c9e, first in the clustered version and then, after that failed, in standalone mode. Our use case is that we consume messages from Kafka, vectorize them, and then put them into Milvus via the Python SDK; we also have a webservice which queries Milvus. The inserts arrive every couple of seconds, so about 20-60 inserts into Milvus per minute. After some minutes or hours the clustered setup failed, with querynodes going offline and not being able to come online again (the logs do not show any meaningful message). When I try to reproduce the issues locally with some dummy generators/searches and docker-compose, it works quite well without errors, even with 500,000 vectors. But in the setup deployed on Kubernetes we get issues of all kinds (dying queryNode -> queryCoord, failed to find shard leader, etc.). The standalone deployment on Kubernetes also does not run stably with very little load and few vectors (20-60 inserts per minute, about 1-10k vectors).
open pid file: /run/milvus/standalone.pid
lock pid file: /run/milvus/standalone.pid
---Milvus Proxy successfully initialized and ready to serve!---
panic: failed to flush delete data, err = All attempts results:
attempt #1:cannot find segment, id = 435148680510046209
attempt #2:cannot find segment, id = 435148680510046209
attempt #3:cannot find segment, id = 435148680510046209
attempt #4:cannot find segment, id = 435148680510046209
attempt #5:cannot find segment, id = 435148680510046209
goroutine 1551 [running]:
github.com/milvus-io/milvus/internal/datanode.(*deleteNode).Operate(0xc000f0db90, 0xc0079068c0, 0x1, 0x1, 0x0, 0x0, 0x0)
/go/src/github.com/milvus-io/milvus/internal/datanode/flow_graph_delete_node.go:249 +0x10c5
github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).work(0xc002408680)
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:102 +0x23b
created by github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).Start
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:70 +0x70
Other logs:
__ _________ _ ____ ______
/ |/ / _/ /| | / / / / / __/
/ /|_/ // // /_| |/ / /_/ /\ \
/_/ /_/___/____/___/\____/___/
Welcome to use Milvus!
Version: v2.1.0-hotfix-dcd6c9e
Built: Mon Aug 1 08:54:24 UTC 2022
GitCommit: dcd6c9e5
GoVersion: go version go1.16.9 linux/amd64
open pid file: /run/milvus/standalone.pid
lock pid file: /run/milvus/standalone.pid
---Milvus Proxy successfully initialized and ready to serve!---
panic: insertNode processDeleteMessages failed, collectionID = 435147331058208065, err = partition 435148989329309697 hasn't been loaded or has been released, channel: by-dev-rootcoord-dml_1_435147331058208065v1
goroutine 191864 [running]:
github.com/milvus-io/milvus/internal/querynode.(*insertNode).Operate(0xc00b047170, 0xc00caef170, 0x1, 0x1, 0x0, 0x0, 0x0)
/go/src/github.com/milvus-io/milvus/internal/querynode/flow_graph_insert_node.go:234 +0x3c45
github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).work(0xc005772780)
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:102 +0x23b
created by github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).Start
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:70 +0x70
milvus2-etcd-0 1/1 Running 0 19h
milvus2-minio-7f7556f4b7-xp7r8 1/1 Running 0 18h
milvus2-standalone-c7bd4b9d9-gf8pj 0/1 Running 51 19h
Expected Behavior
- Milvus should run stably and not shut down because of (recoverable?) errors
- Milvus should be able to recover itself if it finds an inconsistent state
- Milvus should never end up in a state where it can't heal itself
Steps To Reproduce
1. Deploy Milvus standalone on an EKS cluster
2. Insert and delete vectors in parallel (the same id can be inserted and deleted multiple times in relatively short succession, i.e. within several seconds) at a rate of about 20-60 operations per minute (see the sketch below these steps)
3. After a short time some race conditions (?) appear which bring down Milvus
4. Milvus is not able to recover from this and restarts endlessly
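For reference, a minimal sketch of the kind of insert/delete loop we run (this is illustrative only: the collection name, schema, dimension, endpoint, and rate are placeholders, not our actual Kafka consumer):
import random
import time

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections, utility,
)

connections.connect(host="localhost", port="19530")  # placeholder endpoint

DIM = 128  # we use 100-300 dimensions depending on the experiment
schema = CollectionSchema([
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=DIM),
])
if not utility.has_collection("repro"):
    Collection("repro", schema)
collection = Collection("repro")

while True:
    # Insert a vector, then delete and re-insert the same id shortly afterwards,
    # at roughly 20-60 operations per minute overall.
    doc_id = random.randint(0, 10_000)
    vector = [random.random() for _ in range(DIM)]
    collection.insert([[doc_id], [vector]])
    time.sleep(random.uniform(1, 3))
    collection.delete(expr=f"id in [{doc_id}]")
    collection.insert([[doc_id], [vector]])
    time.sleep(random.uniform(1, 3))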
Milvus Log
I have 60MB of logs (gzipped 2.3 MB) where can I drop them?
Some excerpt:
[2022/08/09 04:37:59.035 +00:00] [ERROR] [datacoord/garbage_collector.go:148] ["parse segment id error"] [infoKey=file/delta_log/435147331058208065/435147415318167553/435148894957731843/435148894957731849] [error="file/delta_log/435147331058208065/435147415318167553/435148894957731843/435148894957731849 is not a valid binlog path"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*garbageCollector).scan\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/garbage_collector.go:148\ngithub.com/milvus-io/milvus/internal/datacoord.(*garbageCollector).work\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/garbage_collector.go:98"]
[2022/08/09 04:37:59.035 +00:00] [ERROR] [datacoord/garbage_collector.go:148] ["parse segment id error"] [infoKey=file/delta_log/435147331058208065/435147415318167553/435149549579796488/435149549592903686] [error="file/delta_log/435147331058208065/435147415318167553/435149549579796488/435149549592903686 is not a valid binlog path"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*garbageCollector).scan\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/garbage_collector.go:148\ngithub.com/milvus-io/milvus/
...
[2022/08/09 04:38:00.026 +00:00] [DEBUG] [client/client.go:91] ["QueryCoordClient msess key not existed"] [key=querycoord]
[2022/08/09 04:38:00.026 +00:00] [ERROR] [grpcclient/client.go:140] ["failed to get client address"] [error="find no available querycoord, check querycoord state"] [stack="github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).connect\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:140\ngithub.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).GetGrpcClient\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:112\ngithub.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).callOnce\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:202\ngithub.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).ReCall\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:257\ngithub.com/milvus-io/milvus/internal/distributed/querycoord/client.(*Client).GetComponentStates\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/client/client.go:118\ngithub.com/milvus-io/milvus/internal/util/funcutil.WaitForComponentStates.func1\n\t/go/src/github.com/milvus-io/milvus/internal/util/funcutil/func.go:65\ngithub.com/milvus-io/milvus/internal/util/retry.Do\n\t/go/src/github.com/milvus-io/milvus/internal/util/retry/retry.go:37\ngithub.com/milvus-io/milvus/internal/util/funcutil.WaitForComponentStates\n\t/go/src/github.com/milvus-io/milvus/internal/util/funcutil/func.go:89\ngithub.com/milvus-io/milvus/internal/util/funcutil.WaitForComponentHealthy\n\t/go/src/github.com/milvus-io/milvus/internal/util/funcutil/func.go:104\ngithub.com/milvus-io/milvus/internal/distributed/proxy.(*Server).init\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/proxy/service.go:461\ngithub.com/milvus-io/milvus/internal/distributed/proxy.(*Server).Run\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/proxy/service.go:288\ngithub.com/milvus-io/milvus/cmd/components.(*Proxy).Run\n\t/go/src/github.com/milvus-io/milvus/cmd/components/proxy.go:50\ngithub.com/milvus-io/milvus/cmd/roles.(*MilvusRoles).runProxy.func1\n\t/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:155"]
@xiaofan-luan could you please take a look at this issue? /assign @xiaofan-luan /unassign
/assign @XuanYang-cn pls help on it
@xiaofan-luan: GitHub didn't allow me to assign the following users: pls, help, on, it.
Note that only milvus-io members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide
Milvus Log
I have 60MB of logs (gzipped 2.3 MB) where can I drop them?
@datenhahn I would like to examine the logs, please email them to me at [email protected]
[2022/08/09 04:37:59.035 +00:00] [ERROR] [datacoord/garbage_collector.go:148] ["parse segment id error"] [infoKey=file/delta_log/435147331058208065/435147415318167553/435148894957731843/435148894957731849] [error="file/delta_log/435147331058208065/435147415318167553/435148894957731843/435148894957731849 is not a valid binlog path"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*garbageCollector).scan\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/garbage_collector.go:148\ngithub.com/milvus-io/milvus/internal/datacoord.(*garbageCollector).work\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/garbage_collector.go:98"]
[2022/08/09 04:37:59.035 +00:00] [ERROR] [datacoord/garbage_collector.go:148] ["parse segment id error"] [infoKey=file/delta_log/435147331058208065/435147415318167553/435149549579796488/435149549592903686] [error="file/delta_log/435147331058208065/435147415318167553/435149549579796488/435149549592903686 is not a valid binlog path"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*garbageCollector).scan\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/garbage_collector.go:148\ngithub.com/milvus-io/milvus/
This is a bug that needs to be fixed, but it doesn't influence the recovery process and doesn't cause the panic, so I'll track it in issue #18585 and keep investigating.
open pid file: /run/milvus/standalone.pid
lock pid file: /run/milvus/standalone.pid
---Milvus Proxy successfully initialized and ready to serve!---
panic: failed to flush delete data, err = All attempts results:
attempt #1:cannot find segment, id = 435148680510046209
attempt #2:cannot find segment, id = 435148680510046209
attempt #3:cannot find segment, id = 435148680510046209
attempt #4:cannot find segment, id = 435148680510046209
attempt #5:cannot find segment, id = 435148680510046209
goroutine 1551 [running]:
github.com/milvus-io/milvus/internal/datanode.(*deleteNode).Operate(0xc000f0db90, 0xc0079068c0, 0x1, 0x1, 0x0, 0x0, 0x0)
/go/src/github.com/milvus-io/milvus/internal/datanode/flow_graph_delete_node.go:249 +0x10c5
github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).work(0xc002408680)
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:102 +0x23b
created by github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).Start
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:70 +0x70
@datenhahn could you email me the logs containing this part?
@XuanYang-cn I have mailed you the other logs. I didn't realize the logs had been rotated, so the new file is 300MB.
Maybe some more information about our setup which might be special: we use lots of partitions to speed up searches, and we add every vector to multiple partitions:
- Year (e.g. 2022)
- Month (e.g. 2022-08)
- Week (e.g. 2022-08-KW32)
- Day (e.g. 2022-08-10)
So we have lots of partitions with (initially) few vectors; as we add more vectors, the partitions fill up with more data.
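For illustration, a rough sketch of how those partitions are derived and written to (the helper names and the calendar-week formatting are simplified assumptions, not our exact code):
from datetime import date

from pymilvus import Collection

def time_partitions(day: date):
    """Return the year/month/week/day partition names for a given date."""
    _, week, _ = day.isocalendar()
    return [
        f"{day.year}",                                # e.g. "2022"
        f"{day.year}-{day.month:02d}",                # e.g. "2022-08"
        f"{day.year}-{day.month:02d}-KW{week:02d}",   # e.g. "2022-08-KW32"
        day.isoformat(),                              # e.g. "2022-08-10"
    ]

def insert_into_time_partitions(collection: Collection, doc_id: int, embedding, day: date):
    # Every vector is inserted into all four time-bucket partitions.
    for name in time_partitions(day):
        if not collection.has_partition(name):
            collection.create_partition(name)
        collection.insert([[doc_id], [embedding]], partition_name=name)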
@datenhahn I didn't receive the 300MB log, could you please resend me one? I'll try to reproduce this panic locally following your usage information while waiting for the logs.
I have some questions about usage details, looking forward to your reply:
- Did you load the whole collection, or only some of the partitions?
- Did you call num_entities every time there's an insert/delete?
- How did you query or search after inserting? Was it a query by ID or a search with limits?
Now I'm trying to figure out these problems:
- What caused the datanode panic (lack of information)
- What caused the querynode panic (lack of information, too)
- What made querycoord unable to recover (maybe related to the cluster issue #18595)
@XuanYang-cn I have uploaded the logs here now, maybe that works: https://drive.google.com/file/d/1TJYGAx_H2emOcF0XQruhlozDzGI9008I/view?usp=sharing
- We loaded the collection (I was not aware it is possible or required to load partitions separately), and the collection was shown in Attu as "loaded"
- Yes, I just saw that we used it as debugging information:
logger.info(f"Number of all inserted vectors: {inserted_vectors.insert_count}.")
logger.info(
f"Vectors {inserted_vectors.primary_keys} successfully added to partition {partition} of collection {collection.name}")
logger.info(
f"New number of entities in collection {collection.name}:\n{collection.num_entities}")
- The delete uses a "id in ..." expression, the search uses vectors with limit. The delete use the following expression:
check_if_id_exists = f"id in {[id]}"
deleted_entity = collection.delete(expr=check_if_id_exists, partition_name=partition)
The queries use vector search:
@run_time
def query_milvus(query_vectors, request, search_partitions, path=""):
    """Return the vectors of the documents that are closest to the query vector"""
    list_of_query_vectors = [dict_elem["embeddings"] if isinstance(dict_elem, dict) else dict_elem for dict_elem in query_vectors]
    query_results = collection.search(
        data=list_of_query_vectors,
        anns_field="embeddings",
        param=search_params,
        partition_names=search_partitions,
        limit=request.num_similar_documents
    )
    logger.info(f"Milvus search finished. Results: {query_results}")
    return query_results
Regarding the problems:
- The DataNode might have been killed by the Kubernetes cluster (e.g. we are using AWS Spot Instances)
Regarding the rest, maybe you'll find something in the log files.
Here are even more logs in the other issue (https://github.com/milvus-io/milvus/issues/18595#issuecomment-1210780825); it is the same setup. I tried Milvus Cluster and Milvus Standalone on Kubernetes, but both run into issues.
For me I see the following problems:
- If Milvus is running into resource issues it should block/reject further insert/delete requests, but it should not completely shut down
- If Milvus enters an invalid cluster state (e.g. important services like the datanode are missing) it should not panic, but keep logging/retrying until the resource is available again
- If Milvus dies for whatever reason, from panics or because AWS or Kubernetes ripped some cluster nodes away, it should be able to recover itself into some workable/debuggable state
- If the damage done to the cluster is too large to heal (e.g. permanent data loss), it should either shut down completely with meaningful messages or recover with the data that is still available (with the danger of not noticing that there was a data loss)
- One note: liveness probes might prevent the cluster from reaching a valid state if recovery takes longer than the probe allows; Kubernetes then marks the pod as dead, terminates and recreates it, introducing an endless recovery/restart loop. Such a restart loop does seem to happen here, but I don't think time is the issue, as queryCoord never recovers (to test whether it just needs more time, I increased the liveness probe retries to a very high number).
Currently I have the feeling that:
- Once resource limits are reached (memory, queue sizes, whatever), Milvus will just panic
- Once panicked, it cannot pull itself together anymore due to inconsistent state
- The required sizing for many vectors/partitions and the required config settings (e.g. queue sizes, memory sizes) are not clear; maybe we need to configure something differently?
Hi @datenhahn, thanks for the very detailed feedback. The problems you point out are exactly what we are trying to address in Milvus 2.2:
- Limits on data ingestion rate and data size to avoid system crashes
- Instead of panicking, processing some of the requests asynchronously and retrying until the service recovers
- Protection and better scheduling of search requests
- Refactoring meta to avoid inconsistency (especially when failures happen)
Stability will be the first priority of the community~ If you'd like to help, that would be great, because we love your suggestions.
@datenhahn feel free to reach out to me at [email protected]~ I'd like to offer more help.
Why the DataNode panics at delete_buffer_node and how we're going to fix it:
- insert_buffer_node processes flush messages at one per 200ms tick (tt) per segment, so when a large number of flushed segments arrives, it takes a very long time to pass all the flush messages through.
- There are lots of small segments (tens of thousands of segments with num_rows==1), which generates many compactions.
- When a compaction happens after a flush but the compaction finishes faster than the flush, delete_node will panic.
// Manual Flush
select {
case fmsg := <-ibNode.flushChan:
    log.Info("(Manual Flush) receiving flush message",
        zap.Int64("collectionID", fmsg.collectionID),
...
for i, msg := range fgMsg.deleteMessages {
    tmpSegIDs, err := dn.bufferDeleteMsg(msg, fgMsg.timeRange)
}
...
func (dn *deleteNode) bufferDeleteMsg(msg *msgstream.DeleteMsg, tr TimeRange) ([]UniqueID, error) {
    ...
    // Update delBuf for merged segments
    compactedTo2From := dn.replica.listCompactedSegmentIDs()
    for compactedTo, compactedFrom := range compactedTo2From {
        compactToDelBuff := newDelDataBuf()
        for _, segID := range compactedFrom {
            value, loaded := dn.delBuf.LoadAndDelete(segID)
            if loaded {
                compactToDelBuff.updateFromBuf(value.(*DelDataBuf))
            }
            ...
How to fix:
- Increase the number of tasks received by manual flush in each tick.
- Check compacted segments even when there are no delete messages.
Bits of advice on usage:
- Don't generate small segments. Small segments cannot build indexes.
- We currently seal a segment whenever num_entities is called. With this in mind, I suggest not calling num_entities after every insert/delete. To get the number of rows, you can use get_query_segment_info instead.
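For example, a minimal sketch of reading the row count that way with pymilvus (the connection parameters and the collection name "my_collection" are placeholders):
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")  # placeholder endpoint

# Sum the per-segment row counts reported by the query nodes instead of
# calling collection.num_entities (which seals the growing segment).
segment_infos = utility.get_query_segment_info("my_collection")
total_rows = sum(info.num_rows for info in segment_infos)
print(f"Rows currently served by the query nodes: {total_rows}")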
/assign @aoiasd
I could repro this quite reliably on v2.1.1.
I just upgraded to milvusdb/milvus:master-20220914-16bfa63 and I can't repro any more.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.