[Bug]: the collection crashes after some entities are deleted
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: 2.4
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 16c128G x41
- GPU: 0
- Others:
Current Behavior
The collection crashed after some entities were deleted, and it cannot be loaded again.
Then I ran some tests, and it happened again.
PS: In production there are actually far fewer deletes than in this test, but collections that had some deletes could not be loaded again after one OOM.
Expected Behavior
No response
Steps To Reproduce
This test can reproduce it (a minimal pymilvus sketch of the workload is included after the steps):
1. 15 threads insert entities, 100~150 per insert chosen randomly
2. 15 threads delete entities, 50 per delete
3. The datanode OOMs
4. Reload the collection; the collection cannot load, even after many tries
5. On every reload attempt, the querynode OOMs
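For clarity, here is a minimal pymilvus sketch of that workload. It is an illustrative reproduction under assumptions, not the reporter's actual test code: the collection name, schema, dimension, primary-key scheme, and connection settings are all placeholders.

```python
# Hypothetical reproduction sketch (not the reporter's actual test code).
# Assumes Milvus 2.4, pymilvus 2.4.x, and placeholder names/dimensions.
import random
import threading

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

DIM = 64  # placeholder vector dimension

connections.connect("default", host="localhost", port="19530")

schema = CollectionSchema([
    FieldSchema("pk", DataType.INT64, is_primary=True),
    FieldSchema("vec", DataType.FLOAT_VECTOR, dim=DIM),
])
coll = Collection("repro_delete_crash", schema)

def inserter(worker_id: int, rounds: int = 1000) -> None:
    # 15 insert threads, each inserting 100~150 random entities per round.
    for r in range(rounds):
        n = random.randint(100, 150)
        base = worker_id * 10_000_000 + r * 200
        pks = [base + i for i in range(n)]
        vecs = [[random.random() for _ in range(DIM)] for _ in range(n)]
        coll.insert([pks, vecs])

def deleter(worker_id: int, rounds: int = 1000) -> None:
    # 15 delete threads, each deleting 50 entities per round by primary key.
    for r in range(rounds):
        base = worker_id * 10_000_000 + r * 200
        pks = [base + i for i in range(50)]
        coll.delete(expr=f"pk in {pks}")

threads = [threading.Thread(target=inserter, args=(i,)) for i in range(15)]
threads += [threading.Thread(target=deleter, args=(i,)) for i in range(15)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```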
Milvus Log
No response
Anything else?
No response
@yesyue could you please attach the Milvus logs for investigation? It would also be perfect if you could attach a birdwatcher backup file. /assign @yesyue /unassign
I am facing the same issue as well. I fell back to version 2.3.x and couldn't load the collection. After I deleted and rebuilt the index, the collection could be reloaded. When the exception occurred on version 2.4.1, the collection could be loaded but the querynode used 100% CPU.
@syang1997 could you please summarize some steps to reproduce this issue (or what you did before the issue popped up)? Also please help attach the Milvus logs: refer to this doc to export the whole Milvus logs for investigation. For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs.
/assign @XuanYang-cn
- Upgrade Milvus version from 2.3.4 to 2.4.1.
- Initial queries were normal after the upgrade completed.
- A large number of deletion operations were performed using the function DeleteByPks (see the sketch after this list).
- Query latency was extremely high at around 5,000 ms, querynode CPU usage reached 100%, and Attu was displaying the collection information.
- After reverting to version 2.3.4, collections could not be reloaded; each collection would get stuck at a certain loading percentage.
- After deleting and rebuilding the vector index of the collection, it could be loaded normally and the latency was back to normal.
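For reference, DeleteByPks is a helper in the Go SDK; in pymilvus the same bulk delete by primary key is written as a delete with an "in" expression. A minimal sketch, assuming an int64 primary key field named id and a placeholder collection name:

```python
# Hypothetical pymilvus equivalent of the Go SDK's DeleteByPks:
# deleting a batch of entities by primary key with an "in" expression.
from pymilvus import Collection, connections

connections.connect("default", host="localhost", port="19530")
coll = Collection("my_collection")   # placeholder collection name

pks_to_delete = [101, 102, 103]      # placeholder primary keys
coll.delete(expr=f"id in {pks_to_delete}")
```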
Can I ask what the index type is?
The vector index is HNSW, with metric_type: L2, M: 8, efConstruction: 128.
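For readers following along, those parameters would map onto a pymilvus create_index call roughly like this (collection and field names are placeholders, not taken from the report):

```python
# Hypothetical sketch of the reported HNSW index parameters as a
# pymilvus create_index call; names are placeholders.
from pymilvus import Collection, connections

connections.connect("default", host="localhost", port="19530")
coll = Collection("my_collection")
coll.create_index(
    field_name="vec",
    index_params={
        "index_type": "HNSW",
        "metric_type": "L2",
        "params": {"M": 8, "efConstruction": 128},
    },
)
```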
I believe the issue you met is not the same as this one, @syang1997. You could drop the index and then rebuild it to work around the issue (a sketch follows below). If that does not work for you, please file a new issue for us. Thanks.
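A minimal pymilvus sketch of that workaround, assuming the collection is released first and the index is rebuilt with the same HNSW parameters as above; all names are placeholders:

```python
# Hypothetical sketch of the suggested workaround: drop the vector index,
# rebuild it, then load the collection again. Names are placeholders.
from pymilvus import Collection, connections, utility

connections.connect("default", host="localhost", port="19530")
coll = Collection("my_collection")

coll.release()      # make sure the collection is not (partially) loaded
coll.drop_index()   # drop the existing vector index

coll.create_index(
    field_name="vec",
    index_params={
        "index_type": "HNSW",
        "metric_type": "L2",
        "params": {"M": 8, "efConstruction": 128},
    },
)
utility.wait_for_index_building_complete("my_collection")

coll.load()
utility.wait_for_loading_complete("my_collection")
```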
There is no way you can downgrade from 2.4.1 to 2.3.4. 2.4.1 introduced L0 deletes, and 2.3.4 cannot process that data format.
You mean the L0 Segment functionality, right? It indeed could not be loaded after downgrading to v2.3.x, but it could be loaded again after deleting the index and rebuilding it.
You are still losing delete data, so this is not recommended.
Thank you, I understand the situation now. You mean the deletion information in L0 storage in v2.4.0 has been lost due to rebuilding the index. I will make sure to add the missing deletion information back later.
This has nothing to do with the index build, I guess.
Deletes are stored in a different format in 2.4 compared to 2.3.
When I couldn't load data after reverting to version 2.3.x, I successfully loaded it by rebuilding the index.
If it's not related to the index, why was I able to load it in version 2.3.x?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen