
[Bug]: the collection crashes after some entities are deleted

Open yesyue opened this issue 9 months ago • 5 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.4
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): kafka
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS): CentOS
- CPU/Memory: 16c128G x41
- GPU: 0
- Others:

Current Behavior

The collection crashed after some entities were deleted, and it cannot be loaded again.

Then I ran some tests, and it happened again.

PS: In production there are actually far fewer deletes than in this test, but collections that had some deletes could not be loaded again after OOMing once.

Expected Behavior

No response

Steps To Reproduce

This test can reproduce it (a minimal sketch of the workload follows the list):
1. 15 threads insert entities, each inserting a random batch of 100~150.
2. 15 threads delete entities, each deleting 50 at a time.
3. The datanode OOMs.
4. Reload the collection; it cannot load, even after many attempts.
5. On every reload attempt, the querynode OOMs.
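
A minimal pymilvus sketch of this workload, for reference only (the original test script is not attached; the collection name, schema, and primary-key range below are assumptions):

```python
# Hypothetical reproduction sketch: assumes a collection "repro_coll" with an
# int64 primary key "pk" (auto_id disabled) and a 128-dim float vector field "vec".
import random
import threading

from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
coll = Collection("repro_coll")

def insert_worker():
    # Each round inserts a random batch of 100~150 entities.
    for _ in range(1000):
        n = random.randint(100, 150)
        pks = [random.randint(1, 10_000_000) for _ in range(n)]
        vecs = [[random.random() for _ in range(128)] for _ in range(n)]
        coll.insert([pks, vecs])

def delete_worker():
    # Each round deletes 50 entities by primary key.
    for _ in range(1000):
        pks = random.sample(range(1, 10_000_000), 50)
        coll.delete(f"pk in {pks}")

threads = [threading.Thread(target=insert_worker) for _ in range(15)]
threads += [threading.Thread(target=delete_worker) for _ in range(15)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```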

Milvus Log

No response

Anything else?

No response

yesyue avatar May 09 '24 08:05 yesyue

@yesyue could you please attach the Milvus logs for investigation? It would also be perfect if you could attach the birdwatcher backup file. /assign @yesyue /unassign

yanliang567 avatar May 10 '24 01:05 yanliang567

I am facing the same issue as well. I fell back to version 2.3.x and couldn't load the collection. After I deleted and rebuilt the index, the collection could be reloaded. When the exception occurs on version 2.4.1, the collection can be loaded but the querynode uses 100% CPU.

syang1997 avatar May 20 '24 08:05 syang1997

I am facing the same issue as well. I fell back to version 2.3.x and couldn't load the collection. After I deleted and rebuilt the index, the collection could be reloaded. When the exception occurs on version 2.4.1, the collection can be loaded but the querynode uses 100% CPU.

@syang1997 could you please summarize the steps to reproduce this issue (or what you did before the issue popped up)? Also, please attach the Milvus logs: refer to this doc to export the full Milvus logs for investigation. For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs.

yanliang567 avatar May 20 '24 09:05 yanliang567

/assign @XuanYang-cn

xiaofan-luan avatar May 20 '24 09:05 xiaofan-luan

I am facing the same issue as well. I fell back to version 2.3.x and couldn't load the collection. After I deleted and rebuilt the index, the collection could be reloaded. When the exception occurs on version 2.4.1, the collection can be loaded but the querynode uses 100% CPU.

@syang1997 could you please summarize the steps to reproduce this issue (or what you did before the issue popped up)? Also, please attach the Milvus logs: refer to this doc to export the full Milvus logs for investigation. For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs.

  1. Upgraded Milvus from version 2.3.4 to 2.4.1.
  2. Initial queries were normal after the upgrade completed.
  3. A large number of deletion operations were performed using the function DeleteByPks.
  4. Query latency was extremely high at 5,000 ms, querynode CPU usage reached 100%, and Attu could not display collection information.
  5. After reverting to version 2.3.4, collections could not be reloaded; each collection would get stuck at a certain percentage.
  6. After deleting and rebuilding the vector index of the collection, it could be loaded normally and the latency was back to normal.

syang1997 avatar May 21 '24 06:05 syang1997

Can I ask what is the index type?

liliu-z avatar May 21 '24 08:05 liliu-z

Can I ask what is the index type?

The vector index is HNSW with metric_type: L2, M: 8, efConstruction: 128.
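
For reference, the same configuration expressed as pymilvus index parameters (a sketch; the collection name and the vector field name "vec" are assumptions):

```python
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
coll = Collection("your_collection")  # assumed collection name

# Index configuration as reported above; "vec" is an assumed field name.
index_params = {
    "index_type": "HNSW",
    "metric_type": "L2",
    "params": {"M": 8, "efConstruction": 128},
}
coll.create_index(field_name="vec", index_params=index_params)
```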

syang1997 avatar May 21 '24 10:05 syang1997

I am facing the same issue as well. I fell back to version 2.3.x and couldn't load the collection. After I deleted and rebuilt the index, the collection could be reloaded. When the exception occurs on version 2.4.1, the collection can be loaded but the querynode uses 100% CPU.

@syang1997 could you please summarize the steps to reproduce this issue (or what you did before the issue popped up)? Also, please attach the Milvus logs: refer to this doc to export the full Milvus logs for investigation. For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs.

1. Upgraded Milvus from version 2.3.4 to 2.4.1.

2. Initial queries were normal after the upgrade completed.

3. A large number of deletion operations were performed using the function DeleteByPks.

4. Query latency was extremely high at 5,000 ms, querynode CPU usage reached 100%, and Attu could not display collection information.

5. After reverting to version 2.3.4, collections could not be reloaded; each collection would get stuck at a certain percentage.

6. After deleting and rebuilding the vector index of the collection, it could be loaded normally and the latency was back to normal.

I believe the issue you met is not the same as this one, @syang1997. You could drop the index and then rebuild it to work around the issue. If that does not work for you, please file a new issue for us. Thanks.
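
A minimal pymilvus sketch of that workaround (the collection name, field name, and index parameters below are assumptions; reuse your own):

```python
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
coll = Collection("your_collection")  # assumed collection name

# Release the collection first so the index can be dropped, then rebuild and load.
coll.release()
coll.drop_index()
coll.create_index(
    field_name="vec",  # assumed vector field name
    index_params={
        "index_type": "HNSW",
        "metric_type": "L2",
        "params": {"M": 8, "efConstruction": 128},
    },
)
coll.load()
```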

yanliang567 avatar May 22 '24 02:05 yanliang567

There is no way you can downgrade from 2.4.1 to 2.3.4. 2.4.1 introduced L0 deletes, and 2.3.4 cannot process that data format.

xiaofan-luan avatar May 22 '24 04:05 xiaofan-luan

There is no way you can downgrade from 2.4.1 to 2.3.4. 2.4.1 introduced L0 deletes, and 2.3.4 cannot process that data format.

You mean the L0 Segment functionality, right? It indeed could not be loaded after downgrading to v2.3.x, but it could be loaded again after deleting the index and rebuilding it.

syang1997 avatar May 22 '24 06:05 syang1997

There is no way you can downgrade from 2.4.1 to 2.3.4. 2.4.1 introduced L0 deletes, and 2.3.4 cannot process that data format.

You mean the L0 Segment functionality, right? It indeed could not be loaded after downgrading to v2.3.x, but it could be loaded again after deleting the index and rebuilding it.

You are still losing delete data, so this is not recommended.

xiaofan-luan avatar May 22 '24 06:05 xiaofan-luan

There is no way you can downgrade from 2.4.1 to 2.3.4. 2.4.1 introduced L0 deletes, and 2.3.4 cannot process that data format.

You mean the L0 Segment functionality, right? It indeed could not be loaded after downgrading to v2.3.x, but it could be loaded again after deleting the index and rebuilding it.

You are still losing delete data, so this is not recommended.

Thank you, I understand the situation now. You mean the deletion information in the L0 storage of v2.4.0 has been lost due to rebuilding the index. I will make sure to add back the missing deletion information later.

syang1997 avatar May 22 '24 07:05 syang1997

This has nothing to do with the index build, I guess.

Deletes are stored in a different format in 2.4 compared to 2.3.

xiaofan-luan avatar May 22 '24 07:05 xiaofan-luan

This has nothing to do with the index build, I guess.

Deletes are stored in a different format in 2.4 compared to 2.3.

When I couldn't load data after reverting to version 2.3.x, I successfully loaded it by rebuilding the index.

If it's not related to the index, why was I able to load it in version 2.3.x?

syang1997 avatar May 22 '24 09:05 syang1997

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Jun 22 '24 12:06 stale[bot]