
[BUG] Remote Purge threadpool taking too much memory when cleaning up too many deleted indices

gbbafna opened this issue 1 year ago · 2 comments

Describe the bug

[Screenshot: heap dump showing Remote Purge threadpool instances]

We use the Remote Purge threadpool to delete segment data for deleted indices in shallow snapshots. When both the number of such indices and the number of snapshots are large, work piles up on the Remote Purge threadpool. In the heap dump above, we can see about 30 million instances associated with the Remote Purge threadpool occupying around 30 GB of memory.
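For context, a minimal sketch (hypothetical names, not the actual OpenSearch purge code) of why this piles up: an executor backed by an unbounded work queue retains one Runnable per submitted purge task, so pending work grows with indices × snapshots:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class UnboundedPurgeDemo {

    static void purge(int index, int snapshot) {
        // delete remote segment data for this (index, snapshot) pair -- elided
    }

    public static void main(String[] args) {
        // newFixedThreadPool is backed by an *unbounded* LinkedBlockingQueue,
        // so every submitted task stays in memory until a worker runs it.
        ExecutorService remotePurge = Executors.newFixedThreadPool(4);

        int deletedIndices = 10_000;
        int snapshotsPerIndex = 3_000;

        // One task per (index, snapshot) pair: 30 million queued Runnables,
        // matching the scale seen in the heap dump above.
        for (int i = 0; i < deletedIndices; i++) {
            for (int s = 0; s < snapshotsPerIndex; s++) {
                final int index = i;
                final int snapshot = s;
                remotePurge.submit(() -> purge(index, snapshot));
            }
        }
        remotePurge.shutdown();
    }
}
```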

Related component

Storage:Remote


Expected behavior

The Remote Purge threadpool should be bounded.

Shallow snapshot deletion also needs to be smarter and handle this cleanup in a scalable way.
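A minimal sketch of what bounding could look like, using plain java.util.concurrent rather than OpenSearch's own ThreadPool abstraction (the pool size and queue capacity here are illustrative assumptions):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedPurgePool {
    public static ThreadPoolExecutor create() {
        // Bounded queue: at most 10,000 pending purge tasks are held in memory.
        LinkedBlockingQueue<Runnable> queue = new LinkedBlockingQueue<>(10_000);
        return new ThreadPoolExecutor(
                4, 4,                      // fixed pool of 4 purge threads
                0L, TimeUnit.MILLISECONDS, // no keep-alive needed for a fixed pool
                queue,
                // When the queue is full, run the task on the submitting thread,
                // throttling producers instead of buffering tasks without limit.
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```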


gbbafna · Feb 08 '24

Thanks for creating this issue @gbbafna. One more optimization I can think of: in the close() method of the RemoteSegmentStoreDirectory class (https://github.com/opensearch-project/OpenSearch/blob/76ae14a4f2e99592610d3181543f5036214ceb7a/server/src/main/java/org/opensearch/index/store/RemoteSegmentStoreDirectory.java#L878-L879), we clean up one segment file at a time, followed by the corresponding metadata file, and only at the end do we delete the directories. Since we already know the shard is being closed after deletion, we could instead delete the directories directly via BlobContainer.delete(), which in most repository implementations uses batch deletion internally to clean up the individual objects, as sketched below.
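For illustration, a rough sketch of the two approaches. The BlobContainer interface below is a simplified stand-in for the real org.opensearch.common.blobstore abstraction; the method shapes here are assumptions, not the actual OpenSearch API:

```java
import java.io.IOException;
import java.util.List;

// Simplified stand-in for the repository's blob container abstraction.
interface BlobContainer {
    List<String> listBlobNames() throws IOException;
    void deleteBlob(String name) throws IOException; // one remote call per object
    void delete() throws IOException;                // recursive delete; most repository
                                                     // implementations batch this server-side
}

public class ShardCleanup {

    // Current close() behavior, conceptually: one remote deletion per
    // segment/metadata file, i.e. N calls for N files.
    static void deleteOneByOne(BlobContainer shardContainer) throws IOException {
        for (String blob : shardContainer.listBlobNames()) {
            shardContainer.deleteBlob(blob);
        }
    }

    // Proposed: a single delete() on the container, letting the repository
    // implementation batch the underlying object deletions.
    static void deleteWholeContainer(BlobContainer shardContainer) throws IOException {
        shardContainer.delete();
    }
}
```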

Let me know if this makes sense. I will raise a draft PR for this shortly, along with some optimizations on the snapshot deletion side.

harishbhakuni · Feb 13 '24

"Let me know if this makes sense."

@harishbhakuni This sounds like a solid mitigation that will reduce the overhead when running into this issue. I think a draft PR would be a great next step if you can spin one up.

peternied · Feb 14 '24

[Storage Triage]

@harishbhakuni The linked PR is closed. Will there be further PRs, or can this issue be closed?

ashking94 · Apr 18 '24

Hi @ashking94, this issue can be closed.

harishbhakuni · Apr 30 '24