
[BUG] Remote Purge threadpool taking too much memory when cleaning up too many deleted indices

gbbafna opened this issue 1 year ago · 2 comments

Describe the bug

[Screenshot: heap dump showing Remote Purge threadpool instances]

We use the Remote Purge threadpool to delete segment data for deleted indices in shallow snapshots. When both the number of such indices and the number of snapshots are large, work piles up on the Remote Purge threadpool. In the heap dump above, we can see about 30 million instances associated with the Remote Purge threadpool occupying around 30 GB of memory.
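For context, a minimal sketch (hypothetical names, not the actual OpenSearch purge code) of why this piles up: an executor backed by an unbounded work queue retains one Runnable per submitted purge task, so pending work grows with indices × snapshots:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class UnboundedPurgeDemo {

    static void purge(int index, int snapshot) {
        // delete remote segment data for this (index, snapshot) pair -- elided
    }

    public static void main(String[] args) {
        // newFixedThreadPool is backed by an *unbounded* LinkedBlockingQueue,
        // so every submitted task stays in memory until a worker runs it.
        ExecutorService remotePurge = Executors.newFixedThreadPool(4);

        int deletedIndices = 10_000;
        int snapshotsPerIndex = 3_000;

        // One task per (index, snapshot) pair: 30 million queued Runnables,
        // matching the scale seen in the heap dump above.
        for (int i = 0; i < deletedIndices; i++) {
            for (int s = 0; s < snapshotsPerIndex; s++) {
                final int index = i;
                final int snapshot = s;
                remotePurge.submit(() -> purge(index, snapshot));
            }
        }
        remotePurge.shutdown();
    }
}
```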

Related component

Storage:Remote


Expected behavior

The Remote Purge threadpool should be bounded.

Shallow snapshot deletion also needs to be smarter and handle this cleanup in a scalable way.
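A minimal sketch of what bounding could look like, using plain java.util.concurrent rather than OpenSearch's own ThreadPool abstraction (the pool size and queue capacity here are illustrative assumptions):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedPurgePool {
    public static ThreadPoolExecutor create() {
        // Bounded queue: at most 10,000 pending purge tasks are held in memory.
        LinkedBlockingQueue<Runnable> queue = new LinkedBlockingQueue<>(10_000);
        return new ThreadPoolExecutor(
                4, 4,                      // fixed pool of 4 purge threads
                0L, TimeUnit.MILLISECONDS, // no keep-alive needed for a fixed pool
                queue,
                // When the queue is full, run the task on the submitting thread,
                // throttling producers instead of buffering tasks without limit.
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```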


gbbafna · Feb 08 '24

Thanks for creating this issue @gbbafna. One more optimization I can think of: in the close() method of the RemoteSegmentStoreDirectory class (https://github.com/opensearch-project/OpenSearch/blob/76ae14a4f2e99592610d3181543f5036214ceb7a/server/src/main/java/org/opensearch/index/store/RemoteSegmentStoreDirectory.java#L878-L879), we clean up one segment file at a time, followed by the corresponding metadata file, and only at the end do we delete the directories. Since we already know the shard is being closed after deletion, we could instead delete the directories directly via BlobContainer.delete(), which in most repository implementations uses batch deletion internally to clean up the individual objects, as sketched below.
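For illustration, a rough sketch of the two approaches. The BlobContainer interface below is a simplified stand-in for the real org.opensearch.common.blobstore abstraction; the method shapes here are assumptions, not the actual OpenSearch API:

```java
import java.io.IOException;
import java.util.List;

// Simplified stand-in for the repository's blob container abstraction.
interface BlobContainer {
    List<String> listBlobNames() throws IOException;
    void deleteBlob(String name) throws IOException; // one remote call per object
    void delete() throws IOException;                // recursive delete; most repository
                                                     // implementations batch this server-side
}

public class ShardCleanup {

    // Current close() behavior, conceptually: one remote deletion per
    // segment/metadata file, i.e. N calls for N files.
    static void deleteOneByOne(BlobContainer shardContainer) throws IOException {
        for (String blob : shardContainer.listBlobNames()) {
            shardContainer.deleteBlob(blob);
        }
    }

    // Proposed: a single delete() on the container, letting the repository
    // implementation batch the underlying object deletions.
    static void deleteWholeContainer(BlobContainer shardContainer) throws IOException {
        shardContainer.delete();
    }
}
```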

Let me know if this makes sense. I will raise a draft PR for this shortly, along with some optimizations on the snapshot deletion side.

harishbhakuni · Feb 13 '24

"Let me know if this makes sense."

@harishbhakuni This sounds like a solid mitigation that will reduce the overhead when running into this issue. I think a draft PR would be a great next step if you can spin one up.

peternied · Feb 14 '24

[Storage Triage]

@harishbhakuni The linked PR is closed. Will there be further PRs, or can this issue be closed?

ashking94 · Apr 18 '24

Hi @ashking94, this issue can be closed.

harishbhakuni · Apr 30 '24