[BUG] Remote Purge threadpool taking too much memory when cleaning up many deleted indices
Describe the bug
We use the Remote Purge threadpool to delete segment data for deleted indices in shallow snapshots. When the number of such indices is huge, and the count of snapshots is also huge, we see a pile-up in the Remote Purge threadpool. In the heap dump above, we can see 30 million instances of Remote Purge threads consuming around 30 GB of memory.
Related component
Storage:Remote
Expected behavior
The Remote Purge threadpool should be bounded.
Shallow snapshot deletion also needs to be smarter, so that it handles this cleanup in a scalable way.
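To illustrate what "bounded" could mean here, below is a minimal, hypothetical sketch using plain JDK executors rather than OpenSearch's own threadpool registration: a fixed pool size plus a bounded work queue, so an unbounded backlog of purge tasks cannot accumulate on the heap. The constants REMOTE_PURGE_POOL_SIZE and REMOTE_PURGE_QUEUE_SIZE, and the rejection policy, are assumptions chosen only for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public final class BoundedRemotePurgeExecutorSketch {
    // Hypothetical sizing constants, for illustration only.
    private static final int REMOTE_PURGE_POOL_SIZE = 8;
    private static final int REMOTE_PURGE_QUEUE_SIZE = 10_000;

    public static ThreadPoolExecutor create() {
        return new ThreadPoolExecutor(
            REMOTE_PURGE_POOL_SIZE,              // core pool size
            REMOTE_PURGE_POOL_SIZE,              // max pool size (fixed pool)
            0L, TimeUnit.MILLISECONDS,           // no keep-alive needed for a fixed pool
            // Bounded queue: once full, new purge tasks cannot pile up in memory.
            new ArrayBlockingQueue<>(REMOTE_PURGE_QUEUE_SIZE),
            // One possible back-pressure choice: the submitting thread runs the
            // purge itself when the queue is full, slowing down submission.
            new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```

Whatever the exact sizing, the key property is that the queue is bounded and rejections apply back-pressure instead of letting the task backlog grow without limit.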
Thanks for creating this issue, @gbbafna.
Also, one more optimization I can think of: in the close() method of the RemoteSegmentStoreDirectory class
https://github.com/opensearch-project/OpenSearch/blob/76ae14a4f2e99592610d3181543f5036214ceb7a/server/src/main/java/org/opensearch/index/store/RemoteSegmentStoreDirectory.java#L878-L879
we are cleaning up one segment file at a time, followed by the corresponding md file, and only at the end do we clean up the directories. Since we already know the shard is being closed after deletion, we can instead directly clean up the directories using BlobContainer.delete(), which internally uses batch deletion in most repository implementations to clean up the individual objects.
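To make that concrete, here is a rough, hypothetical sketch of what directory-level cleanup could look like. It assumes access to the BlobContainer instances backing the shard's segment data and metadata paths (the names remoteDataContainer and remoteMetadataContainer are made up for illustration), and that BlobContainer.delete() removes the container together with all of its contents, letting repositories that support it batch the underlying object deletions.

```java
import java.io.IOException;
import org.opensearch.common.blobstore.BlobContainer;
import org.opensearch.common.blobstore.DeleteResult;

/**
 * Illustrative only: removes the whole segment data and metadata
 * directories with two container-level calls instead of deleting
 * each segment file and metadata file individually.
 */
final class ShardDirectoryCleanupSketch {

    private final BlobContainer remoteDataContainer;     // hypothetical: backs the shard's segment data path
    private final BlobContainer remoteMetadataContainer; // hypothetical: backs the shard's metadata path

    ShardDirectoryCleanupSketch(BlobContainer remoteDataContainer, BlobContainer remoteMetadataContainer) {
        this.remoteDataContainer = remoteDataContainer;
        this.remoteMetadataContainer = remoteMetadataContainer;
    }

    /**
     * Since the shard is known to be closing after deletion, per-file cleanup
     * is unnecessary; deleting the containers lets the repository use its
     * batch-delete support (where available) for the individual objects.
     */
    void deleteShardDirectories() throws IOException {
        DeleteResult dataResult = remoteDataContainer.delete();
        DeleteResult metadataResult = remoteMetadataContainer.delete();
        // DeleteResult reports how many blobs and bytes were removed,
        // which could be surfaced in logs for observability.
    }
}
```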
Let me know if this makes sense. I will raise a draft PR for this in some time, along with some snapshot-deletion-side optimizations.
@harishbhakuni This sounds like a solid mitigation that will reduce the overhead when running into this issue. I think a draft PR would be a great next step if you can spin one up.
[Storage Triage - attendees 1 2 3 4 5 6 7 8 9 10 11 12]
@harishbhakuni The linked PR is closed. Will there be further PRs, or can this issue be closed?
Hi @ashking94, this issue can be closed.