High heap usage due to snapshot post-deletion cleanup
When deleting a snapshot we accumulate in memory a list of all the blobs that can be deleted after the repository update is committed. Each blob name takes only ~80B of heap, but the number of blobs can be very large (it's theoretically unbounded). I've seen ~100M blobs to delete in practice, which adds up to several GiB of heap in total. We should find a way to track this work with bounded heap usage.
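Back-of-the-envelope from the numbers above: 100M names × ~80 B each ≈ 8 GB (~7.5 GiB) of heap held just for the pending-deletion list, before any other overhead.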
Pinging @elastic/es-distributed (Team:Distributed)
As a quick improvement, I think we could accumulate the blob names in memory using a (compressed) `BytesStreamOutput` rather than each one being a separate `String` object. Each name should have ~17 bytes of entropy (16B for the UUID plus a little overhead) so that's a ~4.7× memory saving right away vs the 80-bytes-per-name we have at the moment.
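A minimal standalone sketch of the idea, using plain JDK streams rather than Elasticsearch's actual `BytesStreamOutput`/`StreamInput` classes (the class and method names below are illustrative, not the real repository code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.function.Consumer;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/**
 * Illustration only: append every blob name into a single compressed buffer
 * instead of keeping one String object per name.
 */
public class PackedBlobNames {
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    private final DeflaterOutputStream deflater = new DeflaterOutputStream(bytes);
    private final DataOutputStream out = new DataOutputStream(deflater);

    /** Append one blob name; only the compressed bytes stay on the heap. */
    public void add(String blobName) throws IOException {
        out.writeUTF(blobName);
    }

    /** Finish compression and return the packed representation. */
    public byte[] finish() throws IOException {
        out.flush();
        deflater.finish();
        return bytes.toByteArray();
    }

    /** Stream the names back out of a packed buffer, e.g. when executing the deletes. */
    public static void forEachName(byte[] packed, Consumer<String> consumer) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new InflaterInputStream(new ByteArrayInputStream(packed)))) {
            while (true) {
                try {
                    consumer.accept(in.readUTF());
                } catch (EOFException e) {
                    return;
                }
            }
        }
    }
}
```

Since the names are UUID-like strings with little redundancy, most of the saving here comes from not paying per-`String` object overhead; the compression is a bonus on top.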
As a slightly-less-quick (but still fairly quick) improvement that achieves O(1) memory usage: whenever such a `BytesStreamOutput` gets large enough we could spill its contents out to a blob in the blob store and drop it from memory, then read it back in later on after the new `RepositoryData` is committed and we're processing those deletes. That introduces some complexity around cleaning up those blobs after a master failover, but it seems surmountable.
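For illustration, a rough sketch of the spill mechanism under the same caveat: plain JDK types, with hypothetical `spillToBlobStore`/`readSpilledBlob` hooks standing in for real blob-store calls, and no compression for brevity:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Sketch: accumulate pending blob-name deletions with bounded heap by spilling
 * full buffers to the blob store and reading them back once the repository
 * update is committed. The blob-store hooks below are hypothetical.
 */
public class SpillingDeleteList {
    private static final int SPILL_THRESHOLD_BYTES = 1 << 20; // spill after ~1 MiB buffered

    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(buffer);
    private final List<String> spilledBlobIds = new ArrayList<>(); // ids of spill blobs, tiny

    public void add(String blobName) throws IOException {
        out.writeUTF(blobName);
        if (buffer.size() >= SPILL_THRESHOLD_BYTES) {
            spill();
        }
    }

    private void spill() throws IOException {
        out.flush();
        // hypothetical call: write the buffer as a temporary blob and remember its id
        String spillBlobId = spillToBlobStore(buffer.toByteArray());
        spilledBlobIds.add(spillBlobId);
        buffer.reset(); // heap usage drops back to ~zero
    }

    /** After the new RepositoryData is committed, replay every recorded name. */
    public void forEachPendingDelete(Consumer<String> deleter) throws IOException {
        if (buffer.size() > 0) {
            spill(); // flush whatever is still buffered
        }
        for (String spillBlobId : spilledBlobIds) {
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(readSpilledBlob(spillBlobId)))) {
                while (true) {
                    try {
                        deleter.accept(in.readUTF());
                    } catch (EOFException e) {
                        break;
                    }
                }
            }
        }
    }

    // --- hypothetical blob-store hooks, not real Elasticsearch APIs ---
    private String spillToBlobStore(byte[] contents) throws IOException {
        throw new UnsupportedOperationException("illustrative stub");
    }

    private byte[] readSpilledBlob(String spillBlobId) throws IOException {
        throw new UnsupportedOperationException("illustrative stub");
    }
}
```

Only the small list of spill-blob ids stays on the heap, which is what bounds memory; the failover-cleanup concern in the comment above is about making sure those spill blobs themselves get deleted if a new master takes over mid-operation.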