High heap usage due to snapshot post-deletion cleanup
When deleting a snapshot we accumulate in memory a list of all the blobs that can be deleted after the repository update is committed. Each blob name takes only ~80B of heap, but the number of blobs can be very large (it's theoretically unbounded). I've seen ~100M blobs to delete in practice, which adds up to several GiB of heap in total. We should find a way to track this work with bounded heap usage.
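Back-of-the-envelope from the numbers above: 100M names × ~80 B each ≈ 8 GB (~7.5 GiB) of heap held just for the pending-deletion list, before any other overhead.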
Pinging @elastic/es-distributed (Team:Distributed)
As a quick improvement, I think we could accumulate the blob names in memory using a (compressed) `BytesStreamOutput` rather than each one being a separate `String` object. Each name should have ~17 bytes of entropy (16B for the UUID plus a little overhead) so that's a ~4.7× memory saving right away vs the 80-bytes-per-name we have at the moment.
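A minimal standalone sketch of the idea, using plain JDK streams rather than Elasticsearch's actual `BytesStreamOutput`/`StreamInput` classes (the class and method names below are illustrative, not the real repository code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.function.Consumer;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/**
 * Illustration only: append every blob name into a single compressed buffer
 * instead of keeping one String object per name.
 */
public class PackedBlobNames {
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    private final DeflaterOutputStream deflater = new DeflaterOutputStream(bytes);
    private final DataOutputStream out = new DataOutputStream(deflater);

    /** Append one blob name; only the compressed bytes stay on the heap. */
    public void add(String blobName) throws IOException {
        out.writeUTF(blobName);
    }

    /** Finish compression and return the packed representation. */
    public byte[] finish() throws IOException {
        out.flush();
        deflater.finish();
        return bytes.toByteArray();
    }

    /** Stream the names back out of a packed buffer, e.g. when executing the deletes. */
    public static void forEachName(byte[] packed, Consumer<String> consumer) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new InflaterInputStream(new ByteArrayInputStream(packed)))) {
            while (true) {
                try {
                    consumer.accept(in.readUTF());
                } catch (EOFException e) {
                    return;
                }
            }
        }
    }
}
```

Since the names are UUID-like strings with little redundancy, most of the saving here comes from not paying per-`String` object overhead; the compression is a bonus on top.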
As a slightly-less-quick (but still fairly quick) improvement that achieves O(1) memory usage: whenever such a `BytesStreamOutput` gets large enough we could spill its contents out to a blob in the blob store and drop it from memory, then read it back in later on after the new `RepositoryData` is committed and we're processing those deletes. That introduces some complexity around cleaning up those blobs after a master failover, but it seems surmountable.
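For illustration, a rough sketch of the spill mechanism under the same caveat: plain JDK types, with hypothetical `spillToBlobStore`/`readSpilledBlob` hooks standing in for real blob-store calls, and no compression for brevity:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Sketch: accumulate pending blob-name deletions with bounded heap by spilling
 * full buffers to the blob store and reading them back once the repository
 * update is committed. The blob-store hooks below are hypothetical.
 */
public class SpillingDeleteList {
    private static final int SPILL_THRESHOLD_BYTES = 1 << 20; // spill after ~1 MiB buffered

    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(buffer);
    private final List<String> spilledBlobIds = new ArrayList<>(); // ids of spill blobs, tiny

    public void add(String blobName) throws IOException {
        out.writeUTF(blobName);
        if (buffer.size() >= SPILL_THRESHOLD_BYTES) {
            spill();
        }
    }

    private void spill() throws IOException {
        out.flush();
        // hypothetical call: write the buffer as a temporary blob and remember its id
        String spillBlobId = spillToBlobStore(buffer.toByteArray());
        spilledBlobIds.add(spillBlobId);
        buffer.reset(); // heap usage drops back to ~zero
    }

    /** After the new RepositoryData is committed, replay every recorded name. */
    public void forEachPendingDelete(Consumer<String> deleter) throws IOException {
        if (buffer.size() > 0) {
            spill(); // flush whatever is still buffered
        }
        for (String spillBlobId : spilledBlobIds) {
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(readSpilledBlob(spillBlobId)))) {
                while (true) {
                    try {
                        deleter.accept(in.readUTF());
                    } catch (EOFException e) {
                        break;
                    }
                }
            }
        }
    }

    // --- hypothetical blob-store hooks, not real Elasticsearch APIs ---
    private String spillToBlobStore(byte[] contents) throws IOException {
        throw new UnsupportedOperationException("illustrative stub");
    }

    private byte[] readSpilledBlob(String spillBlobId) throws IOException {
        throw new UnsupportedOperationException("illustrative stub");
    }
}
```

Only the small list of spill-blob ids stays on the heap, which is what bounds memory; the failover-cleanup concern in the comment above is about making sure those spill blobs themselves get deleted if a new master takes over mid-operation.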