
Follower memory spike and increased latency after leader AOF shrink

Open krkeshav opened this issue 8 months ago • 6 comments

Hi @tidwall, first of all, I love your projects; they are helping me so much to learn about Go. I just wanted to report a strange behaviour in tile38 followers.

Describe the bug
An AOF shrink on the leader causes a memory spike in the followers and an increase in latency. The memory has not gone down even after 6 hours.

To Reproduce
Have some data in the AOF file, then execute the aofshrink command on the leader. This results in a memory increase on the followers, as the leader sends either command-by-command updates or the whole AOF file. I even tried manually running the gc command on the followers, but it doesn't help. Our AOF file size is around 27 MB.
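
Roughly, the repro can be scripted like this (a sketch only; the addresses and the collection name are placeholders, and Tile38 accepts any Redis/RESP client):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// Placeholder addresses; Tile38 speaks the Redis protocol, so any RESP client works.
	leader := redis.NewClient(&redis.Options{Addr: "tile38-leader:9851"})
	follower := redis.NewClient(&redis.Options{Addr: "tile38-follower:9851"})

	// Write some data so the AOF has content.
	for i := 0; i < 10000; i++ {
		id := fmt.Sprintf("truck%d", i)
		if err := leader.Do(ctx, "SET", "fleet", id, "POINT", 33.5, -112.2).Err(); err != nil {
			log.Fatal(err)
		}
	}

	// Shrink the AOF on the leader, then ask the follower to run a GC,
	// and watch the follower's memory in the dashboards.
	fmt.Println(leader.Do(ctx, "AOFSHRINK").Val())
	fmt.Println(follower.Do(ctx, "GC").Val())

	// The SERVER command reports aof_size, heap_size, etc.
	// (the reply shape differs between the RESP and JSON interfaces).
	fmt.Println(follower.Do(ctx, "SERVER").Val())
}
```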

Expected behavior
Ideally this should not increase memory on the followers; even if there is an increase, it should go back down after some time.

Operating System (please complete the following information):

  • OS: Linux
  • CPU: Intel
  • Version: 1.30.0
  • Container: Kubernetes via this Helm Chart (link)

Additional context
I am attaching screenshots to show the issue.

Increase in memory in follower pods: Image

Increase in network activity due to syncing with leader: Image

Increase in P99 of our services which directly rely on tile38 followers: Image

krkeshav avatar Apr 30 '25 16:04 krkeshav

Hi krkeshav,

After a GC on the leader and follower, how does the memory between the two compare?

Does the follower memory continue to grow unbounded, or does it end up at a steady, albeit high, level?

tidwall avatar May 02 '25 19:05 tidwall

Hi @tidwall, as mentioned, after aofshrink the follower memory has remained constant at around ~37%, as shared above. As you suggested, I ran GC on both the leader and a follower. Here are the findings:

Leader: Running GC has no effect on the CPU or memory trend of the leader. Image

Follower: The memory goes down for a bit and CPU goes up due to the GC, but after a few seconds the memory is back up to the previous level. Image

I also ran the server command on both the leader and the follower; here are the responses for both.

Leader:
{
  "ok": true,
  "stats": {
    "aof_size": 33245016,
    "avg_item_size": 554,
    "cpus": 16,
    "heap_released": 179208192,
    "heap_size": 310856832,
    "http_transport": true,
    "id": "3035dc2184515ed212aw4dfee20858ea",
    "in_memory_size": 11252180,
    "max_heap_size": 0,
    "mem_alloc": 310856832,
    "num_collections": 2,
    "num_hooks": 0,
    "num_objects": 67410,
    "num_points": 560867,
    "num_strings": 0,
    "pending_events": 0,
    "pid": 1,
    "pointer_size": 8,
    "read_only": false,
    "threads": 24,
    "version": "1.30.0"
  },
  "elapsed": "3.598169ms"
}

Follower: { "ok": true, "stats": { "aof_size": 66500340, "avg_item_size": 164, "caught_up": true, "caught_up_once": true, "cpus": 16, "following": "tile38-leader.svc.cluster.local:9851", "heap_released": 33382400, "heap_size": 92202792, "http_transport": true, "id": "71401d587dbf99e45a20ecf328dfv5dc", "in_memory_size": 11252180, "max_heap_size": 0, "mem_alloc": 92202792, "num_collections": 2, "num_hooks": 0, "num_objects": 67410, "num_points": 560867, "num_strings": 0, "pending_events": 0, "pid": 1, "pointer_size": 8, "read_only": false, "threads": 24, "version": "1.30.0" }, "elapsed": "201.299µs" }

One notable thing is that the aof_size reported on the follower is twice the aof_size on the leader, while in_memory_size, num_objects, and num_points are the same on both. Also, the leader's memory is higher because it has around 950 connected clients from multiple pods, whereas the follower has only 20-30 connections.
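
For anyone who wants to repeat this comparison, here is a rough sketch that pulls the same fields from the SERVER response over Tile38's HTTP interface (the hostnames are placeholders for our leader and follower services):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// serverResp matches the shape of the SERVER response shown above;
// only the fields we care about are decoded.
type serverResp struct {
	OK    bool `json:"ok"`
	Stats struct {
		AOFSize      int64 `json:"aof_size"`
		HeapSize     int64 `json:"heap_size"`
		HeapReleased int64 `json:"heap_released"`
		InMemorySize int64 `json:"in_memory_size"`
		NumObjects   int64 `json:"num_objects"`
	} `json:"stats"`
}

func fetchStats(addr string) (serverResp, error) {
	var sr serverResp
	resp, err := http.Get("http://" + addr + "/server")
	if err != nil {
		return sr, err
	}
	defer resp.Body.Close()
	err = json.NewDecoder(resp.Body).Decode(&sr)
	return sr, err
}

func main() {
	// Placeholder hostnames for the leader and follower HTTP endpoints.
	for _, addr := range []string{"tile38-leader:9851", "tile38-follower:9851"} {
		sr, err := fetchStats(addr)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s aof_size=%d heap_size=%d in_memory_size=%d objects=%d\n",
			addr, sr.Stats.AOFSize, sr.Stats.HeapSize, sr.Stats.InMemorySize, sr.Stats.NumObjects)
	}
}
```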

krkeshav avatar May 03 '25 03:05 krkeshav

Okay, I debugged this locally and found a very interesting insight. I ran a tile38 leader with some AOF data, then spawned a follower with no data, connected it to the leader, and let it sync.

Image

The first spike in memory is due to the initial sync from the leader, and at that point the aof size on the leader and the follower was similar. But when I ran aofshrink on the leader, there was another spike: the aof size reported on the follower doubled, and the memory was almost double the previous value. Even after the GC command it didn't go down. However, the actual aof file size on the follower (checked by running ls) was correct, i.e. the same as the leader's, but the server command was reporting double.

Here comes the interesting part: when I ran aofshrink again, the memory came down and the aof file size on the follower reported by server was back to the original size. If you look at the graph above, the dip is due to the second aofshrink. After that I ran lots of subsequent aofshrink and gc commands, but the memory stayed the same.

Seeing this pattern, I ran aofshrink on my prod server a second time and, surprisingly, the memory came down. I am also seeing a dip in latency, but to confirm the latency improvement I need to monitor for a day during peak hours.

Image

There's something odd happening on the first aofshrink. It would be very useful if you could share what happens when we shrink the AOF, how the leader communicates it to the followers, and the reason for this behaviour.

krkeshav avatar May 03 '25 04:05 krkeshav

I appreciate the detailed description of the problem.

It would be very useful if you could share what happens when we shrink the AOF, how the leader communicates it to the followers, and the reason for this behaviour.

The AOFSHRINK command rewrites the AOF file so it can be as small as possible. It does this by scanning each collection and writing each object, one by one, to a new AOF file. Then it discards the old one. The code is in the aofshrink.go file, and it's a doozy of a function.

One thing to keep in mind is that the shrink operation must lock the database when it reads objects from the collections. But it doesn't hold the lock for very long, just enough time to grab up to 32 objects at a time. Then it unlocks, locks again for the next 32 objects, and so forth. This ensures the system as a whole continues to run. But because this locking exists, even though the locks are very brief, you will probably see some increase in latency for connected clients. (There's a small sketch of this pattern after the list below.)

But the logic is rather simple. When called:

  1. It first makes sure that no one else is running the shrink operation; if one is already running, it bails.
  2. Then it stores the size of the current AOF file. This is used at the end of the operation to copy any new AOF commands that were written while the operation was running.
  3. It creates a new file with the same name as the active one but with a "-shrink" suffix. Upon completion this becomes the new AOF file.
  4. Then it loops over every collection and object, writing up to 32 objects at a time to a new AOF buffer. This is where the database is locked for a moment.
  5. After each 32-object chunk, and while unlocked, it flushes the buffer to the new AOF file.
  6. After all known objects are written to the new file, the operation takes the size stored in step 2 and copies all the new commands that were written to the current AOF file into the new AOF file.
  7. Finally, it syncs the new file to disk and atomically replaces the old file with the new one.
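
To make the chunked locking in steps 4 and 5 concrete, here is a minimal, illustrative sketch of the pattern; the types and helpers are stand-ins, not the actual aofshrink.go code:

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"sync"
)

// Illustrative stand-ins for the real collection types; not Tile38 code.
type object struct{ key, id, value string }

type db struct {
	mu   sync.Mutex
	objs []object
}

// nextObjects returns up to n objects starting at offset; the caller holds the lock.
func (d *db) nextObjects(offset, n int) []object {
	if offset >= len(d.objs) {
		return nil
	}
	end := offset + n
	if end > len(d.objs) {
		end = len(d.objs)
	}
	return d.objs[offset:end]
}

const chunkSize = 32

// shrink writes every object to dst in chunks of 32, holding the lock only
// while reading each chunk and flushing to disk while unlocked.
func shrink(d *db, dst *os.File) error {
	var buf bytes.Buffer
	for offset := 0; ; offset += chunkSize {
		d.mu.Lock()
		objs := d.nextObjects(offset, chunkSize)
		d.mu.Unlock()
		if len(objs) == 0 {
			return nil
		}
		buf.Reset()
		for _, o := range objs {
			// Stand-in encoding; the real code writes AOF-formatted SET commands.
			fmt.Fprintf(&buf, "SET %s %s %s\n", o.key, o.id, o.value)
		}
		if _, err := dst.Write(buf.Bytes()); err != nil {
			return err
		}
	}
}

func main() {
	d := &db{}
	for i := 0; i < 100; i++ {
		d.objs = append(d.objs, object{"fleet", fmt.Sprintf("truck%d", i), "POINT 33.5 -112.2"})
	}
	f, err := os.Create("appendonly.aof-shrink")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := shrink(d, f); err != nil {
		panic(err)
	}
}
```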

tidwall avatar May 03 '25 17:05 tidwall

Thank you so much for the explanation, @tidwall. Based on your expertise, any insight into why the first aofshrink behaves this way and the second aofshrink fixes the issue? If you can point me in the right direction, I can try debugging this, and if it really is an issue I'll raise a PR.

I have been trying to understand the whole codebase, but it's huge, haha. It will take me some time to fully understand it. Currently my understanding is at the level where I was able to download the aof file using a Go script, by connecting to the leader as a follower using the aof command.
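
For context, the script was roughly along these lines (a sketch only; the leader address is a placeholder, and the exact framing of the reply before the raw stream is worth confirming against the Tile38 source):

```go
package main

import (
	"io"
	"log"
	"net"
	"os"
)

func main() {
	// Hypothetical leader address.
	conn, err := net.Dial("tcp", "tile38-leader:9851")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Ask the leader to stream its AOF starting at position 0, the same way
	// a follower does.
	if _, err := conn.Write([]byte("AOF 0\r\n")); err != nil {
		log.Fatal(err)
	}

	out, err := os.Create("leader-copy.aof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// The leader keeps the connection open and streams new commands as they
	// are appended, so this copy runs until the connection is closed.
	if _, err := io.Copy(out, conn); err != nil {
		log.Fatal(err)
	}
}
```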

krkeshav avatar May 03 '25 19:05 krkeshav

Based on your expertise, any insight into why the first aofshrink behaves this way and the second aofshrink fixes the issue?

There's nothing inherent about the aofshrink operation that would cause a significant change to memory. The buffers generally stay small and there are no allocations that outlive the function, meaning the heap should not be greatly affected after the operation completes.

It's very possible that what you are seeing with the memory spikes and drops is due to the heuristics of Go's memory management.

A program like Tile38, where it's possible to have hundreds of millions of objects, all as individual allocations, relies heavily on Go's memory management and garbage collector to keep the system running as smoothly as possible.

Go decides when to run the garbage collector and when to free RSS memory based on various factors, which at times feels a bit like a black box. The internal decisions that Go makes around memory management have changed quite a bit over the years. For example, when Tile38 was first released in early 2016, concurrent mark-and-sweep GC was new (Go 1.5); before that the GC was stop-the-world, which would have made a program like Tile38 unusable.

Since then there have been plenty of changes, all of which seem pretty decent, particularly for Tile38, with the most significant arriving in 2019 with Go 1.12. That change greatly improved the stability and performance of massive in-memory data stores like Tile38, and I briefly expressed my excitement about it back then.

A Tile38 follower has different memory usage than the leader. A leader will hold more write locks in general and is responsible for more network writes too, while the follower, depending on how it's used, may see fewer mutations than the leader. That usage difference can lead to vastly different memory access patterns, patterns that could affect how the memory management behaves. In short, I would shy away from expecting the follower's RSS memory to match the leader's one-to-one.

Rather, what I learned early on is that the most important thing to watch for when monitoring RSS memory usage is whether the memory grows unbounded (a memory leak) or you see out-of-memory panics. If the memory spikes but then lowers, or it feels sorta high yet levels off over time, then it may not be a concern.
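
If it helps with the monitoring side, here's a tiny illustrative sketch of the runtime counters that the SERVER fields above appear to map to (heap_size/mem_alloc look like MemStats.HeapAlloc, heap_released like MemStats.HeapReleased). It's not Tile38 code, just a way to see why the Go heap and the RSS seen by Kubernetes can diverge:

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("HeapAlloc=%d HeapSys=%d HeapReleased=%d\n",
		ms.HeapAlloc, ms.HeapSys, ms.HeapReleased)

	// Force a GC and return as much memory to the OS as possible, then look
	// again. The RSS reported by the OS can still lag behind, because released
	// pages are reclaimed lazily by the kernel.
	debug.FreeOSMemory()

	runtime.ReadMemStats(&ms)
	fmt.Printf("after FreeOSMemory: HeapAlloc=%d HeapSys=%d HeapReleased=%d\n",
		ms.HeapAlloc, ms.HeapSys, ms.HeapReleased)
}
```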

I have been trying to understand the whole codebase, but it's huge, haha.

Yes there's a lot in there :)

If you do discover a memory leak in the aofshrink operation (or anywhere in Tile38, for that matter), feel free to report it and we'll get it fixed asap. :)

tidwall avatar May 03 '25 20:05 tidwall