
Memory leak over time up until OOM-killer kills awesome

Open ivan-kolmychek opened this issue 4 years ago • 20 comments

Extracted from #2798, as asked by @psychon.

Original message from this comment by me:


I hope you don't mind me joining in here - I believe I may have a similar issue, but it happens regardless of whether there is any activity or not. It started somewhere around 3-6 months ago, after one of the system updates.

(Disclaimer: I've tried to provide as much information as possible; please excuse me if most of it proves to be useless.)

It manifests itself as awesome eating more and more memory over time; if left unchecked, it eventually eats all available memory and gets killed by the OOM killer. I initially assumed it was a problem somewhere in my configuration and got into the habit of restarting awesome occasionally, but today I decided it was time to look around and see if anyone else has this problem.

I use Arch Linux and keep it up to date most of the time.

I have 8 GB of RAM and use 1/8 of that for zram:

[ivanko] ~
$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7860         860        5684         215        1316        6495
Swap:           982         156         826
[ivanko] ~
$ swapon
NAME       TYPE        SIZE   USED  PRIO
/dev/zram0 partition 982.6M 156.4M 32767

Here is awesome --version:

awesome v4.3 (Too long)
 • Compiled against Lua 5.3.5 (running with Lua 5.3)
 • D-Bus support: ✔
 • execinfo support: ✔
 • xcb-randr version: 1.6
 • LGI version: 0.9.2

Awesome is integrated as the window manager in Plasma (KDE):

[ivanko] ~
$ cat ./.config/plasma-workspace/env/set_window_manager.sh
export KDEWM=awesome

Here is the relevant version of my current rc.lua and here is the relevant version of my current theme.

It may very well be that something in there is responsible for this, but I've kept it relatively unchanged for quite a long time and I don't remember having this issue before.

Here are some examples of messages in dmesg from the oom-killer about killing awesome:

...
[35453.563484] Out of memory: Killed process 895 (awesome) total-vm:6524164kB, anon-rss:6322804kB, file-rss:644kB, shmem-rss:8200kB
[35453.826430] oom_reaper: reaped process 895 (awesome), now anon-rss:0kB, file-rss:0kB, shmem-rss:8200kB
...
[71649.860006] Out of memory: Killed process 22366 (awesome) total-vm:6412504kB, anon-rss:6203484kB, file-rss:888kB, shmem-rss:8200kB
[71650.113963] oom_reaper: reaped process 22366 (awesome), now anon-rss:0kB, file-rss:0kB, shmem-rss:8200kB
...

Here are the results of running the suggested code once in a while:

[ivanko] ~
$ date; awesome-client 'local a = collectgarbage("count") collectgarbage() return a, collectgarbage("count")'
Mon 19 Aug 2019 02:07:00 PM EEST
   double 4477.29
   double 4386.45
[ivanko] ~
$ date; awesome-client 'local a = collectgarbage("count") collectgarbage() return a, collectgarbage("count")'
Mon 19 Aug 2019 02:18:08 PM EEST
   double 16116.3
   double 6450.73
[ivanko] ~
$ date; awesome-client 'local a = collectgarbage("count") collectgarbage() return a, collectgarbage("count")'
Mon 19 Aug 2019 02:33:27 PM EEST
   double 18361.5
   double 7539.17
[ivanko] ~
$ date; awesome-client 'local a = collectgarbage("count") collectgarbage() return a, collectgarbage("count")'
Mon 19 Aug 2019 02:43:17 PM EEST
   double 13798.1
   double 6525.56
[ivanko] ~
$ date; awesome-client 'local a = collectgarbage("count") collectgarbage() return a, collectgarbage("count")'
Mon 19 Aug 2019 03:57:44 PM EEST
   double 14617.3
   double 6685.55
[ivanko] ~
$ date; awesome-client 'local a = collectgarbage("count") collectgarbage() return a, collectgarbage("count")'
Mon 19 Aug 2019 04:37:59 PM EEST
   double 17576.7
   double 8114.28

If I use (or leave) the laptop for a long time, this happens:

Before:

[ivanko] ~
$ date; ps vx | grep awesome
Mon 19 Aug 2019 07:31:07 PM EEST
  885 ?        S      0:00      1    12 52187     0  0.0 kwrapper5 /usr/bin/ksmserver --windowmanager awesome
  886 ?        Sl     0:22    128     4 385623 11172  0.1 /usr/bin/ksmserver --windowmanager awesome
17345 pts/11   S+     0:00      0   114  6457  2368  0.0 grep awesome
31415 ?        Sl     3:36     63   250 917929 744836  9.2 awesome

Some time (usually a few hours) later:

[104616.957517] Out of memory: Killed process 31415 (awesome) total-vm:6361980kB, anon-rss:6159304kB, file-rss:1556kB, shmem-rss:8200kB
[104617.291471] oom_reaper: reaped process 31415 (awesome), now anon-rss:0kB, file-rss:0kB, shmem-rss:8200kB

As a side note on the temporary solution I've used for now: if this happens, I usually either re-login or just log into a console outside of X and run DISPLAY=:0 awesome & to bring it back up.

ivan-kolmychek avatar Aug 21 '19 16:08 ivan-kolmychek

@psychon the issue seems to persist even after I've replaced the wallpaper function with a static file:

...
-- Adapted from https://gist.github.com/anonymous/9072154f03247ab6e28c
local wallpaper_filter = function(s)
  return string.match(s,"%.png$") or string.match(s,"%.jpg$")
end
local wallpaper_dir = basedir .. "/wallpapers/current/"
local wallpaper_files = scandir(wallpaper_dir, wallpaper_filter)
--theme.wallpaper = function()
--  local wallpaper_index = math.random(1, #wallpaper_files)
--  local wallpaper_file = wallpaper_files[wallpaper_index]
--
--  return wallpaper_dir .. wallpaper_file
--end
theme.wallpaper = wallpaper_dir .. "172787.jpg"
...

I have reloaded Awesome after this and confirmed that the wallpaper now stays static, just in case.

Memory at the start:

[ivanko] ~
$ date; ps -eo comm,rss | grep awesome
Wed 21 Aug 2019 04:25:03 PM EEST
awesome         288084

In a few hours this happens:

[148060.118981] Out of memory: Killed process 30163 (awesome) total-vm:6494612kB, anon-rss:6260768kB, file-rss:2144kB, shmem-rss:8200kB
[148060.381534] oom_reaper: reaped process 30163 (awesome), now anon-rss:0kB, file-rss:0kB, shmem-rss:8200kB

Please let me know if you want me to try changing anything else; maybe the issue is still somewhere in my config.

ivan-kolmychek avatar Aug 21 '19 16:08 ivan-kolmychek

@psychon Oops, I forgot to disable the timer that still tries to switch the wallpaper. My bad.

I've commented it out in rc.lua:

awful.screen.connect_for_each_screen(function(s)
    -- Wallpaper
    set_wallpaper(s)

    --local wallpaper_change_time = 30
    --local wallpaper_change_timer = gears.timer {
    --  timeout = wallpaper_change_time
    --}
    --wallpaper_change_timer:connect_signal("timeout", function()
    --  set_wallpaper(s)
    --  wallpaper_change_timer:stop()
    --  wallpaper_change_timer.timeout = wallpaper_change_time
    --  wallpaper_change_timer:start()
    --end)
    --
    --wallpaper_change_timer:start()

    -- Each screen has its own tag table.
    awful.tag({ "1", "2", "3", "4", "5", "6", "7", "8", "9", "a", "b", "c", "d", "e", "f" }, s, awful.layout.layouts[1])

Let's see if this fixes the issue.

ivan-kolmychek avatar Aug 22 '19 13:08 ivan-kolmychek

I have a static wallpaper and I am also facing a memory leak.

ymartin59 avatar Aug 22 '19 20:08 ymartin59

@ymartin59 interesting. It may be a completely unrelated issue, because so far I've seen no memory leaks since I made the change to rc.lua that I mentioned here - now ps -eo comm,rss | grep awesome shows numbers around 64000-65000 for my setup, regardless of the amount of activity or time passed.

I haven't reported any of this here until now because I wanted to test it more. I will keep the issue open a while longer while I confirm that the issue is actually gone, but so far it looks solved to me.

I also have another machine with a very similar awesome and OS configuration that suffers from the same problem, so I have an opportunity to double-check the fix.

ivan-kolmychek avatar Aug 24 '19 18:08 ivan-kolmychek

@ivan-kolmychek So, assuming this is now "fixed" (well, worked around) for you: do I understand correctly that you get a memory leak just from having a timer that stops and re-starts itself in your config? The only reason I asked you to try it without that timer is that you said memory usage also keeps increasing when you are not using your computer.

Hm... This might be one of those things that just happen due to some "weird" interaction between LGI and Lua's garbage collector. I do not have time right now to try to reproduce this, and if it really is the problem I suspect, then there is not much we can do to actually fix it (this is just how Lua's GC works...).

Another work-around that keeps the functionality would be to give the garbage collector a little kick in your timer: call collectgarbage("step", 1234), where 1234 is just a number you keep increasing until the issue disappears. This basically says "hey Lua GC, do some steps pretending that 1234 KiB (?) of memory were just allocated". Because of this, Lua's GC runs more often and manages to collect garbage faster than it accumulates.
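
For illustration, a minimal sketch of such a timer with the GC kick added (untested, not taken from any config in this thread; it assumes the set_wallpaper helper and screen iteration from the default rc.lua, and the step size is just a starting value to tune):

local gears = require("gears")

-- Rotate the wallpaper periodically and give the GC a kick afterwards.
local wallpaper_change_timer = gears.timer {
    timeout   = 30,   -- seconds between wallpaper changes
    autostart = true,
    callback  = function()
        for s in screen do
            set_wallpaper(s)  -- set_wallpaper helper from the default rc.lua
        end
        -- Pretend a few MiB were just allocated so the GC keeps up with image
        -- data that is only reachable through small Lua wrapper objects.
        collectgarbage("step", 4000)
    end,
}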

Is this okay with you?

psychon avatar Aug 25 '19 07:08 psychon

@psychon it seems so. I'm not sure what exactly the cause is; I actually suspected that it allocates memory for each new wallpaper and doesn't remove the old ones from memory. I think I can try to confirm that by leaving the timer in place and commenting out the call to set_wallpaper() inside it.

Thank you for the suggestion on collectgarbage(); I can test it out after that and see whether it solves the issue as well.

As for the previous workaround, I've just set it up on the second machine I mentioned so I can see whether it is fixed there too, just for the sake of confirmation.

ivan-kolmychek avatar Aug 26 '19 07:08 ivan-kolmychek

@psychon I've tested the version with just set_wallpaper() commented out, and it seems the issue isn't in the timer itself, as I still see no memory leaks even when the machine is up and running for many hours:

[ivanko] ~
$ date; ps -eo comm,rss | grep awesome
Mon 26 Aug 2019 10:20:17 AM EEST
awesome         60788
[ivanko] ~
$ date; ps -eo comm,rss | grep awesome
Mon 26 Aug 2019 09:09:27 PM EEST
awesome         57376

Hopefully I can try tomorrow whether the fix you suggested works with the original setup.

ivan-kolmychek avatar Aug 26 '19 18:08 ivan-kolmychek

@psychon I've tested it out, and it seems to solve the issue. To make the test more extreme I made it change the wallpaper every second - without the collectgarbage("step", 4000) call it gobbles up RAM at an incredible pace, eating all of it in minutes, but with it RAM usage stays around the same number for hours, with or without activity.

Thank you for the help. I'm marking the issue as closed.

ivan-kolmychek avatar Aug 27 '19 17:08 ivan-kolmychek

@ivan-kolmychek Could you please provide the complete Lua code excerpt you used to create the timer for "collectgarbage" so that I can test it too?

ymartin59 avatar Aug 28 '19 05:08 ymartin59

I've tested the version with just set_wallpaper() commented out, and it seems the issue isn't in the timer itself, as I still see no memory leaks even when the machine is up and running for many hours:

Doesn't that indicate a problem with set_wallpaper then? (re-opening)

blueyed avatar Aug 28 '19 20:08 blueyed

Finally had some time to test this. Appended this to the default config:

gears.timer.start_new(0.1, function()
    -- Re-set the wallpaper every 0.1s to amplify any per-load leak.
    local wallpaper = beautiful.wallpaper
    -- If wallpaper is a function, call it with the screen
    if type(wallpaper) == "function" then
        wallpaper = wallpaper(screen[1])
    end
    gears.wallpaper.maximized(wallpaper, screen[1], true)
    return true  -- keep the timer running
end)

valgrind --tool=massif says:

99.25% (375,154,070B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->94.87% (358,612,992B) 0x4886BE2: gdk_pixbuf_new (in /usr/lib/x86_64-linux-gnu/libgdk_pixbuf-2.0.so.0.3800.1)
| ->94.87% (358,612,992B) 0x96E813F: ??? (in /usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so)
|   ->94.87% (358,612,992B) 0x488BCF1: gdk_pixbuf_new_from_file (in /usr/lib/x86_64-linux-gnu/libgdk_pixbuf-2.0.so.0.3800.1)
|     ->94.87% (358,612,992B) 0x5F238ED: ffi_call_unix64 (in /usr/lib/x86_64-linux-gnu/libffi.so.6.0.4)
|       ->94.87% (358,612,992B) 0x5F232BE: ffi_call (in /usr/lib/x86_64-linux-gnu/libffi.so.6.0.4)
|         ->94.87% (358,612,992B) 0x7CF9880: ??? (in /usr/lib/x86_64-linux-gnu/liblua5.3-lgi.so.0.0.0)
|           ->94.87% (358,612,992B) 0x56DE584: luaD_precall.part.3 (ldo.c:365)
[...]

So... seems like something is keeping pixbufs alive, even though gears.wallpaper loads them with gears.surface.load_uncached...

Edit: Well... this one is easy. This regression was introduced with ee944f4da8 (part of v4.3). Previously, the code just loaded a cairo surface through the C code and the C code made sure that the Pixbuf that it internally allocated was disposed. Now, Lua uses GdkPixbuf directly and it is the garbage collector's job to get rid of the Pixbuf. And the GC does not see the memory usage for the actual image data.

Edit: Sigh. This can only be fixed by bumping our minimum LGI requirement to version 0.9.2. In that version, the hack getmetatable(obj).__gc(obj) works to "force-kill" an object. In earlier versions, this causes a crash once the GC comes around and LGI tries to GC the object again. However, this hack has a good chance of breaking as well: LGI still remembers the pointer behind obj, and if another object of the same type is later allocated at the same address, LGI will reuse the same Lua wrapper object for the new object.
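
For reference, a rough sketch of that hack (illustrative only; it needs LGI 0.9.2 and comes with the pointer-reuse caveat above):

-- Force-run the LGI wrapper's finalizer so the underlying GObject is
-- unreferenced immediately instead of whenever Lua's GC collects the wrapper.
local function force_dispose(obj)
    getmetatable(obj).__gc(obj)
end

-- Hypothetical use in gears.surface, right after converting the pixbuf:
-- _surface = capi.awesome.pixbuf_to_surface(pixbuf._native, _surface)
-- force_dispose(pixbuf)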

psychon avatar Aug 30 '19 16:08 psychon

Now, Lua uses GdkPixbuf directly and it is the garbage collector's job to get rid of the Pixbuf. And the GC does not see the memory usage for the actual image data.

Can the GC somehow be hinted about it? Is this something lgi could do/handle?

blueyed avatar Aug 30 '19 16:08 blueyed

Can the GC somehow be hinted about it?

The docs for collectgarbage("step") in Lua 5.3 say:

"step": performs a garbage-collection step. The step "size" is controlled by arg. With a zero value, the collector will perform one basic (indivisible) step. For non-zero values, the collector will perform as if that amount of memory (in KBytes) had been allocated by Lua. Returns true if the step finished a collection cycle.

So gears.surface could compute the size of the surface in KBytes and tell the GC about it via collectgarbage("step", size_in_kbytes).

No idea what effects this will have, because this memory is not part of the Lua-managed heap. It is only kept alive through Lua. For example, what if the GC sees "you claim to have allocated 50 MiB of memory, but the Lua heap is only 10 MiB big, so you must be lying". No idea what kind of things happen here...

Is this something lgi could do/handle?

LGI does not see the memory of the actual pixel data either. It could only tell Lua about sizeof(GdkPixbuf), which is not really useful, I think.

psychon avatar Aug 30 '19 16:08 psychon

This seems to work here, but I guess it leads to lots of GC overhead, because the GC is kicked way too much. :-(

diff --git a/lib/gears/surface.lua b/lib/gears/surface.lua
index b85c7de19..dc8bd3a08 100644
--- a/lib/gears/surface.lua
+++ b/lib/gears/surface.lua
@@ -54,6 +54,12 @@ function surface.load_uncached_silently(_surface, default)
         end
         _surface = capi.awesome.pixbuf_to_surface(pixbuf._native, _surface)
 
+        -- We would like for "pixbuf" to be freed now to save memory. Since we
+        -- cannot do that, we try to give Lua's GC a little (or big) kick
+        local pixbuf_size = pixbuf:get_byte_length()
+        pixbuf = nil
+        collectgarbage("step", math.ceil(pixbuf_size / 1024))
+
         -- The shims implement load_image() to return a surface directly,
         -- instead of a lightuserdatum.
         if cairo.Surface:is_type_of(_surface) then

Would be great if one could just tell LGI "unreference this object now".

Edit: Some print output says that the wallpaper being loaded is 9216 KiB while Lua's heap has a total size of 3730 KiB. So yes, this really is a lot more than the actual Lua heap size and thus likely causes a full GC cycle on each call...

psychon avatar Aug 30 '19 17:08 psychon

and thus likely causes a full GC cycle on each call...

How is it freed then, though? While you tell it to free X KiB, I thought it does not know about that memory in the first place, and therefore also cannot free it?

Would be great if one could just tell LGI "unreference this object now".

But this implies that it is known already.

blueyed avatar Aug 30 '19 17:08 blueyed

Lua sees a small-ish userdata object (too lazy to check right now, so let us say it has 24 bytes). So, Lua counts this as 24 bytes of memory usage. This userdata object then has a pointer to a GdkPixbuf, which was allocated via GLib's memory allocation functions. Some member of that struct then contains a pointer to the actual pixel data.

When I tell Lua's GC to free X KiB, it tries to free X KiB of Lua memory. This will most likely free the userdata object that LGI created. In its __gc function it will decrease the reference count of the GdkPixbuf structure. When that reference count reaches zero, the actual pixel data of the image is freed.
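
A small sketch that illustrates this accounting gap (assuming lgi is installed and a hypothetical wallpaper.png exists):

local GdkPixbuf = require("lgi").GdkPixbuf

local before = collectgarbage("count")
local pixbuf = GdkPixbuf.Pixbuf.new_from_file("wallpaper.png")
local after  = collectgarbage("count")

-- The Lua heap grows by roughly the size of the small userdata wrapper, while
-- the pixel data can be many MiB and is invisible to Lua's GC.
print(string.format("Lua heap grew by %.1f KiB; pixel data is %.1f KiB",
                    after - before, pixbuf:get_byte_length() / 1024))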

I had another idea on how to fix this regression: Revert the commit that introduced it.

diff --git a/build-utils/lgi-check.c b/build-utils/lgi-check.c
index c0e3bfcd2..9d7a2261e 100644
--- a/build-utils/lgi-check.c
+++ b/build-utils/lgi-check.c
@@ -35,7 +35,7 @@ const char commands[] =
 "        '0.8.0', require('lgi.version')))\n"
 "end\n"
 "lgi = require('lgi')\n"
-"assert(lgi.cairo, lgi.Pango, lgi.PangoCairo, lgi.GLib, lgi.Gio, lgi.GdkPixbuf)\n"
+"assert(lgi.cairo, lgi.Pango, lgi.PangoCairo, lgi.GLib, lgi.Gio)\n"
 ;
 
 int main()
diff --git a/lib/gears/surface.lua b/lib/gears/surface.lua
index b85c7de19..89dbfdd3b 100644
--- a/lib/gears/surface.lua
+++ b/lib/gears/surface.lua
@@ -8,7 +8,6 @@ local setmetatable = setmetatable
 local type = type
 local capi = { awesome = awesome }
 local cairo = require("lgi").cairo
-local GdkPixbuf = require("lgi").GdkPixbuf
 local color = nil
 local gdebug = require("gears.debug")
 local hierarchy = require("wibox.hierarchy")
@@ -38,6 +37,7 @@ end
 -- @return An error message, or nil on success.
 -- @staticfct load_uncached_silently
 function surface.load_uncached_silently(_surface, default)
+    local file
     -- On nil, return some sane default
     if not _surface then
         return get_default(default)
@@ -48,11 +48,12 @@ function surface.load_uncached_silently(_surface, default)
     end
     -- Strings are assumed to be file names and get loaded
     if type(_surface) == "string" then
-        local pixbuf, err = GdkPixbuf.Pixbuf.new_from_file(_surface)
-        if not pixbuf then
-            return get_default(default), tostring(err)
+        local err
+        file = _surface
+        _surface, err = capi.awesome.load_image(file)
+        if not _surface then
+            return get_default(default), err
         end
-        _surface = capi.awesome.pixbuf_to_surface(pixbuf._native, _surface)
 
         -- The shims implement load_image() to return a surface directly,
         -- instead of a lightuserdatum.

psychon avatar Aug 30 '19 18:08 psychon

Could you please provide the complete Lua code excerpt you used to create the timer for "collectgarbage" so that I can test it too?

@ymartin59 here is the commit with the workaround. Please excuse that the config may not be up to date with the current default config; I've carried it over from version to version for some time now, so I think it's long overdue for some cleanup (as in "take the default config, re-adapt it based on the old one, throw away the old one" :slightly_smiling_face:).

Doesn't that indicate a problem with set_wallpaper then? (re-opening)

@blueyed given what you and @psychon have discussed after that comment, it seems the cause of the issue has now been identified. Unfortunately, I don't think I'm familiar enough with Lua and its internals to contribute much to the discussion, so I should probably refrain from further comments to avoid interfering.

ivan-kolmychek avatar Aug 31 '19 08:08 ivan-kolmychek

I'm also experiencing this problem, but I also use a static wallpaper. Anything I can do to help debug this?

crides avatar May 07 '20 08:05 crides

I am no longer experiencing this memory leak on Debian Testing with the 4.3-4 version. As I do not refresh the wallpaper, it is possible that this failure came from my (previous) regular usage of VMware Workstation Player.

ymartin59 avatar May 31 '20 19:05 ymartin59

awesome v4.3 (Too long)
 • Compiled against Lua 5.3.6 (running with Lua 5.3)
 • D-Bus support: ✔
 • execinfo support: ✔
 • xcb-randr version: 1.6
 • LGI version: 0.9.2

I added an awful.widget.watch() to my rc.lua that runs my shell script, and it seems to leak memory each time it runs. I tested it with different timeouts, and it seems that each time the widget runs my command it leaks some memory. I added a timer at the start of rc.lua that calls collectgarbage("step", 50) and it seems to help, but I still don't understand what is actually happening.
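
A rough sketch of what that GC timer can look like (the timeout here is a placeholder to tune; only the collectgarbage("step", 50) value is from my setup):

local gears = require("gears")

-- Periodically nudge the GC so externally allocated memory that is only
-- reachable through small Lua wrappers gets released in a timely manner.
local gc_timer = gears.timer {
    timeout   = 5,    -- seconds between GC kicks
    autostart = true,
    callback  = function()
        collectgarbage("step", 50)
    end,
}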

prodisDown avatar Nov 27 '21 02:11 prodisDown