
memory leak under high GC pressure

Open · fffonion opened this issue 1 year ago · 1 comment

Using lua-resty-lock to create and dispose of a large number of locks causes the internal memo table to behave incorrectly.

We injected some debugging code into lock.lua to output the "leaked" objects in the memo table:

--- /usr/local/openresty/lualib/resty/lock.lua	2022-07-27 21:12:38.000000000 -0700
+++ resty_lock.lua	2022-07-05 04:22:57.000000000 -0700
@@ -25,6 +25,37 @@
 local memo = {}
 if debug then _M.memo = memo end

+local l = ngx.now()
+local nkeys = require "table.nkeys"
+local function dd()
+        if ngx.now() - l <= 1 then
+                return
+        end
+        l = ngx.now()
+        local total = nkeys(memo)
+        local used = 0
+        for _, k in pairs(memo) do
+                if type(k) ~= "number" then
+                        used = used+ 1
+                end
+        end
+
+        local unused = 0
+        local ii = FREE_LIST_REF
+        while unused < total do
+                local p = memo[ii]
+                if type(p) == "number" then
+                        unused = unused + 1
+                else
+                        break
+                end
+                ii = p
+        end
+
+
+        ngx.log(ngx.ERR, " --------- keys: ", total, " used: ", used, " unused: ", unused, " leak: ", total - used - unused)
+end
+

 local function ref_obj(key)
     if key == nil then
@@ -36,8 +67,17 @@

     else
         ref = #memo + 1
+
+        if ref-1 < nkeys(memo) then
+            ngx.log(ngx.ERR, "########### table tail ", ref-1, " is smaller than nkeys(memo) ", nkeys(memo), "??????")
+        end
     end
+    if type(key) == "number" then
+        ngx.log(ngx.ERR, "########### got a number ", key, " ", debug.traceback())
+    end
+
     memo[ref] = key
+    dd()

     -- print("ref key_id returned ", ref)
     return ref
@@ -219,3 +259,4 @@


 return _M
+

Create a test script:

local lock = require "resty_lock"
local t = 0
while true do
  local locks = {}

  t = t + 1
  for i=1, 5000 do
    locks[i] = lock:new("locks")
    locks[i]:lock(t .. "a" .. i)
    locks[i]:unlock(t .. "a" .. i)
  end
  ngx.sleep(0.1)
end

and a file to inject the shm directive:

$ cat lock_include.conf
lua_shared_dict locks                  8m;

Then run with resty --http-include lock_include.conf test.lua and you will observe the leak: counter increase, sometimes very slowly, sometimes very fast; in one case I was able to observe 10k keys leaked per second:

$ resty --http-include lock_include.conf lock4.lua
2022/07/27 22:20:19 [error] 3067000#0: *2 [lua] resty_lock.lua:56: dd():  --------- keys: 22301 used: 10001 unused: 0 leak: 12300, context: ngx.timer
2022/07/27 22:20:20 [error] 3067000#0: *2 [lua] resty_lock.lua:56: dd():  --------- keys: 36093 used: 10001 unused: 2313 leak: 23779, context: ngx.timer
2022/07/27 22:20:21 [error] 3067000#0: *2 [lua] resty_lock.lua:56: dd():  --------- keys: 46141 used: 10001 unused: 0 leak: 36140, context: ngx.timer

(For convenience, I attached the code used above in an archive, lock_leak.tgz.)

In our production environment we also see this leak happen continuously until it hits the LuaJIT table size limit and throws a table overflow error.

We also investigated the un-released keys in memo and found that all of them are refs to the shdict object created at https://github.com/openresty/lua-resty-lock/blob/master/lib/resty/lock.lua#L90. It seems that when ref_obj is called (https://github.com/openresty/lua-resty-lock/blob/master/lib/resty/lock.lua#L33-L40), the memory allocation for the new key may push LuaJIT into a GC cycle, which in turn calls unref_obj reentrantly and breaks the free-list links in the memo table.
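
For reference, the ref/unref helpers in lock.lua follow roughly the free-list scheme sketched below (paraphrased, not copied verbatim; see the linked source for the exact code). memo[FREE_LIST_REF] holds the head of a linked list of reusable integer slots, and each free slot stores the index of the next free slot as a number. The comments mark where we believe the GC-triggered unref_obj can interleave with ref_obj:

local FREE_LIST_REF = 0
local memo = {}

local function ref_obj(key)
    if key == nil then
        return -1
    end
    local ref = memo[FREE_LIST_REF]
    if ref and ref ~= 0 then
        -- reuse a slot from the free list
        memo[FREE_LIST_REF] = memo[ref]
    else
        -- free list is empty: append to the array part
        ref = #memo + 1
    end
    -- (!) this store may allocate (table resize) and trigger a GC step;
    -- a finalizer can then call unref_obj in the middle of ref_obj, and
    -- the interleaved updates leave stale numeric entries behind
    memo[ref] = key
    return ref
end

local function unref_obj(ref)
    if ref >= 0 then
        -- push the slot back onto the free list
        memo[ref] = memo[FREE_LIST_REF]
        memo[FREE_LIST_REF] = ref
    end
end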

We have come up with two approaches:

  1. Use the undocumented newproxy API to replace ref/unref (https://github.com/Kong/lua-resty-lock/pull/1); we observe a 20% performance drop on the lock/unlock API calls. (A sketch of the newproxy idiom follows this list.)
  2. Reduce the number of refs to the shdict object (https://github.com/Kong/lua-resty-lock/pull/2); this has no performance penalty, but may not fix the root cause of the problem.
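
To illustrate approach 1: newproxy(true) creates a zero-size userdata with its own metatable, so a __gc metamethod can run cleanup when the owning object is collected, without routing every object through a shared memo table. The snippet below is only a sketch of that idiom, not the code from the PR; attach_gc and gc_anchor are hypothetical names:

-- LuaJIT / Lua 5.1: newproxy is undocumented but available
local function attach_gc(cleanup)
    local proxy = newproxy(true)           -- userdata with a fresh metatable
    getmetatable(proxy).__gc = cleanup     -- runs when the proxy is collected
    return proxy
end

-- hypothetical usage: keep the proxy on the owning object so the cleanup
-- fires when that object becomes garbage, instead of keeping integer refs
-- into a shared memo table
local obj = { dict = ngx.shared.locks, key = "some-key" }
obj.gc_anchor = attach_gc(function()
    -- release whatever the object was holding
end)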

fffonion · Jul 28 '22 05:07

Thanks for your detailed analysis!

Personally, I will vote for solution one.

As for solution two, it reduces the number of unref_obj calls with dict_id, but doesn't reduce the number of unref_obj calls with key_id. So the problem still exists, just less frequently, and it may simply take more time to reach the table overflow error.

spacewander · Jul 29 '22 09:07