apisix icon indicating copy to clipboard operation
apisix copied to clipboard

bug: openresty child process cpu user space occupies 100%

Open chenqiaorui opened this issue 2 years ago • 16 comments

Current Behavior

openresty其中一个子进程cpu100%,使用ltrace追踪该异常进程,结果如下: ... lua_gettop(0x7fda9c61f508, 0x7fdaa930300e, 0x7fda9c0e2b68, 0x7fda9c0e2c10) = 1 lua_getexdata(0x7fda9c61f508, 0x7fdaa930300e, 0x7fda9c0e2b68, 0x7fda9c0e2c10) = 0x91a650 luaL_checktype(0x7fda9c61f508, 1, 5, 0x7fda9c0e2c10) = 5 lua_rawgeti(0x7fda9c61f508, 1, 1, 44) = 0x7fda9c0e2b78 lua_touserdata(0x7fda9c61f508, 0xffffffff, 0xfff9ffda9c16b090, 44) = 0x7fda9c16b0c0 lua_pushnil(0x7fda9c61f508, -1, 69, -13) = 0x7fda9c0e2b80 lua_pushlstring(0x7fda9c61f508, 0x54cf9b, 6, -13) = 0xfffdffdaa9302ff0 lua_gettop(0x7fda9c61f508, 0x7fdaa930300e, 0x7fda9c0e2b68, 0x7fda9c0e2c10) = 1 lua_getexdata(0x7fda9c61f508, 0x7fdaa930300e, 0x7fda9c0e2b68, 0x7fda9c0e2c10) = 0x91a650 luaL_checktype(0x7fda9c61f508, 1, 5, 0x7fda9c0e2c10) = 5 lua_rawgeti(0x7fda9c61f508, 1, 1, 44) = 0x7fda9c0e2b78 lua_touserdata(0x7fda9c61f508, 0xffffffff, 0xfff9ffda9c16b090, 44) = 0x7fda9c16b0c0 lua_pushnil(0x7fda9c61f508, -1, 69, -13) = 0x7fda9c0e2b80 lua_pushlstring(0x7fda9c61f508, 0x54cf9b, 6, -13) = 0xfffdffdaa9302ff0 ...

Expected Behavior

No response

Error Logs

No response

Steps to Reproduce

1.apisix是源码编译,部署在centos 2.不稳定复现,apisix运行两个月左右就出现一个nginx子进程cpu占用100%。

Environment

  • APISIX version (run apisix version): 2.9
  • Operating system (run uname -a): Linux test-181-102-abjc 3.10.0-514.10.2.el7.x86_64 #1 SMP Fri Mar 3 00:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • OpenResty / Nginx version (run openresty -V or nginx -V): nginx version: openresty/1.19.9.1
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info): 3.4.0
  • APISIX Dashboard version, if relevant: 2.7
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version): /usr/local/bin/luarocks 3.4.0

chenqiaorui avatar Jul 27 '22 10:07 chenqiaorui

It looks like the rawget is caused dead loop.

Can you provide more information?

tzssangglass avatar Jul 27 '22 11:07 tzssangglass

You can use lbt to get the Lua stack: https://github.com/spacewander/openresty-gdb-utils/#lbt

spacewander avatar Jul 28 '22 03:07 spacewander

hi, spacewander. After i type lbt, it return "No global L located (tried globalL and ngx_cycle)".

The details:

[root@ricky apisix-debug]# ps aux|grep nginx
root       1748  0.0  0.0  27660  1584 ?        Ss   14:14   0:00 nginx: master process openresty -p /opt/apisix-debug
nobody     1749  0.0  0.0  28108  2336 ?        S    14:14   0:00 nginx: worker process
root       1751  0.0  0.0 112812   980 pts/0    R+   14:14   0:00 grep --color=auto nginx
[root@ricky apisix-debug]# gdb -p 1749
GNU gdb (GDB) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 1749
Reading symbols from /usr/local/openresty/nginx/sbin/nginx...(no debugging symbols found)...done.
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done.
Reading symbols from /usr/local/openresty/luajit/lib/libluajit-5.1.so.2...(no debugging symbols found)...done.
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Reading symbols from /usr/local/openresty/pcre/lib/libpcre.so.1...(no debugging symbols found)...done.
Reading symbols from /usr/local/openresty/openssl111/lib/libssl.so.1.1...(no debugging symbols found)...done.
Reading symbols from /usr/local/openresty/openssl111/lib/libcrypto.so.1.1...(no debugging symbols found)...done.
Reading symbols from /usr/local/openresty/zlib/lib/libz.so.1...(no debugging symbols found)...done.
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Reading symbols from /lib64/libfreebl3.so...Reading symbols from .gnu_debugdata for /lib64/libfreebl3.so...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
0x00007f143fac20c3 in __epoll_wait_nocancel () from /lib64/libc.so.6
(gdb) source luajit21.py
(gdb) lbt
No global L located (tried globalL and ngx_cycle)

What should i do?

added ~/.gdbinit configuration:

directory /opt/apisix-debug/openresty-gdb-utils-lbt_gc64
  
py import sys
py sys.path.append("/opt/apisix-debug/openresty-gdb-utils-lbt_gc64")

source luajit20.gdb
source ngx-lua.gdb
source luajit21.py
source ngx-raw-req.py
set python print-stack full

chenqiaorui avatar Aug 08 '22 06:08 chenqiaorui

I try to install openresty-debuginfo-1.19.9.1-1.el7 but lbt return nothing:

(gdb) source luajit21.py
(gdb) lbt
(gdb) 
(gdb)

chenqiaorui avatar Aug 08 '22 07:08 chenqiaorui

What does bt output?

spacewander avatar Aug 08 '22 07:08 spacewander

(gdb) bt #0 0x00007fc2124b70c3 in __epoll_wait_nocancel () from /lib64/libc.so.6 #1 0x000000000044ce08 in ngx_epoll_process_events (cycle=0x20acee0, timer=100, flags=1) at src/event/modules/ngx_epoll_module.c:800 #2 0x0000000000443c65 in ngx_process_events_and_timers (cycle=0x20acee0) at src/event/ngx_event.c:257 #3 0x000000000044b192 in ngx_worker_process_cycle (cycle=cycle@entry=0x20acee0, data=data@entry=0x1) at src/os/unix/ngx_process_cycle.c:782 #4 0x0000000000449977 in ngx_spawn_process (cycle=0x20acee0, proc=0x44b120 <ngx_worker_process_cycle>, data=0x1, name=0x550323 "worker process", respawn=-3) at src/os/unix/ngx_process.c:199 #5 0x000000000044a6b4 in ngx_start_worker_processes (cycle=cycle@entry=0x20acee0, n=8, type=type@entry=-3) at src/os/unix/ngx_process_cycle.c:382 #6 0x000000000044bb4c in ngx_master_process_cycle (cycle=0x20acee0) at src/os/unix/ngx_process_cycle.c:135 #7 0x0000000000423cb0 in main (argc=, argv=) at src/core/nginx.c:386 (gdb)

chenqiaorui avatar Aug 08 '22 07:08 chenqiaorui

You may need to generate a C level on-CPU framegraph. __epoll_wait_nocancel is a blocking syscall, a process can't have 100% CPU usage in user space when it is blocked on this.

spacewander avatar Aug 08 '22 12:08 spacewander

Here is the CI related to generating framegraph: https://github.com/apache/apisix/blob/master/.github/workflows/performance.yml

tzssangglass avatar Aug 08 '22 16:08 tzssangglass

I want to make sure of one question: do you deploy private code code, such as custom plugins?

tzssangglass avatar Aug 08 '22 16:08 tzssangglass

And you also need to provide some configuration information. Since you are using APISIX version 2.9, we have fixed some CPU usage 100% bugs after 2.9, so your information will help us to determine if it is a fixed issue.

tzssangglass avatar Aug 08 '22 16:08 tzssangglass

You may need to generate a C level on-CPU framegraph. __epoll_wait_nocancel is a blocking syscall, a process can't have 100% CPU usage in user space when it is blocked on this.

@spacewander the bt message just from a nomal process. 100% CPU usage child process is killed because i need to recover the production enviroment. I don't know what things cause 100% CPU, so i want to find someways to debug the exception when it happens.

chenqiaorui avatar Aug 09 '22 02:08 chenqiaorui

I want to make sure of one question: do you deploy private code code, such as custom plugins?

@tzssangglass yes, i have a custom plugin. The plugin use redis subscribe and ngx.share. The code can be see below:

local redis_new = require("resty.redis").new
local core = require("apisix.core")
local setmetatable = setmetatable
local tostring = tostring

local _M = {version = 0.1}

local mt = {
    __index = _M
}

function _M.new(plugin_name, conf)
    local self = {conf = conf, plugin_name = plugin_name}
    return setmetatable(self, mt)
end -- redis conf

local function red_conn(conf)
    local red = redis_new() 
    local timeout = conf.redis_timeout or 1000   

    red:set_timeouts(timeout, timeout, timeout) 

    local ok, err = red:connect(conf.redis_host, conf.redis_port or 6379) -- redis conn
    if not ok then
        return false, err
    end

    local count
    count, err = red:get_reused_times()
    if 0 == count then
        if conf.redis_password and conf.redis_password ~= '' then
            local ok, err = red:auth(conf.redis_password) -- redis auth
            if not ok then
                return nil, err
            end
        end
    elseif err then
        return nil, err
    end
    return red, false
end

function _M.token(self, key)
    local auth_token = ngx.shared["auth_token"] 
    if auth_token == ngx.null then
        return false, 'index auth_token error.'
    end

    if auth_token:get(key) then      
		return true, nil -- token vaild, return true
    end

    local conf = self.conf
    local red, err = red_conn(conf) -- redis conn
    if err then
        return false, err
    end
  
    core.log.info("***connect to redis ***")
    local ok, err = red:select(conf.redis_db)
    if not ok then
        return false, err
    end
    local ret, err = red:exists(key) 
    
    if not ret then
        return false, err 
    end
    local ttl, err = red:ttl(key)
    if ttl == -1 then
        ttl = 604800
    end

    ok, err = red:set_keepalive(10000, 100) -- keep redis conn
    if not ok then
        return false, err
    end

    if ret == 0 then
        return false, "rejected" -- token not exists ,return reject
    end

    auth_token:set(key,'',ttl) -- set shared memory

    return true, ret -- token vaild, return true
end

function _M.subscribe(self, func)
    local co = coroutine.create(function()
        local conf = self.conf
        local red, err = red_conn(conf)
        if err then
            core.log.error("failed to conn redis: ", err)
        end
        
        local ok, err = red:subscribe('__keyevent@'..conf.redis_db..'__:del') -- subscribe delete command
        if not ok then
            core.log.error("failed to create subscribe: ", err)
        end
        
        red:subscribe('__keyevent@'..conf.redis_db..'__:expired') -- subscribe expired command
        while true do
            local ret,err = red:read_reply()
            if err then
                core.log.info("failed to read_reply at subscribe: ", err)
            else
                func(ret) -- update shared memeory
            end
            red:set_keepalive(100, 100) -- keep redis conn
        end
    end)
    coroutine.resume(co) 
end

return _M

chenqiaorui avatar Aug 09 '22 02:08 chenqiaorui

added config.yaml configuration:

...
lua_shared_dicts:             # add custom shared cache to nginx.conf
      auth_token: 10m
...
redis:
  redis_host: "xx"
  redis_port: 6379
  redis_password: "xx"
  redis_db: 1
  redis_timeout: 3000
...

chenqiaorui avatar Aug 09 '22 03:08 chenqiaorui

If you can generate a flamgraph at 100% CPU, the problem would be obvious.

tzssangglass avatar Aug 09 '22 06:08 tzssangglass

I captured the perf.data using perf at 100% CPU, maybe it can provide some helpful message?

perf record -g -p 21515 -- sleep 30   

and then,

git clone https://github.com/brendangregg/FlameGraph
mv perf.data ./FlameGraph
perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > flame.svg

Here is flamgraph: flame

chenqiaorui avatar Aug 09 '22 07:08 chenqiaorui

This is a useless flamgraph without any valid stack information. Perhaps you are using perf incorrectly, or you are running APISIX and apisix-base with missing debug information (the official releases of APISIX and apisix-base do have debug information).

tzssangglass avatar Aug 09 '22 15:08 tzssangglass

This issue has been marked as stale due to 350 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Jul 26 '23 10:07 github-actions[bot]

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

github-actions[bot] avatar Aug 09 '23 10:08 github-actions[bot]