bug: openresty child process occupies 100% CPU in user space
Current Behavior
One of the OpenResty child processes is at 100% CPU. Tracing the abnormal process with ltrace gives the following result:

```
...
lua_gettop(0x7fda9c61f508, 0x7fdaa930300e, 0x7fda9c0e2b68, 0x7fda9c0e2c10) = 1
lua_getexdata(0x7fda9c61f508, 0x7fdaa930300e, 0x7fda9c0e2b68, 0x7fda9c0e2c10) = 0x91a650
luaL_checktype(0x7fda9c61f508, 1, 5, 0x7fda9c0e2c10) = 5
lua_rawgeti(0x7fda9c61f508, 1, 1, 44) = 0x7fda9c0e2b78
lua_touserdata(0x7fda9c61f508, 0xffffffff, 0xfff9ffda9c16b090, 44) = 0x7fda9c16b0c0
lua_pushnil(0x7fda9c61f508, -1, 69, -13) = 0x7fda9c0e2b80
lua_pushlstring(0x7fda9c61f508, 0x54cf9b, 6, -13) = 0xfffdffdaa9302ff0
lua_gettop(0x7fda9c61f508, 0x7fdaa930300e, 0x7fda9c0e2b68, 0x7fda9c0e2c10) = 1
lua_getexdata(0x7fda9c61f508, 0x7fdaa930300e, 0x7fda9c0e2b68, 0x7fda9c0e2c10) = 0x91a650
luaL_checktype(0x7fda9c61f508, 1, 5, 0x7fda9c0e2c10) = 5
lua_rawgeti(0x7fda9c61f508, 1, 1, 44) = 0x7fda9c0e2b78
lua_touserdata(0x7fda9c61f508, 0xffffffff, 0xfff9ffda9c16b090, 44) = 0x7fda9c16b0c0
lua_pushnil(0x7fda9c61f508, -1, 69, -13) = 0x7fda9c0e2b80
lua_pushlstring(0x7fda9c61f508, 0x54cf9b, 6, -13) = 0xfffdffdaa9302ff0
...
```
Expected Behavior
No response
Error Logs
No response
Steps to Reproduce
1. APISIX was compiled from source and deployed on CentOS.
2. Not stably reproducible: after APISIX runs for about two months, one nginx child process reaches 100% CPU usage.
Environment
- APISIX version (run `apisix version`): 2.9
- Operating system (run `uname -a`): Linux test-181-102-abjc 3.10.0-514.10.2.el7.x86_64 #1 SMP Fri Mar 3 00:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
- OpenResty / Nginx version (run `openresty -V` or `nginx -V`): nginx version: openresty/1.19.9.1
- etcd version, if relevant (run `curl http://127.0.0.1:9090/v1/server_info`): 3.4.0
- APISIX Dashboard version, if relevant: 2.7
- Plugin runner version, for issues related to plugin runners:
- LuaRocks version, for installation issues (run `luarocks --version`): /usr/local/bin/luarocks 3.4.0
It looks like the `rawget` caused a dead loop.
Can you provide more information?
You can use lbt to get the Lua stack: https://github.com/spacewander/openresty-gdb-utils/#lbt
Hi, spacewander.
After I type `lbt`, it returns "No global L located (tried globalL and ngx_cycle)".
The details:
```
[root@ricky apisix-debug]# ps aux|grep nginx
root 1748 0.0 0.0 27660 1584 ? Ss 14:14 0:00 nginx: master process openresty -p /opt/apisix-debug
nobody 1749 0.0 0.0 28108 2336 ? S 14:14 0:00 nginx: worker process
root 1751 0.0 0.0 112812 980 pts/0 R+ 14:14 0:00 grep --color=auto nginx
[root@ricky apisix-debug]# gdb -p 1749
GNU gdb (GDB) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 1749
Reading symbols from /usr/local/openresty/nginx/sbin/nginx...(no debugging symbols found)...done.
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done.
Reading symbols from /usr/local/openresty/luajit/lib/libluajit-5.1.so.2...(no debugging symbols found)...done.
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Reading symbols from /usr/local/openresty/pcre/lib/libpcre.so.1...(no debugging symbols found)...done.
Reading symbols from /usr/local/openresty/openssl111/lib/libssl.so.1.1...(no debugging symbols found)...done.
Reading symbols from /usr/local/openresty/openssl111/lib/libcrypto.so.1.1...(no debugging symbols found)...done.
Reading symbols from /usr/local/openresty/zlib/lib/libz.so.1...(no debugging symbols found)...done.
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Reading symbols from /lib64/libfreebl3.so...Reading symbols from .gnu_debugdata for /lib64/libfreebl3.so...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
0x00007f143fac20c3 in __epoll_wait_nocancel () from /lib64/libc.so.6
(gdb) source luajit21.py
(gdb) lbt
No global L located (tried globalL and ngx_cycle)
```
What should I do?
I added this `~/.gdbinit` configuration:
```
directory /opt/apisix-debug/openresty-gdb-utils-lbt_gc64
py import sys
py sys.path.append("/opt/apisix-debug/openresty-gdb-utils-lbt_gc64")
source luajit20.gdb
source ngx-lua.gdb
source luajit21.py
source ngx-raw-req.py
set python print-stack full
```
I tried installing openresty-debuginfo-1.19.9.1-1.el7, but `lbt` returns nothing:
```
(gdb) source luajit21.py
(gdb) lbt
(gdb)
(gdb)
```
What does `bt` output?
```
(gdb) bt
#0 0x00007fc2124b70c3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x000000000044ce08 in ngx_epoll_process_events (cycle=0x20acee0, timer=100, flags=1) at src/event/modules/ngx_epoll_module.c:800
#2 0x0000000000443c65 in ngx_process_events_and_timers (cycle=0x20acee0) at src/event/ngx_event.c:257
#3 0x000000000044b192 in ngx_worker_process_cycle (cycle=cycle@entry=0x20acee0, data=data@entry=0x1) at src/os/unix/ngx_process_cycle.c:782
#4 0x0000000000449977 in ngx_spawn_process (cycle=0x20acee0, proc=0x44b120 <ngx_worker_process_cycle>, data=0x1, name=0x550323 "worker process", respawn=-3) at src/os/unix/ngx_process.c:199
#5 0x000000000044a6b4 in ngx_start_worker_processes (cycle=cycle@entry=0x20acee0, n=8, type=type@entry=-3) at src/os/unix/ngx_process_cycle.c:382
#6 0x000000000044bb4c in ngx_master_process_cycle (cycle=0x20acee0) at src/os/unix/ngx_process_cycle.c:135
#7 0x0000000000423cb0 in main (argc=
```
You may need to generate a C-level on-CPU flamegraph. `__epoll_wait_nocancel` is a blocking syscall; a process can't have 100% CPU usage in user space while it is blocked on it.
Here is the CI workflow related to generating a flamegraph: https://github.com/apache/apisix/blob/master/.github/workflows/performance.yml
I want to make sure of one question: do you deploy any private code, such as custom plugins?
You also need to provide some configuration information. Since you are using APISIX 2.9, and we have fixed several 100% CPU usage bugs after 2.9, this information will help us determine whether it is an already-fixed issue.
> You may need to generate a C-level on-CPU flamegraph. `__epoll_wait_nocancel` is a blocking syscall; a process can't have 100% CPU usage in user space while it is blocked on it.
@spacewander the `bt` output above is just from a normal process. The child process with 100% CPU usage was killed because I needed to recover the production environment. I don't know what causes the 100% CPU, so I want to find some way to debug the exception when it happens again.
> I want to make sure of one question: do you deploy any private code, such as custom plugins?
@tzssangglass yes, I have a custom plugin. The plugin uses Redis `subscribe` and `ngx.shared`. The code can be seen below:
```lua
local redis_new = require("resty.redis").new
local core = require("apisix.core")
local setmetatable = setmetatable
local tostring = tostring

local _M = {version = 0.1}
local mt = {
    __index = _M
}

function _M.new(plugin_name, conf)
    local self = {conf = conf, plugin_name = plugin_name}
    return setmetatable(self, mt)
end

-- redis conf
local function red_conn(conf)
    local red = redis_new()
    local timeout = conf.redis_timeout or 1000
    red:set_timeouts(timeout, timeout, timeout)

    local ok, err = red:connect(conf.redis_host, conf.redis_port or 6379) -- redis conn
    if not ok then
        return false, err
    end

    local count
    count, err = red:get_reused_times()
    if 0 == count then
        if conf.redis_password and conf.redis_password ~= '' then
            local ok, err = red:auth(conf.redis_password) -- redis auth
            if not ok then
                return nil, err
            end
        end
    elseif err then
        return nil, err
    end

    return red, false
end

function _M.token(self, key)
    local auth_token = ngx.shared["auth_token"]
    if auth_token == ngx.null then
        return false, 'index auth_token error.'
    end

    if auth_token:get(key) then
        return true, nil -- token valid, return true
    end

    local conf = self.conf
    local red, err = red_conn(conf) -- redis conn
    if err then
        return false, err
    end
    core.log.info("***connect to redis ***")

    local ok, err = red:select(conf.redis_db)
    if not ok then
        return false, err
    end

    local ret, err = red:exists(key)
    if not ret then
        return false, err
    end

    local ttl, err = red:ttl(key)
    if ttl == -1 then
        ttl = 604800
    end

    ok, err = red:set_keepalive(10000, 100) -- keep redis conn
    if not ok then
        return false, err
    end

    if ret == 0 then
        return false, "rejected" -- token does not exist, return reject
    end

    auth_token:set(key, '', ttl) -- set shared memory
    return true, ret -- token valid, return true
end

function _M.subscribe(self, func)
    local co = coroutine.create(function()
        local conf = self.conf
        local red, err = red_conn(conf)
        if err then
            core.log.error("failed to conn redis: ", err)
        end

        local ok, err = red:subscribe('__keyevent@' .. conf.redis_db .. '__:del') -- subscribe to the delete command
        if not ok then
            core.log.error("failed to create subscribe: ", err)
        end
        red:subscribe('__keyevent@' .. conf.redis_db .. '__:expired') -- subscribe to the expired command

        while true do
            local ret, err = red:read_reply()
            if err then
                core.log.info("failed to read_reply at subscribe: ", err)
            else
                func(ret) -- update shared memory
            end
            red:set_keepalive(100, 100) -- keep redis conn
        end
    end)
    coroutine.resume(co)
end

return _M
```
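One property of the `subscribe` loop above is worth flagging (an observation, not a confirmed diagnosis): if the subscribed Redis connection is ever closed (server restart, idle timeout, network failure), `red:read_reply()` returns `nil, "closed"` immediately on every call, and the error branch only logs at info level before retrying at once, so the `while true` loop could spin at full speed. The repeating `lua_pushnil` plus `lua_pushlstring(..., 6, ...)` pair in the ltrace output above is at least consistent with a C function returning `nil` plus a 6-byte string such as `"closed"` in a tight loop. Below is a minimal sketch of a guarded variant; `subscribe_loop` is a hypothetical helper, not code from this thread:

```lua
-- Minimal sketch (hypothetical): a subscriber loop that cannot busy-spin.
-- A non-timeout error (e.g. "closed") ends the loop instead of retrying
-- immediately; "timeout" only means no message arrived within the read
-- timeout, so retrying read_reply() is safe in that case.
local function subscribe_loop(red, func)
    while true do
        local ret, err = red:read_reply()
        if not ret then
            if err ~= "timeout" then
                return nil, err -- hard error: stop instead of spinning
            end
            -- timeout: fall through and call read_reply() again
        else
            func(ret) -- update the shared dict, as in the plugin above
        end
    end
end
```

The caller would then decide whether to reconnect (ideally after an `ngx.sleep` back-off) when `subscribe_loop` returns an error.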
I added this `config.yaml` configuration:
```yaml
...
lua_shared_dicts:        # add custom shared cache to nginx.conf
  auth_token: 10m
...
redis:
  redis_host: "xx"
  redis_port: 6379
  redis_password: "xx"
  redis_db: 1
  redis_timeout: 3000
...
```
If you can generate a flamegraph at 100% CPU, the problem will be obvious.
I captured perf.data using perf at 100% CPU; maybe it can provide some helpful information?
```
perf record -g -p 21515 -- sleep 30
```
and then,
```
git clone https://github.com/brendangregg/FlameGraph
mv perf.data ./FlameGraph
cd FlameGraph    # the remaining steps run inside the FlameGraph checkout
perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > flame.svg
```
Here is the flamegraph:
This is a useless flamegraph without any valid stack information. Perhaps you are using perf incorrectly, or you are running APISIX and apisix-base with missing debug information (the official releases of APISIX and apisix-base do have debug information).
This issue has been marked as stale due to 350 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@apisix.apache.org list. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.