skynet
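Hot-reloading Lua code deadlocks a skynet worker thread
On a production game server, players reported lag this morning. When I checked the server, CPU usage and load were both very low. The logs showed a large number of "endless" messages for one service, and that service produced no other log output (the other services were normal). Given this, I ran "pstack PID" to inspect the threads of the skynet process and found that Thread 2 was abnormal: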
Thread 2 (Thread 0x7f19cf5f7700 (LWP 27881)):
0 0x00007f19dad6989c in __lll_lock_wait_private () from /lib64/libc.so.6
1 0x00007f19dacdb45d in _L_lock_121 () from /lib64/libc.so.6
2 0x00007f19dacd9023 in __GI__IO_un_link () from /lib64/libc.so.6
3 0x00007f19dacd7fe8 in __GI__IO_file_close_it () from /lib64/libc.so.6
4 0x00007f19dacd55a9 in freopen64 () from /lib64/libc.so.6
5 0x00000000004263b9 in luaL_loadfilex_ (L=L@entry=0x7f189487af88, filename=filename@entry=0x7f18ace7bac0 "scripts/apis/army.lua", mode=mode@entry=0x7f19bb5d5be0 "bt") at lauxlib.c:794
6 0x00000000004271fa in luaL_loadfilex (L=L@entry=0x7f189487af88, filename=filename@entry=0x7f18ace7bac0 "scripts/apis/army.lua", mode=mode@entry=0x7f19bb5d5be0 "bt") at lauxlib.c:1232
7 0x000000000042b3d6 in luaB_loadfile (L=0x7f189487af88) at lbaselib.c:322
8 0x0000000000415763 in luaD_precall (L=L@entry=0x7f189487af88, func=func@entry=0x7f17a5598360, nresults=2) at ldo.c:532
9 0x0000000000422576 in luaV_execute (L=L@entry=0x7f189487af88, ci=<optimized out>) at lvm.c:1626
10 0x00000000004158e0 in ccall (L=L@entry=0x7f189487af88, func=<optimized out>, nResults=nResults@entry=-1, inc=inc@entry=1) at ldo.c:577
11 0x00000000004159ca in luaD_call (L=L@entry=0x7f189487af88, func=<optimized out>, nResults=nResults@entry=-1) at ldo.c:587
12 0x0000000000412a40 in lua_pcallk (L=L@entry=0x7f189487af88, nargs=nargs@entry=1, nresults=nresults@entry=-1, errfunc=errfunc@entry=2, ctx=ctx@entry=2, k=k@entry=0x42afd0 <finishpcall>) at lapi.c:1071
13 0x000000000042b07f in luaB_xpcall (L=0x7f189487af88) at lbaselib.c:473
14 0x0000000000415763 in luaD_precall (L=L@entry=0x7f189487af88, func=func@entry=0x7f17a55981b0, nresults=3) at ldo.c:532
15 0x0000000000422576 in luaV_execute (L=L@entry=0x7f189487af88, ci=<optimized out>, ci@entry=0x7f17dc692500) at lvm.c:1626
16 0x0000000000415423 in unroll (L=0x7f189487af88, ud=<optimized out>) at ldo.c:685
17 0x0000000000414b9a in luaD_rawrunprotected (L=L@entry=0x7f189487af88, f=f@entry=0x415910 <resume>, ud=ud@entry=0x7f19cf5f4f7c) at ldo.c:144
18 0x0000000000415a54 in lua_resume (L=L@entry=0x7f189487af88, from=from@entry=0x7f19a53d4a08, nargs=<optimized out>, nargs@entry=5, nresults=nresults@entry=0x7f19cf5f4fbc) at ldo.c:788
19 0x00007f19d8ffd05d in lua_resumeX (nresults=0x7f19cf5f4fbc, nargs=5, from=0x7f19a53d4a08, L=0x7f189487af88) at service-src/service_snlua.c:90
20 auxresume (narg=5, co=0x7f189487af88, L=0x7f19a53d4a08) at service-src/service_snlua.c:146
21 timing_resume (L=L@entry=0x7f19a53d4a08, co_index=co_index@entry=1, n=5) at service-src/service_snlua.c:198
22 0x00007f19d8ffd530 in luaB_coresume (L=0x7f19a53d4a08) at service-src/service_snlua.c:217
23 0x0000000000415763 in luaD_precall (L=L@entry=0x7f19a53d4a08, func=func@entry=0x7f17d8bf8410, nresults=nresults@entry=-1) at ldo.c:532
24 0x000000000042227b in luaV_execute (L=L@entry=0x7f19a53d4a08, ci=<optimized out>) at lvm.c:1656
25 0x00000000004158e0 in ccall (L=0x7f19a53d4a08, func=<optimized out>, nResults=<optimized out>, inc=65537) at ldo.c:577
26 0x0000000000414b9a in luaD_rawrunprotected (L=L@entry=0x7f19a53d4a08, f=f@entry=0x411370 <f_call>, ud=ud@entry=0x7f19cf5f52c0) at ldo.c:144
27 0x0000000000415c5e in luaD_pcall (L=L@entry=0x7f19a53d4a08, func=func@entry=0x411370 <f_call>, u=u@entry=0x7f19cf5f52c0, old_top=192, ef=<optimized out>) at ldo.c:892
28 0x00000000004129c7 in lua_pcallk (L=L@entry=0x7f19a53d4a08, nargs=<optimized out>, nresults=nresults@entry=-1, errfunc=errfunc@entry=0, ctx=ctx@entry=0, k=k@entry=0x42afd0 <finishpcall>) at lapi.c:1059
29 0x000000000042b0f0 in luaB_pcall (L=0x7f19a53d4a08) at lbaselib.c:456
30 0x0000000000415763 in luaD_precall (L=L@entry=0x7f19a53d4a08, func=func@entry=0x7f17d8bf82a0, nresults=2) at ldo.c:532
31 0x0000000000422576 in luaV_execute (L=L@entry=0x7f19a53d4a08, ci=<optimized out>) at lvm.c:1626
32 0x00000000004158e0 in ccall (L=0x7f19a53d4a08, func=<optimized out>, nResults=<optimized out>, inc=65537) at ldo.c:577
33 0x0000000000414b9a in luaD_rawrunprotected (L=L@entry=0x7f19a53d4a08, f=f@entry=0x411370 <f_call>, ud=ud@entry=0x7f19cf5f5590) at ldo.c:144
34 0x0000000000415c5e in luaD_pcall (L=L@entry=0x7f19a53d4a08, func=func@entry=0x411370 <f_call>, u=u@entry=0x7f19cf5f5590, old_top=48, ef=<optimized out>) at ldo.c:892
35 0x00000000004129c7 in lua_pcallk (L=L@entry=0x7f19a53d4a08, nargs=nargs@entry=5, nresults=nresults@entry=0, errfunc=errfunc@entry=1, ctx=ctx@entry=0, k=k@entry=0x0) at lapi.c:1059
36 0x00007f19ce6115b8 in _cb (context=0x7f19b6d38580, ud=0x7f19a53d4a08, type=20, session=2481, source=136, msg=0x7f19b8021900, sz=241) at lualib-src/lua-skynet.c:75
37 0x00000000004095d6 in dispatch_message (ctx=ctx@entry=0x7f19b6d38580, msg=msg@entry=0x7f19cf5f5650) at skynet-src/skynet_server.c:276
38 0x000000000040a1ac in skynet_context_message_dispatch (sm=sm@entry=0x7f19da80b300, q=0x7f19290a86c0, weight=weight@entry=1) at skynet-src/skynet_server.c:336
39 0x000000000040a95e in thread_worker (p=<optimized out>) at skynet-src/skynet_start.c:163
40 0x00007f19db955e25 in start_thread () from /lib64/libpthread.so.0
41 0x00007f19dad5bbad in clone () from /lib64/libc.so.6
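Judging from the thread stack, the thread appears to be deadlocked inside its call to luaL_loadfilex_. It is hard to reproduce. The skynet version is 1.5 and the Lua version is 5.4.3.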
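I don't think this is a skynet problem.
The deadlock happens inside freopen64. You can find reports of freopen and fclose deadlocking by searching Google (for example: https://www.cygwin.com/bugzilla/show_bug.cgi?id=24963 ). You could try upgrading the C runtime (crt) to see whether it contains a fix for such a bug. Also check whether the number of files the process has open exceeds the limit.
In addition, freopen is only called for binary files. I think you can avoid using binary (precompiled) sources, or modify the code to open the source file directly in binary mode instead of going through freopen.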
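As a rough sketch of the "open directly in binary mode" idea: the file can be read with a single fopen in "rb" mode and the resulting buffer handed to luaL_loadbufferx, so that freopen is never reached. The helper names read_file_binary and looks_like_binary_chunk are hypothetical and not part of Lua or skynet. Note also that, unlike luaL_loadfilex, this sketch does not skip a leading "#" comment line.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read a whole file into a malloc'ed buffer using one
 * fopen in binary mode, so no freopen is involved. Returns NULL on failure.
 * On success *out_size holds the byte count and the caller frees the buffer.
 * The buffer could then be passed to luaL_loadbufferx(L, buf, size, name, mode). */
char *read_file_binary(const char *path, size_t *out_size) {
    FILE *f = fopen(path, "rb");
    if (f == NULL) return NULL;

    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL) { fclose(f); return NULL; }

    for (;;) {
        size_t n = fread(buf + len, 1, cap - len, f);
        len += n;
        if (n == 0) break;              /* end of file or read error */
        if (len == cap) {               /* buffer full: double it */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) { free(buf); fclose(f); return NULL; }
            buf = tmp;
            cap *= 2;
        }
    }

    int failed = ferror(f);
    fclose(f);
    if (failed) { free(buf); return NULL; }

    *out_size = len;
    return buf;
}

/* A precompiled Lua chunk begins with the ESC byte (0x1B), the first
 * character of LUA_SIGNATURE. */
int looks_like_binary_chunk(const char *buf, size_t size) {
    return size > 0 && buf[0] == '\x1b';
}
```

Whether this fits depends on how the hot-reload code calls loadfile. It only illustrates that text and binary chunks can both be read through one binary-mode handle.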
Thank you very much!! It is probably not a file-count problem.
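One way to check the file-count hypothesis, sketched here under the assumption of a Linux host: compare the number of entries in /proc/self/fd with the RLIMIT_NOFILE soft limit. The function names are hypothetical. For an already running skynet process the same information can be read from outside through /proc/PID/fd and /proc/PID/limits.

```c
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <sys/resource.h>

/* Count this process's open file descriptors by listing /proc/self/fd
 * (Linux-specific). Returns -1 if the directory cannot be opened. */
long count_open_fds(void) {
    DIR *d = opendir("/proc/self/fd");
    if (d == NULL) return -1;

    long count = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] != '.') count++;   /* skip "." and ".." */
    }
    closedir(d);
    return count - 1;   /* do not count the descriptor opendir itself used */
}

/* Soft limit on open files for this process, or -1 on failure. */
long open_fd_soft_limit(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) return -1;
    if (rl.rlim_cur == RLIM_INFINITY) return LONG_MAX;
    return (long)rl.rlim_cur;
}
```

If the count stays far below the limit while hot updates run, descriptor exhaustion is unlikely to be the cause.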
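In agent mode, a hot update really does open a large number of files all at once. My approach is to first store the contents of the updated files in a lock-protected hashmap, and then notify the agents to hot-update by fetching the contents from the map.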
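A minimal sketch of such a lock-protected map, assuming a pthread mutex and a small chained hash table keyed by file name. All names here (cache_put, cache_get_copy, CACHE_BUCKETS) are hypothetical and this is not the poster's actual implementation. Each file is read once, stored with cache_put, and every agent then takes its own copy with cache_get_copy instead of opening the file.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_BUCKETS 64

struct cache_entry {
    char *name;                 /* file name used as the key */
    char *data;                 /* private copy of the file content */
    size_t size;
    struct cache_entry *next;   /* chaining for hash collisions */
};

static struct cache_entry *buckets[CACHE_BUCKETS];
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned hash_name(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33u + (unsigned char)*s++;
    return h % CACHE_BUCKETS;
}

/* Store (or replace) the content for `name`. Returns 0 on success, -1 on OOM. */
int cache_put(const char *name, const char *data, size_t size) {
    char *copy = malloc(size ? size : 1);
    if (copy == NULL) return -1;
    if (size) memcpy(copy, data, size);

    pthread_mutex_lock(&cache_lock);
    unsigned h = hash_name(name);
    struct cache_entry *e = buckets[h];
    while (e && strcmp(e->name, name) != 0) e = e->next;
    if (e) {                                /* replace existing content */
        free(e->data);
    } else {                                /* insert a new entry */
        e = malloc(sizeof *e);
        char *key = e ? malloc(strlen(name) + 1) : NULL;
        if (key == NULL) {
            pthread_mutex_unlock(&cache_lock);
            free(e);
            free(copy);
            return -1;
        }
        strcpy(key, name);
        e->name = key;
        e->next = buckets[h];
        buckets[h] = e;
    }
    e->data = copy;
    e->size = size;
    pthread_mutex_unlock(&cache_lock);
    return 0;
}

/* Return a malloc'ed copy of the content, or NULL if absent. Copying under
 * the lock means a later cache_put cannot free memory a reader still uses. */
char *cache_get_copy(const char *name, size_t *out_size) {
    char *out = NULL;
    pthread_mutex_lock(&cache_lock);
    struct cache_entry *e = buckets[hash_name(name)];
    while (e && strcmp(e->name, name) != 0) e = e->next;
    if (e && (out = malloc(e->size ? e->size : 1)) != NULL) {
        memcpy(out, e->data, e->size);
        *out_size = e->size;
    }
    pthread_mutex_unlock(&cache_lock);
    return out;
}
```

Copying on read keeps the locking simple at the cost of extra memory traffic. A reference-counted buffer would avoid the copies but needs more care.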
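We do not use agent mode here. We have only a few services and hot-update them synchronously, so the number of files open at the same time is never large.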