g2z icon indicating copy to clipboard operation
g2z copied to clipboard

net.Dial doesn't work in zabbix_agentd

Open aingvarf opened this issue 8 years ago • 17 comments

When run zabbix_agentd -t key all is working fine. When try zabbix_get with the same key the library hangs on net.Dial.

 g2z.LogInfof("Before dial")
 conn, err := net.DialTimeout("tcp", servAddr, time.Duration(2*time.Second) )
 g2z.LogInfof("After dial")
 if err != nil {
        g2z.LogInfof("Dial failed")
        return "", errors.New("Dial failed")
 }
 g2z.LogInfof("Dial ok")

Log file have "Before dial" and never "After dial" and "Dial ok". Timeout in Dial also doesn't work.

aingvarf avatar Oct 08 '15 10:10 aingvarf

What operating system and version of Go and Zabbix are you using?

cavaliercoder avatar Oct 10 '15 09:10 cavaliercoder

I can reproduce this issue on Go 1.5.1, Zabbix 2.2.10 on CentOS 7.1. I'll advise what I find. Might be an issue with go routines firing up in the net API.

cavaliercoder avatar Oct 11 '15 07:10 cavaliercoder

System call trace:

[pid 12184] select(6, [4 5], NULL, NULL, NULL) = 1 (in [4])
[pid 12184] accept(4, {sa_family=AF_INET, sin_port=htons(47746), sin_addr=inet_addr("127.0.0.1")}, [16]) = 6
[pid 12184] getpeername(6, {sa_family=AF_INET, sin_port=htons(47746), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
[pid 12184] alarm(3)                    = 0
[pid 12184] read(6, "ZBXD\1", 5)        = 5
[pid 12184] read(6, "\7\0\0\0\0\0\0\0", 8) = 8
[pid 12184] read(6, "broken\n", 2047)   = 7
[pid 12184] alarm(0)                    = 3
[pid 12184] rt_sigprocmask(SIG_SETMASK, NULL, [], 8) = 0
[pid 12184] sigaltstack({ss_sp=0xc820020000, ss_flags=0, ss_size=32672}, NULL) = 0
[pid 12184] gettid()                    = 12184
[pid 12184] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 12184] semop(655360, {{0, -1, SEM_UNDO}}, 1) = 0
[pid 12184] open("/var/log/zabbix/zabbix_agentd.log", O_RDWR|O_CREAT|O_APPEND, 0666) = 7
[pid 12184] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=453, ...}) = 0
[pid 12184] fstat(7, {st_mode=S_IFREG|0664, st_size=6699, ...}) = 0
[pid 12184] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f92d61c5000
[pid 12184] write(7, " 12184:20151011:161133.055 Befor"..., 39) = 39
[pid 12184] close(7)                    = 0
[pid 12184] munmap(0x7f92d61c5000, 4096) = 0
[pid 12184] semop(655360, {{0, 1, SEM_UNDO}}, 1) = 0
[pid 12184] socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 7
[pid 12184] setsockopt(7, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
[pid 12184] connect(7, {sa_family=AF_INET, sin_port=htons(10050), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 12184] epoll_create1(EPOLL_CLOEXEC) = 8
[pid 12184] epoll_ctl(8, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3491763008, u64=140268533715776}}) = 0
[pid 12184] futex(0x7f92d1064390, FUTEX_WAKE, 1) = 0
[pid 12184] futex(0xc82001e110, FUTEX_WAIT, 0, NULL

cavaliercoder avatar Oct 11 '15 08:10 cavaliercoder

I have similar issue with

http.Post()

Standalone exe is OK, but shared library hangs on this function - probably related to net package. Environment: Ubuntu 14, Go 1.5.1, Zabbix 2.4.6.

Strace:

futex(0x7f618b2c1cd0, FUTEX_WAKE, 1)    = 0
futex(0xc820022110, FUTEX_WAIT, 0, NULL

Go stacktrace 1 from zabbix.log, when I send SIGQUIT signal:

 72783:20151020:192755.326 collector [idle 1 sec]
SIGQUIT: quit
PC=0x7fee9244505a m=1

goroutine 0 [idle]:
runtime.rtsigprocmask(0xc800000002, 0xc82002adc8, 0x0)
    /usr/local/go/src/runtime/sys_linux_amd64.s:206 +0x1a
runtime.updatesigmask(0x0, 0x100000000)
    /usr/local/go/src/runtime/os1_linux.go:343 +0xf3
runtime.crash()
    /usr/local/go/src/runtime/signal1_unix.go:193 +0x33
runtime.sighandler(0xc800000003, 0xc82002b080, 0xc82002af50, 0xc820000300)
    /usr/local/go/src/runtime/signal_amd64x.go:200 +0x6b2
runtime.sigtrampgo(0x3, 0xc82002b080, 0xc82002af50)
    /usr/local/go/src/runtime/signal_linux.go:94 +0x97
runtime.sigtramp(0x1, 0x0, 0xc820024000, 0x1, 0x7fa0, 0xc82002b240, 0xcccccccccccccccd, 0x8, 0x246, 0x7fee92789d60, ...)
    /usr/local/go/src/runtime/sys_linux_amd64.s:234 +0x1b
runtime.sigreturn(0x0, 0xc820024000, 0x1, 0x7fa0, 0xc82002b240, 0xcccccccccccccccd, 0x8, 0x246, 0x7fee92789d60, 0x7fee92816472, ...)
    /usr/local/go/src/runtime/sys_linux_amd64.s:238
rax    0x0
rbx    0xc82002adc8
rcx    0xffffffffffffffff
rdx    0x0
rdi    0x2
rsi    0xc82002adc8
rbp    0x3
rsp    0xc82002ada0
r8     0xc82002ae50
r9     0xcccccccccccccccd
r10    0x8
r11    0x246
r12    0x7fee92789d60
r13    0x7fee92816472
r14    0x9
r15    0x8
rip    0x7fee9244505a
rflags 0x246
cs     0x33
fs     0x0
gs     0x0

-----

 72785:20151020:192756.138 In send_buffer() host:'127.0.0.1' port:10051 values:0/100

Go stacktrace 2:

SIGQUIT: quit
PC=0x7f0cad5c9081 m=1

goroutine 0 [idle]:
runtime.futex(0xc820022110, 0x0, 0x0, 0x0, 0x7f0c00000000, 0x7f0cad577879, 0x0, 0x0, 0x7f0cad577ae8, 0xc820022110, ...)
        /usr/local/go/src/runtime/sys_linux_amd64.s:288 +0x21
runtime.futexsleep(0xc820022110, 0xc800000000, 0xffffffffffffffff)
        /usr/local/go/src/runtime/os1_linux.go:39 +0x53
runtime.notesleep(0xc820022110)
        /usr/local/go/src/runtime/lock_futex.go:142 +0xa8
runtime.stoplockedm()
        /usr/local/go/src/runtime/proc1.go:1268 +0xb2
runtime.schedule()
        /usr/local/go/src/runtime/proc1.go:1590 +0x72
runtime.park_m(0xc820000600)
        /usr/local/go/src/runtime/proc1.go:1698 +0x191
runtime.mcall(0x7f0cad5c722a)
        /usr/local/go/src/runtime/asm_amd64.s:204 +0x53

goroutine 17 [IO wait, locked to thread]:
runtime.gopark(0x7f0cada4a878, 0x7f0ca594b198, 0x7f0cad993b20, 0x7, 0x1b, 0x5)
        /usr/local/go/src/runtime/proc.go:185 +0x169 fp=0xc820041868 sp=0xc820041840
runtime.netpollblock(0x7f0ca594b170, 0x72, 0x7f0cac21e050)
        /usr/local/go/src/runtime/netpoll.go:338 +0x160 fp=0xc8200418b0 sp=0xc820041868
net.runtime_pollWait(0x7f0ca594b170, 0x72, 0xc82000a1c0)
        /usr/local/go/src/runtime/netpoll.go:157 +0x62 fp=0xc8200418d0 sp=0xc8200418b0
net.(*pollDesc).Wait(0xc82011c140, 0x72, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3c fp=0xc8200418f0 sp=0xc8200418d0
net.(*pollDesc).WaitRead(0xc82011c140, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:78 +0x38 fp=0xc820041918 sp=0xc8200418f0
net.(*netFD).accept(0xc82011c0e0, 0x0, 0x7f0ca594b268, 0xc8200b4440)
        /usr/local/go/src/net/fd_unix.go:408 +0x27e fp=0xc820041a18 sp=0xc820041918
net.(*TCPListener).AcceptTCP(0xc8200b8050, 0xc820041ab0, 0x0, 0x0)
        /usr/local/go/src/net/tcpsock_posix.go:254 +0x4f fp=0xc820041a78 sp=0xc820041a18
net/http.tcpKeepAliveListener.Accept(0xc8200b8050, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:2135 +0x43 fp=0xc820041ab0 sp=0xc820041a78
net/http.(*Server).Serve(0xc8200b20c0, 0x7f0ca594b230, 0xc8200b8050, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1887 +0xb5 fp=0xc820041b98 sp=0xc820041ab0
net/http.(*Server).ListenAndServe(0xc8200b20c0, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1877 +0x138 fp=0xc820041c28 sp=0xc820041b98
net/http.ListenAndServe(0x7f0cad9a6ae0, 0xe, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1967 +0x91 fp=0xc820041c50 sp=0xc820041c28
main.Code(0xc8201122a0, 0x0, 0x0, 0x0, 0x0)
        /opt/gopath/src/github.com/jangaraj/zabbix-module/zabbix_module.go:152 +0x94 fp=0xc820041d58 sp=0xc820041c50
github.com/cavaliercoder/g2z.route_item(0x7ffe7bc803e0, 0x7ffe7bc80c90, 0x0)
        /opt/gopath/src/github.com/cavaliercoder/g2z/router.go:65 +0xc2a fp=0xc820041ed0 sp=0xc820041d58
runtime.call32(0x0, 0x7ffe7bc802d8, 0x7ffe7bc80360, 0x18)
        /usr/local/go/src/runtime/asm_amd64.s:437 +0x40 fp=0xc820041ef8 sp=0xc820041ed0
runtime.cgocallbackg1()
        /usr/local/go/src/runtime/cgocall.go:252 +0x110 fp=0xc820041f30 sp=0xc820041ef8
runtime.cgocallbackg()
        /usr/local/go/src/runtime/cgocall.go:177 +0xd9 fp=0xc820041f90 sp=0xc820041f30
runtime.cgocallback_gofunc(0x0, 0x0, 0x0)
        /usr/local/go/src/runtime/asm_amd64.s:801 +0x5d fp=0xc820041fa0 sp=0xc820041f90
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1 fp=0xc820041fa8 sp=0xc820041fa0

goroutine 18 [syscall, locked to thread]:
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1

goroutine 5 [chan receive]:
github.com/golang/glog.(*loggingT).flushDaemon(0x7f0cadd99bc0)
        /opt/gopath/src/github.com/golang/glog/glog.go:882 +0x69
created by github.com/golang/glog.init.1
        /opt/gopath/src/github.com/golang/glog/glog.go:410 +0x299

rax    0xca
rbx    0x0
rcx    0xffffffffffffffff
rdx    0x0
rdi    0xc820022110
rsi    0x0
rbp    0x1
rsp    0x7ffe7bc801a8
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x4
r13    0x7f0cada47090
r14    0xd
r15    0x8
rip    0x7f0cad5c9081
rflags 0x286
cs     0x33
fs     0x0
gs     0x0

Reference: http://stackoverflow.com/questions/33248427/how-to-debug-c-shared-library-written-in-go-it-hangs-on-net-call

jangaraj avatar Oct 19 '15 23:10 jangaraj

Thanks Jan, that's really helpful. I'll dig right into this one ASAP.

cavaliercoder avatar Oct 21 '15 13:10 cavaliercoder

Probably problem is cgo. Net package doesn't have full native go implementation. That's also reason why is .so dynamically linked:

    linux-vdso.so.1 =>  (0x00007fff7d9fd000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f30ab42b000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f30ab066000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f30ac1ad000)

Still no success.

jangaraj avatar Oct 25 '15 22:10 jangaraj

Same issue occurs in https://github.com/spektroskop/zabbix-socket/. I haven't had time to debug yet, but I'm leaning towards an implementation conflict between Zabbix Agent Listener threads and Go routines (used by the net.Dial function). Zabbix no longer use pthreads and it seems unlikely that a fork() could cause issues.... but net.Dial() seems to work fine in a single thread context (i.e. zabbix_agentd -t ...)

cavaliercoder avatar Oct 29 '15 14:10 cavaliercoder

I think I have the same issue with my Zabbix module for MongoDB based on g2z (https://github.com/m-barthelemy/zabbix-mongo). It hangs when connecting to Mongo (mgo.Dial()) I confirm that zabbix_agentd -t always work.

On my local dev VM and on a small cloud instance, the module seems to work perfectly... 9 out of 10 times. Occasionally the first connection attempt hangs, but if I restart the Zabbix agent it will work again. Once the first connection works, it looks like all the subsequent ones also do. These 2 VMs only have 1 CPU/1 core. On another instance with 2 cores it always hangs. Env : Ubuntu 14.04, Zabbix Agent 3.0.2, Go 1.6, g2z/v3.

Is there any dirty or temporary workaround?

m-barthelemy avatar May 04 '16 14:05 m-barthelemy

Thanks for posting those details. Unfortunately I haven't been able to find a solution or even a workaround.

The best I can offer is that it looks like a dead locking issue somewhere in the Go runtime's goroutines when it is loaded as a shared library. Behind the scenes, all of the Go stdlib's socket based APIs (such as net.Dial use goroutines.

Someone recently validated this on @jangaraj's stack overflow post: http://stackoverflow.com/a/36898305/5809680.

This will likely need to be fleshed out and escalated to the Go dev team for resolution, or Zabbix to change the way modules are loaded to be post-fork.

cavaliercoder avatar May 04 '16 14:05 cavaliercoder

I've reproduced this issue outside of Zabbix and raised issue #15538 with the Go team. This might be a more general C/Linux issue however, due to the way memory is mapped when a process is forked.

I may need to suggest to the Zabbix team to load agent modules after the worker processes are forked to fix this issue.

cavaliercoder avatar May 04 '16 16:05 cavaliercoder

This code reproduces the issue: https://gist.github.com/cavaliercoder/688a3cd7dac20c8edb0c0f6f2851b54d

cavaliercoder avatar May 04 '16 16:05 cavaliercoder

Thanks a lot for your investigations. Looking at the answer from the Go team, it looks like the only option would be to convince the Zabbix team to change the way they load external modules.

From what I understand, any Zabbix plugin using threads would have the same issue right? If so, that is be very restrictive.

m-barthelemy avatar May 04 '16 18:05 m-barthelemy

Raised with Zabbix: https://support.zabbix.com/browse/ZBX-10751

cavaliercoder avatar May 05 '16 09:05 cavaliercoder

I'm struggling to find a way forward with this issue. I think IPC will be the only way, in which case a separate project may be in order to start, watch and communicate with processes written in any language.

Here's what I've discovered so far:

  • The Zabbix agent calls load_modules which dlopen()'s a g2z lib
  • dlopen calls any constructors in the shared library, which in the case of a Go library, calls the _init symbol which bootstraps the Go runtime, spawns one thread per CPU thread and calls the init Go function in g2z to register items
  • load_modules calls zbx_module_item_list and maps the item keys to the address of the handler funtions in the main thread
  • The agent then forks the collector and active check threads. The OS copies most memory mappings with Copy-On-Write but does not copy the Go runtime threads (as it should not)
  • When the item check functions written in g2z are called, they try to make use of threads which do not exist in the child process

Problems:

  • this is an edge case, not likely to be addressed by the Go team
  • this is an edge case, not likely to be addressed by the Zabbix team
  • any code in g2z to reload and remap the lib after fork would be too hacky and unsupportable

cavaliercoder avatar May 10 '16 14:05 cavaliercoder

This is a note for me to experiment with GOMAXPROCS = 1. Theory being:

  • each forked Go runtime is constrained to a single thread
  • we still benefit from multiple threads after the agent forks
  • we lose optimisations such as Goroutine work-stealing
  • we still benefit from concurrency in each discrete child PID via goroutines
  • behavior is still undefined and subject to future breakage but might work reliably for now

@m-barthelemy notes above that g2z behaves okay on single core machines (where Go will run on a single OS thread).

cavaliercoder avatar Aug 22 '17 15:08 cavaliercoder

I tested this with GOMAXPROCS=1... still hangs on Net.Dial at: futex(0xc420024110, FUTEX_WAIT, 0, NULL

cavaliercoder avatar Dec 17 '17 04:12 cavaliercoder

Yes, it has always been working fine on single core. And I confirm that setting GOMAXPROCS doesn't help at all.

It's been a while, but IIRC the way the Zabbix agent does its forking prevents users from making plugins in any language using threads by default (or using any library that does so), right? I found their refusal to change things disappointing and I've completely stopped working on that.

m-barthelemy avatar Dec 22 '17 01:12 m-barthelemy