Segfault when using florida_provider

Open dzhou121 opened this issue 9 years ago • 13 comments

Hi,

Dynomite crashes intermittently when I use florida_provider. I'm running version dynomite-v0.5.7-13_RectifyConsistency from the master branch. The backtrace from the core dump is as follows:

(gdb) bt full
#0  _dn_strrchr (c=58 ':', start=0x7f24ec10f890 "",
    p=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>) at dyn_string.h:119
No locals.
#1  parse_seeds (seeds=seeds@entry=0x7f24f47f6db0, dc_name=dc_name@entry=0x7f24f47f6d70,
    rack_name=rack_name@entry=0x7f24f47f6d60, port_str=port_str@entry=0x7f24f47f6d80,
    address=address@entry=0x7f24f47f6d90, name=name@entry=0x7f24f47f6da0, ptoken=ptoken@entry=0x7f24f47f6dc0)
    at dyn_gossip.c:401
        status = <optimized out>
        p = 0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>
        start = 0x7f24ec10f890 ""
        pname = <optimized out>
        port = 0x0
        rack = <optimized out>
        dc = <optimized out>
        token = 0x1 <Address 0x1 out of bounds>
        addr = <optimized out>
        k = 1
        pnamelen = <optimized out>
        portlen = 0
        racklen = 0
        dclen = 0
        tokenlen = 3960535183
        addrlen = <optimized out>
        delim = "::::"
        t_end = <optimized out>
#2  0x000000000043249c in gossip_update_seeds (seeds=<optimized out>, seeds=<optimized out>, sp=0x1c370e0)
    at dyn_gossip.c:731
        p = 0x7f24ec0008bf ""
        q = <optimized out>
        dc_name = {len = 0, data = 0x0}
        ip = {len = 0, data = 0x0}
        token = {signum = 0, mag = 0x0, len = 0}
        temp = {len = 0, data = 0x7f24ec10f890 ""}
        rack_name = {len = 0, data = 0x0}
        port_str = {len = 0, data = 0x0}
        address = {len = 0, data = 0x0}
        start = 0x7f24ec0008c0 "172.31.11.240:8101:racka:eu-west-1:0|172.31.24.52:8101:rackb:eu-west-1:0"
        seed_node = 0x7f24ec0008c0 "172.31.11.240:8101:racka:eu-west-1:0|172.31.24.52:8101:rackb:eu-west-1:0"
        seed_node_len = 0
#3  gossip_loop (arg=0x1c370e0) at dyn_gossip.c:796
        sp = <optimized out>
        gossip_interval = 30000000
        __FUNCTION__ = "gossip_loop"
#4  0x00007f24f711cdc5 in start_thread (arg=0x7f24f47f7700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f24f47f7700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139796697544448, -6367975670704337661, 0,
                139796697545152, 139796697544448, 29585936, 6478866919655378179, 6478861264210398467},
              mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0,
              canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#5  0x00007f24f6b4828d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

dzhou121 avatar Sep 09 '16 01:09 dzhou121

@dzhou121 we use florida_provider as well. How often do you see this happening?

ipapapa avatar Sep 23 '16 17:09 ipapapa

@ipapapa Sometimes it crashes only around 10 minutes after restarting; sometimes it takes half an hour.

dzhou121 avatar Sep 24 '16 02:09 dzhou121

hello guys,

@dzhou121 did you sort this out somehow?

cyberjso avatar Aug 10 '17 17:08 cyberjso

Does this work on branch v0.5.9? There were some fixes around that code.

shailesh33 avatar Aug 10 '17 18:08 shailesh33

@shailesh33,

Did you apply this to 0.5.8?

Our main issue with 0.5.9 was https://github.com/Netflix/dynomite/issues/432. Has that been fixed?

diegopacheco avatar Aug 11 '17 03:08 diegopacheco

@shailesh33 @ipapapa

Just adding more information.

Basically, something goes wrong after the gossip update. The crash happens right after Dynomite calls parse_seeds (dyn_gossip.c:401), inside _dn_strrchr (dyn_string.h:119).

Complete GDB bt full output:

(gdb) bt full
#0  _dn_strrchr (c=58 ':', start=0x7f411010ebe0 "",
    p=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>) at dyn_string.h:119
No locals.
#1  parse_seeds (seeds=seeds@entry=0x7f4115f45df0, dc_name=dc_name@entry=0x7f4115f45db0,
    rack_name=rack_name@entry=0x7f4115f45da0, port_str=port_str@entry=0x7f4115f45dc0,
    address=address@entry=0x7f4115f45dd0, name=name@entry=0x7f4115f45de0, ptoken=ptoken@entry=0x7f4115f45e00)
    at dyn_gossip.c:401
        status = <optimized out>
        p = 0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>
        start = 0x7f411010ebe0 ""
        pname = <optimized out>
        port = 0x0
        rack = <optimized out>
        dc = <optimized out>
        token = 0x1 <Address 0x1 out of bounds>
        addr = <optimized out>
        k = 1
        pnamelen = <optimized out>
        portlen = 0
        racklen = 0
        dclen = 0
        tokenlen = 269544415
        addrlen = <optimized out>
        delim = "::::"
        t_end = <optimized out>
#2  0x000000000042d26c in gossip_update_seeds (seeds=<optimized out>, seeds=<optimized out>, sp=0x15600e0)
    at dyn_gossip.c:731
        p = 0x7f41100008bf ""
        q = <optimized out>
        dc_name = {len = 0, data = 0x0}
        ip = {len = 0, data = 0x0}
        token = {signum = 0, mag = 0x0, len = 0}
        temp = {len = 0, data = 0x7f411010ebe0 ""}
        rack_name = {len = 0, data = 0x0}
        port_str = {len = 0, data = 0x0}
        address = {len = 0, data = 0x0}
        start = 0x7f41100008c0 "node1:8101:us-west-2a:us-west-2:100|node2:8101:us-west-2b:us-west-2:100|node3"...
        seed_node = 0x7f41100008c0 "node1:8101:us-west-2a:us-west-2:100|node2:8101:us-west-2b:us-west-2:100|node3"...
        seed_node_len = 0
#3  gossip_loop (arg=0x15600e0) at dyn_gossip.c:796
        sp = <optimized out>
        gossip_interval = 30000000
        __FUNCTION__ = "gossip_loop"
#4  0x00007f41183e4dc5 in start_thread (arg=0x7f4115f46700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f4115f46700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139917517940480, -3413584272280359325, 0,
                139917517941184, 139917517940480, 22413840, 3324405120805793379, 3324416614702804579},
              mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0,
              canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#5  0x00007f4117e106ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.
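
In frame #0 above, p is 0xffffffffffffffff while start points at an empty string. If the backward scan only tests p >= start, that comparison passes and the next read of *p faults. Below is a minimal sketch of a guarded reverse search, assuming the caller can pass the buffer length. safe_strrchr is a hypothetical helper for illustration, not the actual Dynomite function, and this is not a confirmed fix for the issue.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Dynomite's _dn_strrchr): search backward for c
 * in the len bytes starting at buf. Returns NULL for an empty or NULL
 * buffer instead of forming a pointer before the start of the buffer. */
static const uint8_t *
safe_strrchr(const uint8_t *buf, size_t len, uint8_t c)
{
    if (buf == NULL) {
        return NULL;
    }

    /* Count down with an index so no pointer below buf is ever formed. */
    while (len > 0) {
        len--;
        if (buf[len] == c) {
            return &buf[len];
        }
    }

    return NULL;
}
```

With a zero-length buffer the loop body never runs, so the empty temp string seen in frame #2 (len = 0) would produce NULL rather than a read at an invalid address.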

diegopacheco avatar Aug 11 '17 17:08 diegopacheco

looking into it

shailesh33 avatar Aug 11 '17 17:08 shailesh33

what is the output of your get_seeds from florida?

shailesh33 avatar Aug 11 '17 17:08 shailesh33

@shailesh33

    start = 0x7f41100008c0 "node1:8101:us-west-2a:us-west-2:100|node2:8101:us-west-2b:us-west-2:100|node3"...
    seed_node = 0x7f41100008c0 "node1:8101:us-west-2a:us-west-2:100|node2:8101:us-west-2b:us-west-2:100|node3"...
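
Judging from the strings above, each seed entry appears to have five colon-separated fields, and entries are separated by '|'. Here is a rough sketch, assuming that layout, of a check that could run on the florida response before it is parsed. seed_entry_ok and seeds_ok are hypothetical names, not Dynomite functions. The check counts delimiters only, so it does not catch empty fields and would reject addresses that themselves contain ':'.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical check: an entry such as "node1:8101:us-west-2a:us-west-2:100"
 * must be non-empty and contain exactly four ':' delimiters. */
static bool
seed_entry_ok(const char *entry, size_t len)
{
    size_t colons = 0;

    if (len == 0) {
        return false;
    }
    for (size_t i = 0; i < len; i++) {
        if (entry[i] == ':') {
            colons++;
        }
    }
    return colons == 4;
}

/* Walk a '|'-separated seed list and validate every entry. */
static bool
seeds_ok(const char *seeds)
{
    if (seeds == NULL || *seeds == '\0') {
        return false;
    }

    const char *entry = seeds;
    for (;;) {
        const char *bar = strchr(entry, '|');
        size_t len = bar ? (size_t)(bar - entry) : strlen(entry);

        if (!seed_entry_ok(entry, len)) {
            return false;
        }
        if (bar == NULL) {
            return true;   /* last entry was valid */
        }
        entry = bar + 1;   /* move past the '|' */
    }
}
```

An empty response, a trailing '|', or a cut-off last entry such as "node3" would all be rejected before any backward scan for ':' takes place.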

diegopacheco avatar Aug 11 '17 17:08 diegopacheco

Do you have a handle to the seeds array that is at the top of the stack? Do you have a core dump?

shailesh33 avatar Aug 11 '17 17:08 shailesh33

Also, I have learned that it's not a bad idea to disable compiler optimizations in production clusters for exactly this reason, so that we have all the information we need when a crash happens.

shailesh33 avatar Aug 11 '17 17:08 shailesh33

Looks like it's auto-truncated. Can you get the full output from florida?

shailesh33 avatar Aug 11 '17 17:08 shailesh33

I do have a core dump. I was able to reproduce it locally twice. It is very intermittent.

diegopacheco avatar Aug 11 '17 18:08 diegopacheco