dynomite
Segfault when using florida_provider
Hi,
Dynomite seems to crash intermittently when I use florida_provider. I'm running version dynomite-v0.5.7-13_RectifyConsistency from the master branch. The backtrace from the core dump is as follows:
(gdb) bt full
#0 _dn_strrchr (c=58 ':', start=0x7f24ec10f890 "",
p=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>) at dyn_string.h:119
No locals.
#1 parse_seeds (seeds=seeds@entry=0x7f24f47f6db0, dc_name=dc_name@entry=0x7f24f47f6d70,
rack_name=rack_name@entry=0x7f24f47f6d60, port_str=port_str@entry=0x7f24f47f6d80,
address=address@entry=0x7f24f47f6d90, name=name@entry=0x7f24f47f6da0, ptoken=ptoken@entry=0x7f24f47f6dc0)
at dyn_gossip.c:401
status = <optimized out>
p = 0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>
start = 0x7f24ec10f890 ""
pname = <optimized out>
port = 0x0
rack = <optimized out>
dc = <optimized out>
token = 0x1 <Address 0x1 out of bounds>
addr = <optimized out>
k = 1
pnamelen = <optimized out>
portlen = 0
racklen = 0
dclen = 0
tokenlen = 3960535183
addrlen = <optimized out>
delim = "::::"
t_end = <optimized out>
#2 0x000000000043249c in gossip_update_seeds (seeds=<optimized out>, seeds=<optimized out>, sp=0x1c370e0)
at dyn_gossip.c:731
p = 0x7f24ec0008bf ""
q = <optimized out>
dc_name = {len = 0, data = 0x0}
ip = {len = 0, data = 0x0}
token = {signum = 0, mag = 0x0, len = 0}
temp = {len = 0, data = 0x7f24ec10f890 ""}
rack_name = {len = 0, data = 0x0}
port_str = {len = 0, data = 0x0}
address = {len = 0, data = 0x0}
start = 0x7f24ec0008c0 "172.31.11.240:8101:racka:eu-west-1:0|172.31.24.52:8101:rackb:eu-west-1:0"
seed_node = 0x7f24ec0008c0 "172.31.11.240:8101:racka:eu-west-1:0|172.31.24.52:8101:rackb:eu-west-1:0"
seed_node_len = 0
#3 gossip_loop (arg=0x1c370e0) at dyn_gossip.c:796
sp = <optimized out>
gossip_interval = 30000000
__FUNCTION__ = "gossip_loop"
#4 0x00007f24f711cdc5 in start_thread (arg=0x7f24f47f7700) at pthread_create.c:308
__res = <optimized out>
pd = 0x7f24f47f7700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139796697544448, -6367975670704337661, 0,
139796697545152, 139796697544448, 29585936, 6478866919655378179, 6478861264210398467},
mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0,
canceltype = 0}}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
#5 0x00007f24f6b4828d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.
@dzhou121 we use florida_provider as well. How often do you see this happening?
@ipapapa Sometimes it crashes only about 10 minutes after restarting; sometimes it takes half an hour.
hello guys,
@dzhou121 did you sort this out somehow?
Does this work on branch v0.5.9? There were some fixes around that code.
@shailesh33 ,
Did you guys apply this to 0.5.8?
The main issue with 0.5.9 for us was https://github.com/Netflix/dynomite/issues/432. Did you guys fix it?
@shailesh33 @ipapapa
Just adding more information.
Basically, something goes wrong after the gossip update: the crash happens right after Dynomite calls parse_seeds (dyn_gossip.c:401), inside _dn_strrchr (dyn_string.h:119).
Complete GDB bt full
(gdb) bt full
#0 _dn_strrchr (c=58 ':', start=0x7f411010ebe0 "",
p=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>) at dyn_string.h:119
No locals.
#1 parse_seeds (seeds=seeds@entry=0x7f4115f45df0, dc_name=dc_name@entry=0x7f4115f45db0,
rack_name=rack_name@entry=0x7f4115f45da0, port_str=port_str@entry=0x7f4115f45dc0,
address=address@entry=0x7f4115f45dd0, name=name@entry=0x7f4115f45de0, ptoken=ptoken@entry=0x7f4115f45e00)
at dyn_gossip.c:401
status = <optimized out>
p = 0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>
start = 0x7f411010ebe0 ""
pname = <optimized out>
port = 0x0
rack = <optimized out>
dc = <optimized out>
token = 0x1 <Address 0x1 out of bounds>
addr = <optimized out>
k = 1
pnamelen = <optimized out>
portlen = 0
racklen = 0
dclen = 0
tokenlen = 269544415
addrlen = <optimized out>
delim = "::::"
t_end = <optimized out>
#2 0x000000000042d26c in gossip_update_seeds (seeds=<optimized out>, seeds=<optimized out>, sp=0x15600e0)
at dyn_gossip.c:731
p = 0x7f41100008bf ""
q = <optimized out>
dc_name = {len = 0, data = 0x0}
ip = {len = 0, data = 0x0}
token = {signum = 0, mag = 0x0, len = 0}
temp = {len = 0, data = 0x7f411010ebe0 ""}
rack_name = {len = 0, data = 0x0}
port_str = {len = 0, data = 0x0}
address = {len = 0, data = 0x0}
start = 0x7f41100008c0 "node1:8101:us-west-2a:us-west-2:100|node2:8101:us-west-2b:us-west-2:100|node3"...
seed_node = 0x7f41100008c0 "node1:8101:us-west-2a:us-west-2:100|node2:8101:us-west-2b:us-west-2:100|node3"...
seed_node_len = 0
#3 gossip_loop (arg=0x15600e0) at dyn_gossip.c:796
sp = <optimized out>
gossip_interval = 30000000
__FUNCTION__ = "gossip_loop"
#4 0x00007f41183e4dc5 in start_thread (arg=0x7f4115f46700) at pthread_create.c:308
__res = <optimized out>
pd = 0x7f4115f46700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139917517940480, -3413584272280359325, 0,
139917517941184, 139917517940480, 22413840, 3324405120805793379, 3324416614702804579},
mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0,
canceltype = 0}}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
#5 0x00007f4117e106ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.
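In frames #0 and #1 above, start points at an empty string and p is 0xffffffffffffffff, which suggests (this is a guess, not a confirmed root cause) that the backward search for ':' started on an empty or unset buffer. Below is a minimal sketch in C of a backward search that refuses empty input. The name safe_strrchr and its signature are hypothetical and are not Dynomite's _dn_strrchr.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not Dynomite's code: search backwards for byte c
 * in the len bytes starting at start. A NULL or empty buffer yields NULL
 * instead of a pointer computed from "last byte of nothing".
 */
static const uint8_t *
safe_strrchr(const uint8_t *start, size_t len, uint8_t c)
{
    if (start == NULL || len == 0) {
        return NULL;                /* nothing to search */
    }

    /* Walk i from len down to 1 so the index never goes below zero. */
    for (size_t i = len; i > 0; i--) {
        if (start[i - 1] == c) {
            return &start[i - 1];
        }
    }

    return NULL;                    /* c does not occur in the buffer */
}
```

Passing an explicit length rather than an end pointer means the caller never has to form start + len - 1, which is where an empty string becomes dangerous.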
looking into it
what is the output of your get_seeds from florida?
@shailesh33
start = 0x7f41100008c0 "node1:8101:us-west-2a:us-west-2:100|node2:8101:us-west-2b:us-west-2:100|node3"...
seed_node = 0x7f41100008c0 "node1:8101:us-west-2a:us-west-2:100|node2:8101:us-west-2b:us-west-2:100|node3"...
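To rule out a malformed payload, the seeds string could be checked before it is parsed. The sketch below is in C and assumes, based only on the strings visible in the backtraces, that entries are separated by '|' and that each entry has five ':'-separated fields. The function name seeds_look_well_formed is hypothetical and is not part of Dynomite.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical check, not part of Dynomite: returns true only if every
 * '|'-separated entry in seeds has exactly four ':' delimiters and no
 * empty field (the five-field layout is assumed from the backtrace).
 */
static bool
seeds_look_well_formed(const char *seeds)
{
    if (seeds == NULL || *seeds == '\0') {
        return false;                       /* empty payload */
    }

    const char *entry = seeds;
    for (;;) {
        size_t len = strcspn(entry, "|");   /* length of this entry */
        size_t colons = 0;
        size_t field_len = 0;

        for (size_t i = 0; i < len; i++) {
            if (entry[i] == ':') {
                if (field_len == 0) {
                    return false;           /* empty field before ':' */
                }
                colons++;
                field_len = 0;
            } else {
                field_len++;
            }
        }

        if (colons != 4 || field_len == 0) {
            return false;                   /* wrong field count or empty last field */
        }
        if (entry[len] == '\0') {
            return true;                    /* last entry was fine */
        }
        entry += len + 1;                   /* step past the '|' */
    }
}
```

A truncated entry such as a bare node3 at the end of the list would be rejected by this check, so logging the raw payload whenever it fails would show whether florida ever returns such a string.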
Do you have a handle to the seeds array that is at the top of the stack? Do you have a core dump?
Also, I have learned that it's not a bad idea to disable optimizations in production clusters for exactly these reasons, so that all the information is available when we need it.
Looks like it's auto-truncated. Can you get the full output from florida?
I do have a core dump. I was able to reproduce it locally twice. It is very intermittent.