
Dynomite server crashes in Memcached mode

Open ADenisUA opened this issue 7 years ago • 18 comments

When I start Dynomite (I'm using code from the v0.6 branch), it often fails (about 50% of the time) with:

[2017-11-20 12:57:52.619] conf_push_scalar:541 push 'memcached-4:8101:memcached-rack2:dc2:4294967294'
[2017-11-20 12:57:52.620] conf_handler:1291 conf handler on dyn_seeds: memcached-4:8101:memcached-rack2:dc2:4294967294
[2017-11-20 12:57:52.622] _dn_alloc:259 malloc(24) at 0x1c53e30 @ ../dyn_array.c:64
[2017-11-20 12:57:52.623] conf_server_init:134 init conf server 0x1c538c0
[2017-11-20 12:57:52.660] dn_resolve_inet:520 address resolution of node 'memcached-4' service '8101' failed: Name or service not known
[2017-11-20 12:57:52.660] _dn_free:305 free(0x1c53ed0) @ dyn_string.c:59
[2017-11-20 12:57:52.660] conf_handler:1302 conf: directive "dyn_seeds" has an invalid value
[2017-11-20 12:57:52.660] conf_pop_scalar:564 pop 'memcached-4:8101:memcached-rack2:dc2:4294967294'
[2017-11-20 12:57:52.660] _dn_free:305 free(0x1c53e50) @ dyn_string.c:59
dynomite: configuration file '/etc/dynomite/conf/memcached-3.yml' syntax is invalid
[2017-11-20 12:57:52.660] conf_pop_scalar:564 pop 'dyn_seeds'
[2017-11-20 12:57:52.660] _dn_free:305 free(0x1c53810) @ dyn_string.c:59
[2017-11-20 12:57:52.660] conf_pop_scalar:564 pop 'dyn_o_mite'
[2017-11-20 12:57:52.660] _dn_free:305 free(0x1c53790) @ dyn_string.c:59
[2017-11-20 12:57:52.660] _dn_free:305 free(0x1c41b70) @ ../dyn_array.c:82
[2017-11-20 12:57:52.660] _dn_free:305 free(0x1c53710) @ dyn_string.c:59
[2017-11-20 12:57:52.660] dn_stacktrace:326 [0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390) [0x7fa2eaa2c390] ??:0
[2017-11-20 12:57:52.664] dn_stacktrace:326 [1] dynomite(string_deinit+0x3) [0x429953] /dynomite/src/dyn_string.c:58
[2017-11-20 12:57:52.669] dn_stacktrace:326 [2] dynomite(conf_destroy+0x55) [0x410385] /dynomite/src/dyn_conf.c:146
[2017-11-20 12:57:52.674] dn_stacktrace:326 [3] dynomite(conf_create+0x170) [0x410680] /dynomite/src/dyn_conf.c:2233
[2017-11-20 12:57:52.681] dn_stacktrace:326 [4] dynomite(core_start+0x80) [0x415350] /dynomite/src/dyn_core.c:281
[2017-11-20 12:57:52.685] dn_stacktrace:326 [5] dynomite(main+0x64f) [0x40cdff] /dynomite/src/dynomite.c:574
[2017-11-20 12:57:52.690] dn_stacktrace:326 [6] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fa2ea368830] ??:0
[2017-11-20 12:57:52.693] dn_stacktrace:326 [7] dynomite(_start+0x29) [0x40d2a9] ??:?
[2017-11-20 12:57:52.698] signal_handler:132 signal 11 (SIGSEGV) received, core dumping

meaning it can't resolve the other Dynomite node. Also, if one node goes down and the other node tries to replicate to it, the second node crashes too. I'm running several Dynomite nodes in Docker, so there is definitely a startup race between the nodes, but I'd guess Dynomite should handle such cases gracefully (retry the connection a few times after some timeout?). What would you suggest to avoid this problem?
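One possible workaround: delay starting Dynomite until every seed host resolves. A sketch of an entrypoint helper (hypothetical, not part of Dynomite; the retry counts are illustrative):

```python
# Hypothetical entrypoint helper: block until all dyn_seeds hosts resolve,
# so Dynomite isn't started before Docker's DNS knows about its peers.
import socket
import time

def _resolves(host):
    """True if `host` currently resolves via DNS."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

def wait_for_hosts(hosts, retries=30, delay=2.0):
    """Return True once every host resolves; False after `retries` rounds."""
    for attempt in range(retries):
        if all(_resolves(h) for h in hosts):
            return True
        if attempt < retries - 1:
            time.sleep(delay)
    return False
```

A wrapper script could call, say, `wait_for_hosts(["memcached-4"])` and only exec dynomite on success (hostname taken from the config below).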

Here is my config (for reference):

dyn_o_mite:
  datacenter: dc2
  dyn_listen: 0.0.0.0:8101
  dyn_port: 8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
  - memcached-4:8101:memcached-rack2:dc2:4294967294
  listen: 0.0.0.0:8102
  rack: memcached-rack2
  servers:
  - 127.0.0.1:11211:1
  timeout: 150000
  tokens: 2147483647
  data_store: 1

dyn_o_mite:
  datacenter: dc2
  dyn_listen: 0.0.0.0:8101
  dyn_port: 8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
  - memcached-3:8101:memcached-rack2:dc2:2147483647
  listen: 0.0.0.0:8102
  rack: memcached-rack2
  servers:
  - 127.0.0.1:11211:1
  timeout: 150000
  tokens: 4294967294
  data_store: 1

2) When I use the Spy Memcached client to save data, it works fine. But when I try to read, I get:

[2017-11-20 12:44:49.561] server_rsp_forward:834 <CONN_CLIENT 0x528b660 11 from '172.18.0.1:54290'> <REQ 0x556fc00 4124:0::1 REQ_MC_GET, len:14> RECEIVED <RSP 0x5b1ce00 4126:0 RSP_MC_END len:5>
[2017-11-20 12:44:49.561] stats_server_to_metric:1800 metric 'read_responses' for server
[2017-11-20 12:44:49.561] _stats_server_incr:1816 incr field 'read_responses' to 1
[2017-11-20 12:44:49.561] stats_server_to_metric:1800 metric 'read_response_bytes' for server
[2017-11-20 12:44:49.561] _stats_server_incr_by:1846 incr by field 'read_response_bytes' to 5
[2017-11-20 12:44:49.561] msg_local_one_rsp_handler:1005 <REQ 0x556fc00 4124:0::1 REQ_MC_GET, len:14> SELECTED <RSP 0x5b1ce00 4126:0 RSP_MC_END len:5>
[2017-11-20 12:44:49.564] dn_stacktrace:326 [0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390) [0x7f07351b4390] ??:0
[2017-11-20 12:44:49.578] dn_stacktrace:326 [1] dynomite(memcache_post_coalesce+0x114) [0x4331a4] /dynomite/src/proto/dyn_memcache.c:1556
[2017-11-20 12:44:49.596] dn_stacktrace:326 [2] dynomite(req_done+0x174) [0x420994] /dynomite/src/dyn_request.c:148
[2017-11-20 12:44:49.612] dn_stacktrace:326 [3] dynomite() [0x40db52] /dynomite/src/dyn_client.c:282
[2017-11-20 12:44:49.622] dn_stacktrace:326 [4] dynomite(msg_recv+0x575) [0x420385] /dynomite/src/dyn_message.c:890
[2017-11-20 12:44:49.632] dn_stacktrace:326 [5] dynomite(core_core+0xff) [0x414aff] /dynomite/src/dyn_core.c:386
[2017-11-20 12:44:49.641] dn_stacktrace:326 [6] dynomite(event_wait+0xeb) [0x43a90b] /dynomite/src/event/dyn_epoll.c:259 (discriminator 2)
[2017-11-20 12:44:49.713] dn_stacktrace:326 [7] dynomite(core_loop+0x166) [0x415996] /dynomite/src/dyn_core.c:650 (discriminator 4)
[2017-11-20 12:44:49.732] dn_stacktrace:326 [8] dynomite(main+0x671) [0x40ce21] /dynomite/src/dynomite.c:585
[2017-11-20 12:44:49.743] dn_stacktrace:326 [9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f0734af0830] ??:0
[2017-11-20 12:44:49.755] dn_stacktrace:326 [10] dynomite(_start+0x29) [0x40d2a9] ??:?
[2017-11-20 12:44:49.774] signal_handler:132 signal 11 (SIGSEGV) received, core dumping
/dynomite/startup.sh: line 6: 9 Segmentation fault (core dumped) dynomite --conf-file=/etc/dynomite/conf/${DYN_CONFIG_FILE} --verbosity=11

Is this a known problem?

ADenisUA avatar Nov 21 '17 11:11 ADenisUA

Please create separate issues for different crashes. For the first one, it complains that the syntax of the config file is invalid; can you post your conf file for that node? The second one could be a bug, since that is part of the code I changed recently, and we do not actively use memcache internally, so a few things may have slipped through the cracks. Can you please check whether v0.5.9 works for you?

shailesh33 avatar Nov 25 '17 18:11 shailesh33

For the first issue - I've mentioned my config above:

dyn_o_mite:
  datacenter: dc2
  dyn_listen: 0.0.0.0:8101
  dyn_port: 8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
  - memcached-4:8101:memcached-rack2:dc2:4294967294
  listen: 0.0.0.0:8102
  rack: memcached-rack2
  servers:
  - 127.0.0.1:11211:1
  timeout: 150000
  tokens: 2147483647
  data_store: 1

dyn_o_mite:
  datacenter: dc2
  dyn_listen: 0.0.0.0:8101
  dyn_port: 8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
  - memcached-3:8101:memcached-rack2:dc2:2147483647
  listen: 0.0.0.0:8102
  rack: memcached-rack2
  servers:
  - 127.0.0.1:11211:1
  timeout: 150000
  tokens: 4294967294
  data_store: 1

let me try 0.5.9

thanks for the feedback

ADenisUA avatar Nov 26 '17 08:11 ADenisUA

Let me know once you've tried 0.5.9.

shailesh33 avatar Dec 12 '17 03:12 shailesh33

Same problem with 0.5.9

ADenisUA avatar Dec 12 '17 09:12 ADenisUA

I can replicate the above crash WITHOUT replication on Dynomite 0.6.4.

0.5.9-5 does not crash without replication; however, results come back from memcache only sporadically (set calls are sent to memcache, but subsequent gets fail). I can also reproduce the crash in 0.5.9-5 with replication/seeding enabled.

The same configs with redis work fine on all versions.

Let me know if I can provide any information to help debug this. One difficulty: I'm not a developer, but I did stay at a Holiday Inn last night.

Thanks!

whoiswes avatar Jan 16 '18 18:01 whoiswes

@whoiswes Dynomite at Netflix is predominantly used with Redis as the data store, so the memcache support might have been broken recently. We use EVCache for memcache at Netflix; have a look at https://github.com/Netflix/EVCache. I will try to fix this. Is there any reason you prefer memcache over Redis?

shailesh33 avatar Jan 16 '18 23:01 shailesh33

@shailesh33 - thanks for the note on EVCache. I don't think that fits our use case but will look more as time permits. We are currently using memcache with twemproxy/nutcracker as a frontend (not that I prefer it over redis, it's legacy code and politics 😉 ) - since dynomite supports (or used to, anyway) memcache, we were hoping it would be close to a drop-in replacement for twemproxy that would provide us cross-datacenter replication.

I do have a proposal in to add redis to our stack but again....politics.

If I can be of assistance in testing/debugging please let me know, I'd be happy to help.

whoiswes avatar Jan 17 '18 01:01 whoiswes

I believe there are a few changes in nc_memcache.c that have not been ported to dyn_memcache.c. I would start there if you are interested in contributing. Similar issues have been reported before (#220). There is a branch I started in the past but never finished.

ipapapa avatar Jan 20 '18 01:01 ipapapa

@ipapapa - thanks for the pointers. I did a quick diff between nc_memcache.c and dyn_memcache.c and I could not find anything that appeared to be missing (to my very untrained eye). Not sure where to go from here as I have zero practical C experience, but I will certainly help in any way my experience level allows.

Thanks!

whoiswes avatar Jan 26 '18 18:01 whoiswes

@whoiswes let's start by simplifying this: run a single Dynomite node with memcache and see if your memcache calls fail similarly. Are you getting failures or crashes? Try using the dev branch.

Once you have that, tell me what commands you are running (preferably as a basic command-line script I can easily run) and we can take it from there.

shailesh33 avatar Jan 26 '18 20:01 shailesh33

@shailesh33

I had done that before but here's a more complete dump.

Oracle Linux 7.1 (3.10.0)

Dynomite minimal config:

dyn_o_mite:
  data_store: 1
  listen: 0.0.0.0:11211
  dyn_listen: 0.0.0.0:13211
  tokens: 12345678
  servers:
  - 127.0.0.1:12211:1

memcache 1.5.4, running interactively on a separate console

memcache -U 12211 -P 12211 -vv

I'm primarily a Python guy, so I'm using pymemcache. Running a client against the Dynomite port (11211), a set works, but a get makes the Dynomite process dump core. Memcache sees the get command and returns the value.

>>> from pymemcache.client import Client
>>> dyn = Client(('10.58.141.41', 11211))
>>> dyn.set('foo','bar')
True
>>> dyn.get('foo')
[2018-01-26 21:16:34.684] dn_stacktrace:326 [0] /lib64/libpthread.so.0(+0xf5e0) [0x2b49652615e0]
??:0
[2018-01-26 21:16:34.687] dn_stacktrace:326 [1] src/dynomite(memcache_post_coalesce+0x112) [0x429492]
/data/build/dynomite/src/proto/dyn_memcache.c:1561
[2018-01-26 21:16:34.691] dn_stacktrace:326 [2] src/dynomite(req_done+0x192) [0x41b862]
/data/build/dynomite/src/dyn_request.c:151
[2018-01-26 21:16:34.694] dn_stacktrace:326 [3] src/dynomite() [0x40d6f8]
/data/build/dynomite/src/dyn_client.c:282
[2018-01-26 21:16:34.698] dn_stacktrace:326 [4] src/dynomite() [0x41ddd8]
/data/build/dynomite/src/dyn_server.c:221
[2018-01-26 21:16:34.701] dn_stacktrace:326 [5] src/dynomite() [0x41df14]
/data/build/dynomite/src/dyn_server.c:243 (discriminator 2)
[2018-01-26 21:16:34.705] dn_stacktrace:326 [6] src/dynomite(core_loop+0xdd) [0x41348d]
/data/build/dynomite/src/dyn_core.c:450
[2018-01-26 21:16:34.708] dn_stacktrace:326 [7] src/dynomite(main+0x6a0) [0x40c940]
/data/build/dynomite/src/dynomite.c:585
[2018-01-26 21:16:34.711] dn_stacktrace:326 [8] /lib64/libc.so.6(__libc_start_main+0xf5) [0x2b4965791c05]
??:0
[2018-01-26 21:16:34.713] dn_stacktrace:326 [9] src/dynomite() [0x40cd39]
??:?
[2018-01-26 21:16:34.716] signal_handler:132 signal 11 (SIGSEGV) received, core dumping
Segmentation fault (core dumped)

Note that I can run the same commands against memcache (or twemproxy/nutcracker, for that matter) without issue:

>>> from pymemcache.client import Client
>>> mc = Client(('10.58.141.41', 12211))
>>> mc.set('foo', 'bar')
True
>>> mc.get('foo')
b'bar'

Apologies if I'm missing anything obvious here. Please let me know if I can provide anything to further assist!

EDIT: realized that I was getting a different dump when the tokens parameter was not set. Corrected and post updated.

Thanks,

Wes

whoiswes avatar Jan 26 '18 20:01 whoiswes

I can reproduce this as well using dynomite inside Docker (built with the official Dockerfile). My config:

dynomite:
  dyn_listen: 0.0.0.0:8101
  listen: 0.0.0.0:8102
  stats_listen: 0.0.0.0:22222
  data_store: 1
  tokens: 4294967294
  servers:
    - memcached-1:11211:10

I saved that as dynomite.yml, started a memcached container, and ran the following to start the Dynomite container: docker run -v $PWD/dynomite.yml:/etc/dynomite.yml -p 8102:8102 --rm --entrypoint '/dynomite/src/dynomite' --net testbed_default dynomite -c /etc/dynomite.yml -v11

Then, I ran the following to trigger the crash: echo -n "get foo\r\n" | nc localhost 8102

The full log file from Dynomite is at output.txt
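As an aside, depending on the shell, `echo -n "get foo\r\n"` may send a literal backslash-r-backslash-n rather than a real CRLF, so the crash may be triggered by slightly different bytes than a real client would send. A raw-socket sketch that guarantees proper framing (`raw_get` is a made-up helper, and the host/port below assume the port mapping from the docker run command above):

```python
# Raw-socket version of the nc trigger; guarantees a real CRLF terminator
# on the memcache text-protocol get command, regardless of shell behavior.
import socket

def raw_get(host, port, key, timeout=2.0):
    """Send `get <key>\r\n` and return whatever the server replies with."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(b"get " + key.encode("ascii") + b"\r\n")
        return s.recv(4096)
```

Against the container above that would be `raw_get("localhost", 8102, "foo")`.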

MasonM avatar Mar 15 '18 21:03 MasonM

I'm getting the same "invalid syntax" error for dyn_seeds, and I'm using Redis with a config I've pretty much copied from apparently working configs found in this repository.

dyn_o_mite:
  datacenter: s-v
  rack: rack1
  dyn_listen: 0.0.0.0:8101
  data_store: 0
  listen: 0.0.0.0:8102
  tokens: '1383429731'
  dyn_seeds:
  - dynomite-2:8101:rack1:s-v:1383429731
  dyn_seed_provider: simple_provider
  servers:
  - 127.0.0.1:22122:1
  stats_listen: 0.0.0.0:22222
  pem_key_file: /dynomite/conf/dynomite-key.pem
  secure_server_option: datacenter
  timeout: 150000

dyn_o_mite:
  datacenter: s-v
  rack: rack1
  dyn_listen: 0.0.0.0:8101
  data_store: 0
  listen: 0.0.0.0:8102
  tokens: '1383429731'
  dyn_seeds:
  - dynomite-1:8101:rack1:s-v:1383429731
  dyn_seed_provider: simple_provider
  servers:
  - 127.0.0.1:22122:1
  stats_listen: 0.0.0.0:22222
  pem_key_file: /dynomite/conf/dynomite-key.pem
  secure_server_option: datacenter
  timeout: 150000

jazoom avatar Apr 13 '18 00:04 jazoom

To add to my previous comment: it looks like the problem was the services in Docker repeatedly crashing because they couldn't resolve each other, until eventually they happened to be up at the same time.

I don't think that should be reported as "invalid syntax" since the syntax is obviously correct.
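For reference, the two cases can be told apart with a quick pre-flight check before starting Dynomite. A sketch (`check_seed` is a made-up helper, not Dynomite code; the `host:port:rack:dc:token` layout is the one used by the configs in this thread):

```python
# Hypothetical pre-flight check for one dyn_seeds entry: distinguishes a
# genuinely malformed entry from a host that merely doesn't resolve yet,
# which Dynomite lumps together under the same "invalid value" error.
import socket

def check_seed(seed):
    """Classify a seed as 'ok', 'syntax error', or 'unresolvable host'."""
    parts = seed.split(":")
    if len(parts) != 5 or not parts[1].isdigit():
        return "syntax error"
    host, port = parts[0], int(parts[1])
    try:
        socket.getaddrinfo(host, port)  # DNS lookup only, no connection
    except socket.gaierror:
        return "unresolvable host"
    return "ok"
```

With my config, `check_seed("dynomite-2:8101:rack1:s-v:1383429731")` would report "unresolvable host" rather than a syntax problem while the other container is still starting.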

It seems the dynomite instances still don't sync, but I guess that's another issue.

jazoom avatar Apr 13 '18 00:04 jazoom

The "invalid syntax" error is reported for dynomite.yml itself, not for communication failures. I am not following what issue you are facing or how it relates to Docker container communication.

ipapapa avatar Apr 13 '18 00:04 ipapapa

I'm saying the logs from Dynomite are telling me the syntax is invalid even though it's apparently not invalid. That is how I found this issue. I searched for the error I was seeing.

jazoom avatar Apr 13 '18 00:04 jazoom

Can you send us your Dynomite logs?

ipapapa avatar Apr 13 '18 00:04 ipapapa

Sigh. I just spun up the services again and am now not getting this error. I have not changed the config file at all, so the syntax is the same as before when it was throwing the invalid syntax error.

Still not getting the data synced between the services, but that's another issue, and I think I'm done trying to make Dynomite work.

jazoom avatar Apr 13 '18 00:04 jazoom