nchan icon indicating copy to clipboard operation
nchan copied to clipboard

Assertion 'id->tagactive != i' failed

Open ivanovv opened this issue 7 years ago • 13 comments

Got an assert with nchan version: 1.2.1 Assertion line:

ker process: /build/nginx/debian/modules/nchan/src/util/nchan_output.c:420: msgtag_to_strptr: Assertion `id->tagactive != i' failed.

Lots of

2018/09/12 18:41:07 [error] 30072#30072: MEMSTORE:03: multimsg 0000000002F76660 timeout!!
2018/09/12 18:41:07 [error] 30072#30072: MEMSTORE:03: multimsg 00000000036E72A0 timeout!!
2018/09/12 18:41:07 [error] 30072#30072: MEMSTORE:03: multimsg 00000000036E73D0 timeout!!

lines after it.

Will it be possible to look into this? I have a crash dump (but nginx executable has no debug symbols), debug log, error log

gdb bt:

Core was generated by `nginx: worker process is shutting down          '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000513920 in nchan_memstore_find_chanhead ()
(gdb) bt
#0  0x0000000000513920 in nchan_memstore_find_chanhead ()
#1  0x000000000050bb55 in ?? ()
#2  0x000000000050abf9 in ?? ()
#3  0x000000000044f229 in ?? ()
#4  0x00000000004452b7 in ngx_process_events_and_timers ()
#5  0x000000000044cbf5 in ?? ()
#6  0x000000000044b6e0 in ngx_spawn_process ()
#7  0x000000000044ce64 in ?? ()
#8  0x000000000044db9f in ngx_master_process_cycle ()
#9  0x0000000000426a2f in main ()
total published messages: 6855521
stored messages: 668
shared memory used: 4964K
shared memory limit: 131072K
channels: 2348
subscribers: 5446
redis pending commands: 0
redis connected servers: 0
total interprocess alerts received: 372399120
interprocess alerts in transit: 21784
interprocess queued alerts: 21009
total interprocess send delay: 0
total interprocess receive delay: 40
nchan version: 1.2.1

ivanovv avatar Sep 13 '18 17:09 ivanovv

Do you have the Nginx error log from when this happened? I'd like to see it from the time Nginx was started until the crash, if possible.

Here you go: https://www.dropbox.com/s/3er21w0up9thda0/error.log.2.gz?dl=0

BTW, this is a production system, so last nginx restart was at 2018/09/06 22:21:58

ivanovv avatar Sep 13 '18 17:09 ivanovv

I can also get you a debug log, though it will be a huge file

ivanovv avatar Sep 13 '18 17:09 ivanovv

It happened again:

2018/09/17 10:02:27 [error] 19964#19964: *1860873 open() "/u/apps/site_production/current/public/uploads/image/1012019/TOLLE_event_12_september_20_uur.png" failed (2: No such file or directory), client: 173.252.95.10, server: *.site.com, request: "GET /uploads/image/1012019/TOLLE_event_12_september_20_uur.png HTTP/1.1", host: "app.site.com"
ker process: /build/nginx/debian/modules/nchan/src/util/nchan_output.c:420: msgtag_to_strptr: Assertion `id->tagactive != i' failed.
2018/09/17 10:10:43 [alert] 19962#19962: worker process 19963 exited on signal 6 (core dumped)
2018/09/17 10:10:43 [alert] 19962#19962: worker process 30509 exited on signal 11 (core dumped)
2018/09/17 10:10:43 [alert] 19962#19962: worker process 30512 exited on signal 11 (core dumped)
2018/09/17 10:10:43 [alert] 19962#19962: worker process 30516 exited on signal 11 (core dumped)
2018/09/17 10:10:43 [alert] 19962#19962: worker process 30519 exited on signal 11 (core dumped)
2018/09/17 10:10:43 [alert] 19962#19962: worker process 30522 exited on signal 11 (core dumped)
2018/09/17 10:10:44 [alert] 19962#19962: worker process 30525 exited on signal 11 (core dumped)
2018/09/17 10:10:44 [alert] 19962#19962: worker process 30528 exited on signal 11 (core dumped)
2018/09/17 10:10:44 [alert] 19962#19962: worker process 30530 exited on signal 11 (core dumped)
2018/09/17 10:10:44 [alert] 19962#19962: worker process 30532 exited on signal 11 (core dumped)
2018/09/17 10:10:44 [alert] 19962#19962: worker process 30534 exited on signal 11 (core dumped)
2018/09/17 10:10:44 [alert] 19962#19962: worker process 30536 exited on signal 11 (core dumped)
2018/09/17 10:10:44 [alert] 19962#19962: worker process 30538 exited on signal 11 (core dumped)
2018/09/17 10:10:45 [alert] 19962#19962: worker process 30540 exited on signal 11 (core dumped)
2018/09/17 10:10:45 [error] 30542#30542: IPC-HANDLERS(0):Got keepalive for expired channel /event~WmfqPktv2tvy4psVRblNolLLKMKMEj6npQ
2018/09/17 10:10:50 [error] 19964#19964: MEMSTORE:01: multimsg 000000000358AA90 timeout!!
2018/09/17 10:10:50 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002B80880 timeout!!
2018/09/17 10:10:50 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003BA4F90 timeout!!
2018/09/17 10:10:50 [error] 19964#19964: MEMSTORE:01: multimsg 00000000029051B0 timeout!!
2018/09/17 10:10:50 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002AAB960 timeout!!
2018/09/17 10:10:50 [error] 19964#19964: MEMSTORE:01: multimsg 00000000029406E0 timeout!!
2018/09/17 10:10:51 [error] 19964#19964: MEMSTORE:01: multimsg 00000000039539F0 timeout!!
2018/09/17 10:11:03 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002C2F1F0 timeout!!
2018/09/17 10:11:03 [error] 19964#19964: MEMSTORE:01: multimsg 00000000039F67F0 timeout!!
2018/09/17 10:11:03 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003F0B050 timeout!!
2018/09/17 10:11:03 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003F0B1E0 timeout!!
2018/09/17 10:11:03 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003F0B370 timeout!!
2018/09/17 10:11:03 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003F0B500 timeout!!
2018/09/17 10:11:03 [error] 19964#19964: MEMSTORE:01: multimsg 00000000043DB1F0 timeout!!
2018/09/17 10:11:03 [error] 19964#19964: MEMSTORE:01: multimsg 00000000025D0570 timeout!!
2018/09/17 10:11:03 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003792830 timeout!!
2018/09/17 10:11:03 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002EB8600 timeout!!
2018/09/17 10:11:03 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002EB8790 timeout!!
2018/09/17 10:11:03 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002EB8920 timeout!!
2018/09/17 10:11:03 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003894080 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 00000000025D4D30 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 00000000045359C0 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002434110 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 000000000394F420 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 00000000027EA2A0 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002A20E60 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003D7FD30 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 00000000037D5CF0 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 00000000041C3570 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002C367B0 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002956520 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 000000000326E710 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 000000000393E8C0 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 00000000044F0A50 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002957C30 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003F4ED50 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003AC9640 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 00000000040D75A0 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003F4E000 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003AC9D50 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002D4D580 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 000000000294E210 timeout!!
2018/09/17 10:11:04 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002604E70 timeout!!
2018/09/17 10:11:12 [error] 30542#30542: MEMSTORE:00: multimsg 00000000022B3AC0 timeout!!
2018/09/17 10:11:18 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003F25B00 timeout!!
2018/09/17 10:11:20 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002AD7D10 timeout!!
2018/09/17 10:11:46 [error] 19964#19964: MEMSTORE:01: tried adding WAITING chanhead 00000000040DD5F0 /event~PNGaCecNjqdAT9Yb4n6TCgMXdEhOTFeW9A~private~KLx2PN61EDlAy8tc to chanhead_gc. why?
2018/09/17 10:11:46 [error] 19964#19964: MEMSTORE:01: tried adding WAITING chanhead 00000000040DDA00 /event~PNGaCecNjqdAT9Yb4n6TCgMXdEhOTFeW9A to chanhead_gc. why?
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 00000000029A9AD0 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003258370 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 000000000334F9A0 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 00000000023DF1A0 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 00000000024F2EF0 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003559B70 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 00000000034C88F0 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 00000000032D2D60 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 000000000282F7B0 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002936E70 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003BAA940 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 000000000362D580 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002E54650 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002B80D80 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003F259D0 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 00000000027EB0C0 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 00000000036D64F0 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 0000000004615DC0 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002410170 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 00000000045EF9D0 timeout!!
2018/09/17 10:11:50 [error] 19964#19964: MEMSTORE:01: multimsg 00000000039EC440 timeout!!
2018/09/17 10:11:51 [error] 19964#19964: MEMSTORE:01: multimsg 00000000037D3F00 timeout!!
2018/09/17 10:12:32 [error] 19964#19964: MEMSTORE:01: multimsg 000000000365EFF0 timeout!!
2018/09/17 10:12:32 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002BF32B0 timeout!!
2018/09/17 10:12:32 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002CF87E0 timeout!!
2018/09/17 10:12:32 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003B67820 timeout!!
2018/09/17 10:12:32 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003D38F10 timeout!!
2018/09/17 10:12:32 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003B7D0C0 timeout!!
2018/09/17 10:12:44 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002D3F5B0 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 000000000244CCD0 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 00000000030D4080 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003BAD960 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 00000000031719D0 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 00000000031FC390 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 00000000043E0FB0 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 00000000043F3D00 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 00000000034A4680 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003326470 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003FA8090 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 000000000294DCA0 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 00000000045EFCB0 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 00000000025EBA20 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 00000000027E39C0 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 000000000314CF10 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002929A80 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 00000000036FB9E0 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 00000000034411D0 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 00000000032EFC90 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 00000000036C6610 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003AC3030 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003460DD0 timeout!!
2018/09/17 10:12:52 [error] 19964#19964: MEMSTORE:01: multimsg 00000000040E4970 timeout!!
2018/09/17 10:12:53 [error] 19964#19964: MEMSTORE:01: multimsg 00000000038149B0 timeout!!
2018/09/17 10:13:02 [error] 30542#30542: MEMSTORE:00: tried adding WAITING chanhead 0000000002250BA0 /event~WmfqPktv2tvy4psVRblNolLLKMKMEj6npQ to chanhead_gc. why?
2018/09/17 10:13:14 [error] 19964#19964: MEMSTORE:01: multimsg 0000000004617590 timeout!!
2018/09/17 10:13:14 [error] 19964#19964: MEMSTORE:01: multimsg 0000000002CB2300 timeout!!
2018/09/17 10:13:14 [error] 19964#19964: MEMSTORE:01: multimsg 0000000003173E90 timeout!!
2018/09/17 10:13:14 [error] 19964#19964: MEMSTORE:01: multimsg 000000000439E410 timeout!!
2018/09/17 10:13:14 [error] 19964#19964: MEMSTORE:01: multimsg 00000000046176E0 timeout!!
2018/09/17 10:13:14 [error] 19964#19964: MEMSTORE:01: multimsg 0000000004617890 timeout!!
2018/09/17 10:13:25 [error] 19964#19964: *1869098 upstream prematurely closed connection while reading response header from upstream, client: 195.35.135.113, server: *.site.com, request: "POST /chats?method=post&secret_key=ZORJ0dnytmM HTTP/1.1", upstream: "http://10.133.28.228:8080/chats?method=post&secret_key=ZORJ0dnytmM", host: "group.site.com", referrer: "https://group.site.com/watch/ZORJ0dnytmM/"
2018/09/17 10:13:32 [alert] 19962#19962: worker process 30542 exited on signal 9
2018/09/17 10:13:32 [alert] 19962#19962: worker process 19964 exited on signal 9

ivanovv avatar Sep 17 '18 10:09 ivanovv

https://github.com/slact/nchan/blob/13e14221ce3e252f4d7c7c47d2dc8d747ec88ae3/src/util/nchan_output.c#L401-L432

Looks like we want to make sure that if the tag is inactive (t[i] == -1) that the tagactive is not equal to any value of i.

This is a bit obscure to me, do we want to make sure that tagactive is -1 and not any other number, or do we specifically want to make sure that it is not the i?

ivanovv avatar Sep 17 '18 12:09 ivanovv

Hey @slact, we had two crashes a couple of days ago.

Is there a workaround to disable this assertion or the code that generates Etag header? We don't store messages, so most of the time it will be 0 or -. Or with multiplexing that plays a role and we can't just try to disable etag generation?

ivanovv avatar Sep 19 '18 22:09 ivanovv

I believe I just fixed this with commit b6513e6eefe910edae6bc8606902b4e8ac279f66 in master. If you can, please test with a build from master.

slact avatar Oct 01 '18 17:10 slact

Rebuilt with latest master and nginx stable

Got the same error:

ker process: /home/deploy/nginx/nchan/src/util/nchan_output.c:424: msgtag_to_strptr: Assertion `id->tagactive != i' failed.
2018/10/08 07:31:54 [alert] 32259#32259: worker process 32261 exited on signal 6 (core dumped)
2018/10/08 07:31:54 [alert] 32259#32259: worker process 19581 exited on signal 11 (core dumped)
2018/10/08 07:31:54 [alert] 32259#32259: worker process 19583 exited on signal 11 (core dumped)
2018/10/08 07:31:54 [alert] 32259#32259: worker process 19585 exited on signal 11 (core dumped)
2018/10/08 07:31:54 [alert] 32259#32259: worker process 19587 exited on signal 11 (core dumped)
2018/10/08 07:31:54 [alert] 32259#32259: worker process 19589 exited on signal 11 (core dumped)
2018/10/08 07:31:54 [alert] 32259#32259: worker process 19591 exited on signal 11 (core dumped)
2018/10/08 07:31:55 [alert] 32259#32259: worker process 19593 exited on signal 11 (core dumped)
2018/10/08 07:31:55 [alert] 32259#32259: worker process 19595 exited on signal 11 (core dumped)
2018/10/08 07:31:55 [alert] 32259#32259: worker process 19597 exited on signal 11 (core dumped)
2018/10/08 07:31:55 [alert] 32259#32259: worker process 19600 exited on signal 11 (core dumped)
2018/10/08 07:31:55 [alert] 32259#32259: worker process 19602 exited on signal 11 (core dumped)
2018/10/08 07:31:55 [alert] 32259#32259: worker process 19605 exited on signal 11 (core dumped)
2018/10/08 07:31:55 [alert] 32259#32259: worker process 19608 exited on signal 11 (core dumped)
2018/10/08 07:31:56 [alert] 32259#32259: worker process 19610 exited on signal 11 (core dumped)
2018/10/08 07:31:56 [alert] 32259#32259: worker process 19612 exited on signal 11 (core dumped)
2018/10/08 07:31:56 [alert] 32259#32259: worker process 19614 exited on signal 11 (core dumped)
2018/10/08 07:31:56 [alert] 32259#32259: worker process 19616 exited on signal 11 (core dumped)
2018/10/08 07:31:56 [alert] 32259#32259: worker process 19618 exited on signal 11 (core dumped)
2018/10/08 07:31:56 [alert] 32259#32259: worker process 19620 exited on signal 11 (core dumped)
2018/10/08 07:31:56 [alert] 32259#32259: worker process 19622 exited on signal 11 (core dumped)
2018/10/08 07:31:56 [alert] 32259#32259: worker process 19624 exited on signal 11 (core dumped)
2018/10/08 07:32:14 [error] 32260#32260: MEMSTORE:00: multimsg 00000000021DC1D0 timeout!!
2018/10/08 07:32:14 [error] 32260#32260: MEMSTORE:00: multimsg 00000000022A32A0 timeout!!
2018/10/08 07:32:14 [error] 32260#32260: MEMSTORE:00: multimsg 00000000022AA3C0 timeout!!

Is there anything I can do to help you investigate this issue?

Maybe ssh'ing to prod can help?

ivanovv avatar Oct 08 '18 10:10 ivanovv

this is pub endpoint log around crash time:

nchan format:

10.133.18.90 [08/Oct/2018:07:31:50 +0000] webinar~AK1AfZjoRQcap8Fx3s0iisAz8kMMkSgFGg~admin~PiUDdF5z0FtdB7RNjWcbRw - http 1538983905:0 -
10.133.18.90 [08/Oct/2018:07:31:50 +0000] webinar~xaaUvrQ6Inm9MguYoSk64eJAc1ye-bhOfw~admin~Ni51LWCAkNFqvp8zLb1TQg - http 1538983905:0 -
10.133.18.90 [08/Oct/2018:07:31:51 +0000] webinar~AK1AfZjoRQcap8Fx3s0iisAz8kMMkSgFGg~admin~PiUDdF5z0FtdB7RNjWcbRw - http 1538983906:0 -
10.133.18.90 [08/Oct/2018:07:31:51 +0000] webinar~xaaUvrQ6Inm9MguYoSk64eJAc1ye-bhOfw - http 1538983906:0 -
10.133.28.228 [08/Oct/2018:07:31:52 +0000] webinar~xaaUvrQ6Inm9MguYoSk64eJAc1ye-bhOfw - http 1538983907:0 -
10.133.28.228 [08/Oct/2018:07:31:52 +0000] webinar~xaaUvrQ6Inm9MguYoSk64eJAc1ye-bhOfw - http 1538983907:1 -
10.133.28.228 [08/Oct/2018:07:31:53 +0000] webinar~xaaUvrQ6Inm9MguYoSk64eJAc1ye-bhOfw - http 1538983908:0 -
10.133.18.90 [08/Oct/2018:07:31:55 +0000] webinar~jIfEdCHHOeQRfrUZl5zNcxDVN96BRvSWRA~admin~4da5vX4hKP1doO7babBobg - http 1538983910:0 -
10.133.18.90 [08/Oct/2018:07:31:56 +0000] webinar~jIfEdCHHOeQRfrUZl5zNcxDVN96BRvSWRA~admin~4da5vX4hKP1doO7babBobg - http 1538983911:0 -
10.133.18.90 [08/Oct/2018:07:31:56 +0000] webinar~7Hj857uwXMLpRzvTDGkL7Tn6uJovwOTQqA~admin~2DxLZ6WljbjEn2q9ISJB0A - http 1538983911:0 -
10.133.18.90 [08/Oct/2018:07:31:59 +0000] webinar~AK1AfZjoRQcap8Fx3s0iisAz8kMMkSgFGg - http 1538983914:0 -
10.133.28.228 [08/Oct/2018:07:31:59 +0000] webinar~AK1AfZjoRQcap8Fx3s0iisAz8kMMkSgFGg - http 1538983914:1 -
10.133.28.228 [08/Oct/2018:07:31:59 +0000] webinar~AK1AfZjoRQcap8Fx3s0iisAz8kMMkSgFGg - http 1538983914:2 -
10.133.18.90 [08/Oct/2018:07:31:59 +0000] webinar~AK1AfZjoRQcap8Fx3s0iisAz8kMMkSgFGg - http 1538983914:3 -
10.133.28.228 [08/Oct/2018:07:32:05 +0000] webinar~xaaUvrQ6Inm9MguYoSk64eJAc1ye-bhOfw - http 1538983920:0 -

Apache format:

10.133.18.90 - - [08/Oct/2018:07:31:50 +0000] "POST /pub/webinar~AK1AfZjoRQcap8Fx3s0iisAz8kMMkSgFGg~admin~PiUDdF5z0FtdB7RNjWcbRw HTTP/1.1" 201 101 "-" "http.rb/3.0.0"
10.133.18.90 - - [08/Oct/2018:07:31:50 +0000] "POST /pub/webinar~xaaUvrQ6Inm9MguYoSk64eJAc1ye-bhOfw~admin~Ni51LWCAkNFqvp8zLb1TQg HTTP/1.1" 201 101 "-" "http.rb/3.0.0"
10.133.18.90 - - [08/Oct/2018:07:31:51 +0000] "POST /pub/webinar~AK1AfZjoRQcap8Fx3s0iisAz8kMMkSgFGg~admin~PiUDdF5z0FtdB7RNjWcbRw HTTP/1.1" 201 101 "-" "http.rb/3.0.0"
10.133.18.90 - - [08/Oct/2018:07:31:51 +0000] "POST /pub/webinar~xaaUvrQ6Inm9MguYoSk64eJAc1ye-bhOfw HTTP/1.1" 201 101 "-" "http.rb/3.0.0"
10.133.28.228 - - [08/Oct/2018:07:31:52 +0000] "POST /pub/webinar~xaaUvrQ6Inm9MguYoSk64eJAc1ye-bhOfw HTTP/1.1" 201 101 "-" "http.rb/3.0.0"
10.133.28.228 - - [08/Oct/2018:07:31:52 +0000] "POST /pub/webinar~xaaUvrQ6Inm9MguYoSk64eJAc1ye-bhOfw HTTP/1.1" 201 101 "-" "http.rb/3.0.0"
10.133.28.228 - - [08/Oct/2018:07:31:53 +0000] "POST /pub/webinar~xaaUvrQ6Inm9MguYoSk64eJAc1ye-bhOfw HTTP/1.1" 201 101 "-" "http.rb/3.0.0"
10.133.18.90 - - [08/Oct/2018:07:31:55 +0000] "POST /pub/webinar~jIfEdCHHOeQRfrUZl5zNcxDVN96BRvSWRA~admin~4da5vX4hKP1doO7babBobg HTTP/1.1" 201 101 "-" "http.rb/3.0.0"
10.133.18.90 - - [08/Oct/2018:07:31:56 +0000] "POST /pub/webinar~jIfEdCHHOeQRfrUZl5zNcxDVN96BRvSWRA~admin~4da5vX4hKP1doO7babBobg HTTP/1.1" 201 101 "-" "http.rb/3.0.0"
10.133.18.90 - - [08/Oct/2018:07:31:56 +0000] "POST /pub/webinar~7Hj857uwXMLpRzvTDGkL7Tn6uJovwOTQqA~admin~2DxLZ6WljbjEn2q9ISJB0A HTTP/1.1" 201 101 "-" "http.rb/3.0.0"
10.133.18.90 - - [08/Oct/2018:07:31:59 +0000] "POST /pub/webinar~AK1AfZjoRQcap8Fx3s0iisAz8kMMkSgFGg HTTP/1.1" 201 102 "-" "http.rb/3.0.0"
10.133.28.228 - - [08/Oct/2018:07:31:59 +0000] "POST /pub/webinar~AK1AfZjoRQcap8Fx3s0iisAz8kMMkSgFGg HTTP/1.1" 201 102 "-" "http.rb/3.0.0"
10.133.28.228 - - [08/Oct/2018:07:31:59 +0000] "POST /pub/webinar~AK1AfZjoRQcap8Fx3s0iisAz8kMMkSgFGg HTTP/1.1" 201 102 "-" "http.rb/3.0.0"
10.133.18.90 - - [08/Oct/2018:07:31:59 +0000] "POST /pub/webinar~AK1AfZjoRQcap8Fx3s0iisAz8kMMkSgFGg HTTP/1.1" 201 102 "-" "http.rb/3.0.0"
10.133.28.228 - - [08/Oct/2018:07:32:05 +0000] "POST /pub/webinar~xaaUvrQ6Inm9MguYoSk64eJAc1ye-bhOfw HTTP/1.1" 202 101 "-" "http.rb/3.0.0"

ivanovv avatar Oct 08 '18 10:10 ivanovv

Maybe ssh'ing to prod can help?

Probably the best way to handle this. Email me at [email protected] to set this up.

slact avatar Oct 09 '18 00:10 slact

Any Update on this? I got a similar error

nginx_1                | nginx: worker process: /tmp/nchan-1.2.3/src/util/nchan_output.c:424: msgtag_to_strptr: Assertion `id->tagactive != i' failed.
nginx_1                | 2018/12/12 03:41:32 [notice] 1#0: signal 17 (SIGCHLD) received from 11
nginx_1                | 2018/12/12 03:41:32 [alert] 1#0: worker process 11 exited on signal 6
nginx_1                | 2018/12/12 03:41:32 [alert] 1#0: worker process 11 exited on signal 6
nginx_1                | 2018/12/12 03:41:32 [notice] 1#0: start worker process 12
nginx_1                | 2018/12/12 03:41:32 [notice] 12#0: nchan: Redis node redis:6379 connected
nginx_1                | 2018/12/12 03:41:32 [notice] 12#0: nchan: Redis server redisdb: ready

qhenkart avatar Dec 12 '18 04:12 qhenkart

In this specific case the bug was manifested itself because of the specific log format, it turns out one cannot use arbitrary nchan variables in the log format directive.

Can you paste your nginx config (nchan related part of it)?

And another thing – it looks like you have only one nginx worker process, that's a good thing, if there are several worker processes and one segfaults, then nchan will be broken for all processes.

ivanovv avatar Dec 12 '18 11:12 ivanovv

  upstream redisdb {
    nchan_redis_server ${REDIS_URI};
  }

  location sub/regex/ {
    nchan_subscriber;
    nchan_subscriber_first_message newest;
    nchan_subscriber_timeout 150;
    nchan_subscriber_compound_etag_message_id on;
    nchan_channel_id several multiplexed ids
    nchan_redis_pass redisdb;
  }

location ~ ^/pub/somepath {
    nchan_publisher;
    nchan_channel_id $1;
    nchan_store_messages off;
    add_header Cache-Control 'no-cache, no-store' always;
    nchan_redis_pass redisdb;
  }

@ivanovv

qhenkart avatar Dec 14 '18 03:12 qhenkart

We are also seeing this message;

Assertion failed: id->tagactive != i (/tmp/nchan-1.2.7/src/util/nchan_output.c: msgtag_to_strptr: 424)
2020/12/30 13:04:47 [alert] 1#1: worker process 9 exited on signal 6 (core dumped)

We do not use Redis and are already running multiple single-threaded instances of nchan due to other bugs. We put a load balancer in front of it to mitigate for now but the servers keep on crashing from time to time with the error above.

Our logs are also spammed with the notices below, maybe it is related.

2020/12/30 13:11:22 [error] 9#9: MEMSTORE:00: tried adding WAITING chanhead 000055A47866ADC0 m/channel1/channel2/channel3/channel4 to chanhead_gc. why?

sg-jorrit avatar Dec 30 '20 13:12 sg-jorrit