Server crashes regularly
Hey, hey @timurhai
The issues I describe here were initially mentioned in https://github.com/CGRU/cgru/issues/579. I work for the same studio and currently take care of render-farm-related issues.
Context
Afanasy version: 3.3.0
Studio size: 5 locations across two countries
Afanasy server: 1 at the headquarters, which distributes the jobs to all other locations.
afserver runs in an AlmaLinux 9.4 VM
Jobs: 3000-4000 on average
Render clients: 1042 - most on CentOS 7.9, some on AlmaLinux 9.4
Monitoring clients: 125 - most on CentOS 7.9, some on Windows 10, a few on AlmaLinux 9.4
For monitoring, we use an in-house-built tool which interacts with the Afanasy Python API to query data from the server and manage jobs.
Troubleshooting
After spending quite some time further debugging the warnings and errors, it looks like I was at least able to figure out the cause of the EPOLLERR messages as well as most of the "Client has NOT closed socket first" warnings.
I increased the timeout here to 10 seconds https://github.com/CGRU/cgru/blob/402d0f71913cd900225dc10b26af1c5e0567dc59/afanasy/src/server/socketsprocessing.cpp#L556
This greatly reduces the number of warnings and errors in the log, so it did help, I suppose. Given the size of the studio, it also makes some sense to me that 2 seconds could be a bit too tight a timeout here. By now we also know that most of these warnings come from our Windows workstations - no clue yet why, and I suspect that's an issue we created ourselves. Let me know if changing this value could have unwanted side effects.
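In case it helps to narrow this down: one crude client-side experiment would be to open connections to the server and drop them abruptly, to see whether that alone produces the same EPOLLERR entries. This is only a sketch with plain Python sockets; the hostname is a placeholder, the default afserver port 51000 is assumed, and whether this resembles what our Windows workstations actually do is pure speculation:

```python
# Sketch: open TCP connections to afserver and close them with an RST,
# to check whether abruptly dropped connections trigger EPOLLERR on the server.
import socket
import struct

SERVER = "afserver.example.com"  # placeholder hostname
PORT = 51000                     # assumed default afserver port

for _ in range(100):
    s = socket.create_connection((SERVER, PORT), timeout=5)
    # SO_LINGER with a zero timeout makes close() send a TCP RST
    # instead of the normal FIN shutdown sequence.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    s.close()
```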
Remaining issues
While this reduces the number of crashes a bit, they still persist and seem to be memory related. Below are some examples I collected over the last week. So far I have found no way to reproduce these kinds of crashes - if you have any ideas on how to pinpoint the actual issue further, let me know. My C++ knowledge is pretty limited, so I haven't done any deeper troubleshooting on this yet.
Wed 21 May 16:55.26: WARNING Client has NOT closed socket first: <ip-address>:<port> SFD:126 S:SWaiting REQ: TJSON[100]: <ip-address>:<port> ANS: TJSON[21442336]: Empty address
Wed 21 May 16:55.26: Job registered: "<job-name>"[3440]: <user>@<workstation>[18] - 16434 bytes.
corrupted size vs. prev_size
corrupted size vs. prev_size
corrupted size vs. prev_size
corrupted size vs. prev_size
Fri 23 May 03:52.09: WARNING Client has NOT closed socket first: <ip-address>:<port> SFD:21 S:SWaiting REQ: TJSON[95]: <ip-address>:<port> ANS: TJSON[95446]: Empty address
Fri 23 May 03:54.13: Job registered: "<job-name>"[466]: <user>@<workstation>[0] - 73052 bytes.
Fri 23 May 03:54.16: Job registered: "<job-name>"[573]: <user>@<workstation>[42] - 42536 bytes.
Fri 23 May 03:55.34: Deleting a job: "<job-name>"[1939]: <user>@<workstation>[2] - 10024 bytes.
malloc(): invalid next size (unsorted)
Fri 23 May 06:36.05: WARNING Client has NOT closed socket first: <ip-address>:<port> SFD:17 S:SWaiting REQ: TJSON[64]: <ip-address>:<port> ANS: TJSON[21130]: Empty address
Fri 23 May 06:45.32: Deleting a job: "<job-name>"[526]: <user>@<workstation>[0] - 23536 bytes.
Fri 23 May 06:45.32: Deleting a job: "<job-name>"[739]: <user>@<workstation>[76] - 74743 bytes.
free(): invalid next size (normal)
Fri 23 May 19:40.58: WARNING Client has NOT closed socket first: <ip-address>:<port> SFD:5 S:SWaiting REQ: TJSON[125]: <ip-address>:<port> ANS: TJSON[114787]: Empty address
Fri 23 May 19:41.52: Job registered: "<job-name>"[42]: <user>@<workstation>[11] - 15467 bytes.
Fri 23 May 19:41.52: Job registered: "<job-name>"[151]: <user>@<workstation>[12] - 22796 bytes.
Fri 23 May 19:44.36: Job registered: "<job-name>"[575]: <user>@<workstation>[542] - 16189 bytes.
Fri 23 May 19:44.36: Job registered: "<job-name>"[650]: <user>@<workstation>[543] - 16189 bytes.
Fri 23 May 19:44.36: Job registered: "<job-name>"[735]: <user>@<workstation>[26] - 10220 bytes.
corrupted size vs. prev_size
Sat 24 May 09:25.38: Deleting a job: "<job-name>"[1570]: <user>@<workstation>[80] - 11361 bytes.
Sat 24 May 09:25.38: Deleting a job: "<job-name>"[1408]: <user>@<workstation>[81] - 11321 bytes.
Sat 24 May 09:25.38: Deleting a job: "<job-name>"[1008]: <user>@<workstation>[145] - 166139 bytes.
Sat 24 May 09:25.52: Job registered: "<job-name>"[56]: <user>@<workstation>[782] - 75996 bytes.
Sat 24 May 09:25.52: Job registered: "<job-name>"[442]: <user>@<workstation>[783] - 75927 bytes.
Sat 24 May 09:25.53: Job registered: "<job-name>"[692]: <user>@<workstation>[784] - 75927 bytes.
malloc(): invalid next size (unsorted)
ERROR Mon 26 May 18:33.58: Online render with the same name exists:
New render:
<user>@<workstation>[0] unix linux 64 CentOS Linux 7 3.3.0 <ip-address> ON
Existing render:
<user>@<workstation>[946] unix linux 64 CentOS Linux 7 3.3.0 <ip-address> ON n
Mon 26 May 18:33.58: Job registered: "<job-name>"[2705]: <user>@<workstation>[41] - 327217 bytes.
Mon 26 May 18:34.00: Job registered: "<job-name>"[2718]: <user>@<workstation>[42] - 327076 bytes.
Mon 26 May 18:34.01: Job registered: "<job-name>"[2728]: <user>@<workstation>[24] - 45067 bytes.
Mon 26 May 18:34.01: Monitor deregister: <user>@<workstation>[74] e'VERSION=2025.05.22;REGISTERED=2025-05-26 11:30:34;OS=Linux;QTBINDING=PyQt5' <ip-address>
Mon 26 May 18:34.02: Job registered: "<job-name>"[2762]: <user>@<workstation>[43] - 327217 bytes.
Mon 26 May 18:34.03: Job registered: "<job-name>"[2770]: <user>@<workstation>[1] - 72659 bytes.
Mon 26 May 18:34.03: Job registered: "<job-name>"[2782]: <user>@<workstation>[276] - 9806 bytes.
Mon 26 May 18:34.04: Job registered: "<job-name>"[2786]: <user>@<workstation>[44] - 327620 bytes.
malloc(): invalid size (unsorted)
Tue 27 May 12:52.21: WARNING Client has NOT closed socket first: <ip-address>:<port> SFD:8 S:SWaiting REQ: TJSON[120]: <ip-address>:<port> ANS: TJSON[91437]: Empty address
Tue 27 May 12:52.22: Job registered: "<job-name>"[2317]: <user>@<workstation>[166] - 11636 bytes.
Tue 27 May 12:52.24: Job registered: "<job-name>"[2321]: <user>@<workstation>[3] - 15216 bytes.
Tue 27 May 12:52.43: WARNING Client has NOT closed socket first: <ip-address>:<port> SFD:190 S:SWaiting REQ: TJSON[178]: <ip-address>:<port> ANS: TJSON[157422]: Empty address
Tue 27 May 12:52.46: Deleting a job: "<job-name>"[376]: <user>@<workstation>[0] - 29519 bytes.
Tue 27 May 12:52.46: Deleting a job: "<job-name>"[378]: <user>@<workstation>[1] - 29405 bytes.
Tue 27 May 12:52.47: Job registered: "<job-name>"[376]: <user>@<workstation>[3] - 160275 bytes.
malloc(): invalid size (unsorted)
A recent crash from today seems to have happened in the context of renders being registered:
Wed 28 May 08:28.13: Render Offline: <workstation>@<user>[248] unix linux 64 CentOS Linux 7 3.3.0 <ip-address> off N
Wed 28 May 08:28.15: Render Offline: <workstation>@<user>[302] unix linux 64 CentOS Linux 7 3.3.0 <ip-address> off N
Wed 28 May 08:28.18: Render Offline: <workstation>@<user>[210] unix linux 64 CentOS Linux 7 3.3.0 <ip-address> off N
Wed 28 May 08:28.23: Render Offline: <workstation>@<user>[146] unix linux 64 CentOS Linux 7 3.3.0 <ip-address> off N
Wed 28 May 08:28.23: Render: <workstation>@<user>[539] unix linux 64 CentOS Linux 7 3.3.0 <ip-address> ON
free(): invalid next size (normal)
Thanks for reading, and have a great day.
Hello, @philnewm !
- It could be useful to make that number of seconds configurable.
- I have not faced that memory bug. I need to be able to reproduce it in order to fix it. Can you create a simple job in Python that definitely crashes the server, or at least can crash it if you send it 100 times?
Hi @timurhai, and thanks for the reply.
I'm still trying to reproduce the bug; my current plan is this:
- Recreate the last job registered before the crash and send it from a Python script - unlikely to cause the crash on its own (see the sketch below)
- Add debug output to the server to display received requests
- Evaluate the test results and continue from there

My knowledge of debugging such scenarios is fairly limited, so if you have any other ideas for debugging approaches, let me know.
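For the first step, a minimal version of that script could look roughly like this - just a sketch based on the CGRU Python examples, where the job name, block type and task command are placeholders, and the loop follows your suggestion to send the same job many times:

```python
# Sketch: submit the same trivial job many times via the Afanasy Python API.
# Job name, block type ('generic') and the task command are placeholders.
import af

for i in range(100):
    job = af.Job('crash_repro_%03d' % i)
    block = af.Block('test_block', 'generic')
    block.setCommand('sleep 1')      # trivial task command
    block.setNumeric(1, 10, 1)       # tasks for frames 1-10, one frame per task
    job.blocks.append(block)
    job.send()
```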
Hi @philnewm ! The best way is to find out how to reproduce the bug.
Hello again @timurhai
I totally agree on that - even though I have so far not been able to reproduce the issue itself, I ran afserver with valgrind to check for memory issues, since the crashes I mentioned indicate some kind of memory corruption or similar.
So this is the command I used:
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose --log-file=valgrind-out.txt afanasy/bin/afserver.9
As soon as I started sending test jobs to the process (using the Python API) or letting our monitoring client (also using the Python API) query data from the server, valgrind started detecting issues. I will attach the full output file, but these were the most interesting lines:
==341379== Conditional jump or move depends on uninitialised value(s)
==341379== at 0x4851C76: rawmemchr (vg_replace_strmem.c:1576)
==341379== by 0x4FB10D5: _IO_str_init_static_internal (in /usr/lib64/libc.so.6)
==341379== by 0x4F86F52: __isoc99_sscanf (in /usr/lib64/libc.so.6)
==341379== by 0x4FF19C: af::processHeader(af::Msg*, int) (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x4938EF: SocketItem::readData() (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x495520: SocketsProcessing::doEpoll() (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x49569B: SocketsProcessing::ThreadFuncEpoll(void*) (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x4CE739: DlThread::thread_routine(void*) (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x4FB5C01: start_thread (in /usr/lib64/libc.so.6)
==341379== by 0x5039ED3: clone (in /usr/lib64/libc.so.6)
==341379== Uninitialised value was created by a heap allocation
==341379== at 0x4846723: operator new[](unsigned long) (vg_replace_malloc.c:725)
==341379== by 0x4E86AE: af::Msg::allocateBuffer(int, int, int) (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x4E9600: af::Msg::Msg(sockaddr_storage const*) (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x49039D: SocketItem::SocketItem(int, sockaddr_storage*) (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x491FFF: SocketsProcessing::acceptSocket(int, sockaddr_storage*) (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x4A877C: threadAcceptPort(void*, int) (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x4CE739: DlThread::thread_routine(void*) (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x4FB5C01: start_thread (in /usr/lib64/libc.so.6)
==341379== by 0x5039ED3: clone (in /usr/lib64/libc.so.6)
==341379==
==341379== Invalid read of size 1
==341379== at 0x4851C74: rawmemchr (vg_replace_strmem.c:1576)
==341379== by 0x4FB10D5: _IO_str_init_static_internal (in /usr/lib64/libc.so.6)
==341379== by 0x4F86F52: __isoc99_sscanf (in /usr/lib64/libc.so.6)
==341379== by 0x4FF19C: af::processHeader(af::Msg*, int) (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x4938EF: SocketItem::readData() (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x495520: SocketsProcessing::doEpoll() (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x49569B: SocketsProcessing::ThreadFuncEpoll(void*) (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x4CE739: DlThread::thread_routine(void*) (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x4FB5C01: start_thread (in /usr/lib64/libc.so.6)
==341379== by 0x5039ED3: clone (in /usr/lib64/libc.so.6)
==341379== Address 0xf82cac80 is 0 bytes after a block of size 16,384 alloc'd
==341379== at 0x4846723: operator new[](unsigned long) (vg_replace_malloc.c:725)
==341379== by 0x4E86AE: af::Msg::allocateBuffer(int, int, int) (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x4E9600: af::Msg::Msg(sockaddr_storage const*) (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x49039D: SocketItem::SocketItem(int, sockaddr_storage*) (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x491FFF: SocketsProcessing::acceptSocket(int, sockaddr_storage*) (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x4A877C: threadAcceptPort(void*, int) (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x4CE739: DlThread::thread_routine(void*) (in /raid/standards/base-setups/cgru/3.3.0/afanasy/bin/afserver.9)
==341379== by 0x4FB5C01: start_thread (in /usr/lib64/libc.so.6)
==341379== by 0x5039ED3: clone (in /usr/lib64/libc.so.6)
Let me know if that output already helps you identify any issues.
I'm still working my way through this and checking whether I can find a way to tie it directly to the crashes. Just wanted to keep this issue updated.
Hello, @philnewm ! valgrind shows lots of errors, especially ones that could cause a memory leak. We do not have a memory leak, but we have a double free() or something similarly serious that causes the program to hang. We should see the logs from just before a hang, and I think a common backtrace will be enough (we just need to know the latest function call and when it was called).
Btw, what is going on with memory on the server before a hang? Maybe it really is a memory leak?
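If you do not track that yet, even a small script running next to the server can show whether resident memory grows before a hang. This is only a rough sketch; it assumes a Linux host and that `pgrep -x afserver` finds the server process:

```python
# Rough sketch: log the afserver resident memory (VmRSS) once a minute.
# Assumes Linux and that the server process is named 'afserver'.
import subprocess
import time

pid = subprocess.check_output(['pgrep', '-x', 'afserver']).split()[0].decode()

while True:
    with open('/proc/%s/status' % pid) as f:
        for line in f:
            if line.startswith('VmRSS'):
                print(time.strftime('%Y-%m-%d %H:%M:%S'), line.strip(), flush=True)
    time.sleep(60)
```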
I created a simple case that seems to reproduce the above valgrind errors from Msg::allocateBuffer: https://github.com/grische/cgru/tree/valgrind-repro-case
We do not need to fix all the errors that valgrind can show, especially with such flags. For example, it shows all uninitialized variables, but that by itself is not an error; the error is reading a variable without initialization. Try running your favorite program, one you consider stable, with such valgrind flags, and you will see lots of messages that look like errors.
We need to be able to reproduce the bug. Then we will find the error, maybe with valgrind, maybe not. But we should not read any logs or fix anything without a reproducible bug.