
Afanasy Crashing

Open eberrippe opened this issue 1 year ago • 3 comments

Hi @timurhai, about every second day we encounter a small afanasy crash. We have tried to figure out the cause but have not found anything yet. Maybe you have an idea of what we can do to solve it. These are log excerpts with the most common errors we see:

Fri 05 May 18:13.07: Job registered: yyyy"[3554]: xxxx@xxxxx[20] - 11402 bytes.
Fri 05 May 18:13.55: ERROR   EPOLLERR: aereaaaaSFD:20 S:Processing REQ: TJSON[2182]: 123412
SIG INT
Fri 05 May 19:56.02: ERROR   reconnectTask: numtask >= numTasks ( 20 >= 1 )
Fri 05 May 19:56.02: Render:  yyyx@yyyy[37] unix linux 1234 ON 
Fri 05 May 19:56.03: WARNING Client has NOT closed socket first:1234 SFD:19 S:SWaiting REQ: TJSON[215]: 2324424 ANS: TJSON[185308]: Empty address
Fri 05 May 19:56.07: Render: "yyyyyy" - ZOMBIETIME
Fri 05 May 19:56.07: Render Offline: yyyyy@yyyyyy[913] unix linux 123412412 off
Fri 05 May 19:56.09: WARNING Client has NOT closed socket first: 1234124123 SFD:729 S:SWaiting REQ: TJSON[37]: 1234 ANS: TJSON[23792778]: Empty address
Fri 05 May 19:56.09: WARNING Client has NOT closed socket first: 1234123 SFD:13 S:SWaiting REQ: TJSON[37]: 12314123 ANS: TJSON[23792778]: Empty address
corrupted size vs. prev_size
Tue 09 May 16:04.28: Deleting a job: "xxxxxxa"[2120]: yyyy@yyyyy[0] - 68317 bytes.
Tue 09 May 16:04.28: Deleting a job: "xxxxx"[4658]: yyyy@yyyyyyy[1] - 39174 bytes.
malloc(): invalid size (unsorted)
Tue 09 May 16:28.57: Job registered: yxxxyxyxy"[4097]: xxxxx@yyyyyyyy[1148] - 8472 bytes.
Tue 09 May 16:28.57: Job registered: "xxxxxxx"[4098]: yyyy@yyyyyy.[2] - 9550 bytes.
malloc(): invalid size (unsorted)
malloc(): invalid size (unsorted)
malloc(): invalid size (unsorted)
malloc(): invalid size (unsorted)
malloc(): invalid size (unsorted)
malloc(): invalid size (unsorted)
malloc(): invalid size (unsorted)
malloc(): invalid size (unsorted)
Wed 10 May 19:20.11: Deleting a job: xxxx: xxx@yyyy[992] - 8596 bytes.
Wed 10 May 19:20.11: Deleting a job: "xxxx"[5362]: xyyy@yyyy[1101] - 8489 bytes.
ERROR Wed 10 May 19:20.23: Online render with the same name exists:
New render:
 yyyyy@yyyyy[0] unix linux 12341 ON 
Existing render:
 yyyyyx@yyyyy[425] unix linux 1234 ON  P
Wed 10 May 19:20.32: WARNING Client has NOT closed socket first: 123 SFD:6 S:SWaiting REQ: TJSON[117]: 123 ANS: TJSON[143406]: Empty address
Wed 10 May 19:20.48: Render Offline:  yyyyy.@yyyyy[344] unix linux 23456 off
Wed 10 May 19:20.55: Job registered: "xxxxx"[569]: yyyyr@yyyyyy[8] - 5384 bytes.
malloc(): invalid size (unsorted)
Thu 11 May 12:14.08: WARNING Client has NOT closed socket first: 123 SFD:930 S:SWaiting REQ: TJSON[37]: 123:123ANS: TJSON[20969412]: Empty address
Thu 11 May 12:14.09: WARNING Client has NOT closed socket first: 123 SFD:20 S:SWaiting REQ: TJSON[37]: 123 ANS: TJSON[20969412]: Empty address
Thu 11 May 12:14.12: ERROR   reconnectTask: numblock >= blocksnum ( 1 >= 1 )
Thu 11 May 12:14.12: Render:  yyyyyx@yyyyy[27] unix linux 1234 ON 
ERROR Thu 11 May 12:14.26: Online render with the same name exists:
New render:
 tttttt@yyyyyy[0] unix linux 2345 ON 
Existing render:
 tttttttt@yyyyy[425] unix linux 2345 ON  P
corrupted size vs. prev_size

Thanks a lot and best

Jan

eberrippe, May 11 '23 10:05

Hello! Very strange log. I have not seen such errors before. So, you can't reproduce the bug? What is the version and OS, and how many clients do you have?

Do all clients fail to close the socket, or just some? Try to find the "bad" clients. Maybe some web browser does not close the socket. Try not to use the WebGUI at all.
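One way to look for "bad" clients is to count the `Client has NOT closed socket first` warnings per client in the server log. The following is only a rough sketch: it assumes the token right after `first:` identifies the client (in the excerpts above that field is anonymized), and the function name `count_unclosed_clients` is made up for illustration.

```python
import re
from collections import Counter

# Matches the warning seen in the log excerpts and captures the token
# that follows "first:". Assumption: that token identifies the client.
WARNING_PATTERN = re.compile(r"Client has NOT closed socket first:\s*(\S+)")


def count_unclosed_clients(log_lines):
    """Count 'NOT closed socket first' warnings per client token."""
    counts = Counter()
    for line in log_lines:
        match = WARNING_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of the server log (for example `count_unclosed_clients(open(path))`, with `path` pointing at wherever the afserver output is stored) and looking at `most_common()` would show whether a few clients account for most of the warnings.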

timurhai, May 14 '23 16:05

Sadly, I cannot reproduce the bug intentionally. We run afserver on:

NAME="AlmaLinux"
VERSION="9.1 (Lime Lynx)"
ID="almalinux"

We have a total of 951 hosts.

How can I find out about the socket state? We only use the WebGUI very occasionally. What is it about the WebGUI that causes problems? We developed some connections to the afserver ourselves. Is there anything you can advise us to keep in mind when doing so?

Thanks, Jan

eberrippe, May 15 '23 10:05

If you connect to afserver, your client should be the one to close the socket first, right after it has received the server's answer.

https://cgru.readthedocs.io/en/latest/afanasy/server.html#time-wait
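As a rough illustration of "client closes first", here is a minimal sketch of a request/answer exchange in which the client reads the complete answer and then closes its socket without waiting for the server to hang up. The 4-byte length prefix is a hypothetical framing chosen only to keep the example self-contained; it is NOT the real afserver wire format, and `request_and_close` and `recv_exact` are made-up names.

```python
import socket
import struct


def recv_exact(sock, size):
    """Read exactly `size` bytes, or raise if the peer closes early."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError("peer closed before the full answer arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def request_and_close(host, port, payload, timeout=10.0):
    """Send one request, read the whole answer, then close on the client side."""
    # Hypothetical framing: 4-byte big-endian length, then the body.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(struct.pack("!I", len(payload)) + payload)
        (length,) = struct.unpack("!I", recv_exact(sock, 4))
        answer = recv_exact(sock, length)
    # Leaving the `with` block closes the socket here, as soon as the
    # answer is complete, so the client is the side that closes first.
    return answer
```

The important part is that the client knows when the answer is complete (here via the length prefix) instead of reading until the server closes the connection. A client that reads until end-of-stream forces the server to close first.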

Web browsers sometimes do not close sockets. With such a large number of clients, try not to use the WebGUI at all. (Somebody can open it just to check something, then forget to close it, and it will periodically produce TIME-WAIT sockets. But maybe this is not your case at all.)
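To check the socket state on the server host, one option on Linux is to count TCP states for the server port from `/proc/net/tcp` and `/proc/net/tcp6`. In TCP, the side that closes a connection first is the one left in TIME-WAIT, so many TIME-WAIT entries on the server port suggest clients that do not close first. The sketch below is a minimal example under those assumptions; the function names are made up, and the port to pass in is whatever your afserver listens on.

```python
from collections import Counter

# Hex state codes used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(lines, local_port):
    """Count socket states for entries whose local port is `local_port`."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip the header row and anything malformed.
        if len(fields) < 4 or ":" not in fields[1]:
            continue
        try:
            port = int(fields[1].rsplit(":", 1)[1], 16)
        except ValueError:
            continue
        if port != local_port:
            continue
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts


def read_tcp_table(paths=("/proc/net/tcp", "/proc/net/tcp6")):
    """Return all lines of the kernel TCP tables that can be read."""
    lines = []
    for path in paths:
        try:
            with open(path) as handle:
                lines.extend(handle)
        except OSError:
            pass  # table not available on this system
    return lines
```

Calling `count_tcp_states(read_tcp_table(), port)` on the server host every few minutes and watching the `TIME_WAIT` count would show whether it grows when the WebGUI or custom clients are in use.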

timurhai, May 15 '23 11:05