
RUN_ONCE with many flows fails

bagder opened this issue 7 years ago • 17 comments

./client_http_run_once -n 80 bsd10.fh-muenster.de

and

./client_http_run_once -n 80 localhost

both show the same symptom (bsd10 is at a 42 ms RTT from my test machine). It quickly handles the first 55-56 flows and then gets stuck with 24 or 25 flows remaining, spinning at 100% CPU until, after some time, on_error is called and the application exits.

[25 flows to go]
Waiting...
[25 flows to go]
Waiting...
[25 flows to go]
Waiting...
(... the same two lines repeat ...)
on_error

Running on Linux kernel 4.9.2, no working IPv6, no SCTP.

bagder avatar Feb 10 '17 15:02 bagder

Hey Daniel, I'm facing the same problem and will try to solve it.

weinrank avatar Feb 13 '17 14:02 weinrank

I found the problem, it was a bug in my application, not related to the library.

weinrank avatar Feb 14 '17 12:02 weinrank

Hey Daniel,

https://github.com/NEAT-project/neat/blob/master/examples/client_http_run_once.c#L110

Here you call neat_stop_event_loop - why? Is this useful if a ctx is used by several flows?

weinrank avatar Apr 19 '17 12:04 weinrank

I honestly can't remember any good reason for that. Possibly I added it during some debug session to see if it helped and forgot to remove it...!

bagder avatar Apr 19 '17 12:04 bagder

Should be fixed by #334

weinrank avatar Jun 20 '17 12:06 weinrank
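
For readers following along: the suspicion above is that the example stops the shared event loop as soon as the first flow closes. A minimal sketch of the pattern the fix appears to aim for (judging from the "no flows left for ctx, stopping event loop" log line further down); the flow counter is made up for illustration:

#include <neat.h>

/* Hypothetical sketch: stop the loop only when the last flow on the
 * shared context has closed, instead of on the first on_close(). */
static int flows_left_on_ctx = 0;  /* flows still open on this ctx */

static neat_error_code on_close(struct neat_flow_operations *opCB)
{
    if (--flows_left_on_ctx == 0) {
        /* last flow for this ctx: now it is safe to stop the loop */
        neat_stop_event_loop(opCB->ctx);
    }
    return NEAT_OK;
}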

It works differently now, but it still doesn't work correctly.

./client_http_run_once -n 80 bsd10.fh-muenster.de on my machine reaches further than before, but it still gets stuck and never exits properly. The final lines of the output (before I pressed control-c) are below.

The cutoff for me: -n 65 works, but -n 66 hangs. The last few lines in the hanging -n 66 case:

[1 flows to go]
Waiting...
[1 flows to go]
Waiting...
[1 flows to go]
Waiting...
(... the same two lines repeat ...)
[1 flows to go]
[Flow 29 ended, 0 to go, 0 for ctx 9]
on_close - no flows left for ctx, stopping event loop
[   0.890973][WRN] neat_notify_close - flow already closed - skipping
Cleanup!

Running it with gdb and breaking at that "Cleanup!" point, the stack trace looks like this:

^C
Program received signal SIGINT, Interrupt.
syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38      ../sysdeps/unix/sysv/linux/x86_64/syscall.S: No such file or directory.
(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007ffff7970aea in ?? () from /usr/lib/x86_64-linux-gnu/libuv.so.1
#2  0x00007ffff796ee34 in ?? () from /usr/lib/x86_64-linux-gnu/libuv.so.1
#3  0x00007ffff7960924 in uv_run () from /usr/lib/x86_64-linux-gnu/libuv.so.1
#4  0x00007ffff7b813ad in neat_close_loop (nc=0x6276e0) at /home/daniel/src/neat/neat_core.c:281
#5  neat_core_cleanup (nc=0x6276e0) at /home/daniel/src/neat/neat_core.c:290
#6  neat_free_ctx (nc=<optimized out>) at /home/daniel/src/neat/neat_core.c:329
#7  0x0000000000401779 in main (argc=<optimized out>, argv=<optimized out>)
    at /home/daniel/src/neat/examples/client_http_run_once.c:324
(gdb) 

bagder avatar Jul 05 '17 12:07 bagder
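
The backtrace points at where the hang sits: neat_close_loop() at neat_core.c:281 is blocked inside uv_run(), waiting for libuv handles that never close. A rough reconstruction of that shape, inferred only from the frames above and not taken from the actual NEAT source:

#include <uv.h>

/* Rough reconstruction, not the real source: uv_run(UV_RUN_DEFAULT)
 * only returns once no active handles remain, so a single TCP handle
 * that never gets closed leaves the process parked in the syscall
 * frame shown in the backtrace. */
static void close_loop_sketch(uv_loop_t *loop)
{
    /* close bookkeeping handles here, then drain their callbacks */
    uv_run(loop, UV_RUN_DEFAULT);
    uv_loop_close(loop);
}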

@bagder @naeemk I'm not able to recreate the problem. Actually, it seems to work for me. Has anyone checked this issue recently?

karlgrin avatar Dec 19 '17 16:12 karlgrin

@karlgrin I have, and I still see the problem. I used the following command on an Ubuntu 17.10 machine:

./client_http_run_once -n 80 bsd10.fh-muenster.de

weinrank avatar Dec 20 '17 12:12 weinrank

@weinrank @bagder I have continued my investigation of issue #291. When I did a trial with a timeout of 10 ms for the poll() in client_http_run_once, it seemed to work OK even for very large values. For example, I ran './client_http_run_once -n 200 bsd10.fh-muenster.de' and it took a bit less than 4 seconds to complete (output attached as outp.txt). Basically, I don't really see the problem anymore.

karlgrin avatar Jan 04 '18 13:01 karlgrin

@weinrank do you have some time to start looking at this?

karlgrin avatar Jan 09 '18 13:01 karlgrin

How did you change the timeout to 10ms?

weinrank avatar Jan 09 '18 14:01 weinrank

@weinrank I hardcoded 10 ms in the poll() call at line 293 of client_http_run_once.c: int rc = poll(fds, num_ctxs, 10)

karlgrin avatar Jan 09 '18 15:01 karlgrin
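
To make the change concrete, a minimal sketch of what the example's wait loop presumably looks like with that hardcoded timeout; the fds setup and per-context servicing are elided, and all names besides poll() are illustrative:

#include <poll.h>
#include <stdio.h>

/* Sketch of the example's top-level wait loop with the 10 ms change:
 * instead of blocking indefinitely, poll() returns after at most
 * 10 ms, so the loop re-checks all NEAT contexts even when no fd
 * ever becomes readable. */
static void wait_loop(struct pollfd *fds, int num_ctxs, int *flows_left)
{
    while (*flows_left > 0) {
        int rc = poll(fds, num_ctxs, 10);   /* timeout in milliseconds */
        if (rc < 0) {
            perror("poll");
            break;
        }
        /* service each context here, readable or not; that servicing
         * is what decrements *flows_left as flows finish */
    }
}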

Just wondering: does this really fix a bug? Can you explain why it did not work without the 10 ms timeout but does work with it?

I just want to make sure that this is not a workaround protecting against a race condition in some scenarios...

tuexen avatar Jan 09 '18 19:01 tuexen

@tuexen I don't consider this a fix for the bug, only an observation. Since I cannot explain the problem, I don't consider it fixed. I'll continue looking into this issue tomorrow. The problem is that I have not been able to replicate it myself. I'll set up a new machine with Ubuntu 17.10 tomorrow and see if I can replicate the problem on that machine.

karlgrin avatar Jan 09 '18 20:01 karlgrin

@karlgrin OK, makes sense. Thanks for the clarification.

I was also able to reproduce the problem sometimes (meaning that when I run the program multiple times, it sometimes hangs and sometimes works) using FreeBSD head, so the problem is not platform dependent. The system I was using has both IPv4 and IPv6 addresses; possibly this is relevant. When the program was hanging, I used sockstat to figure out which state the TCP connections were in. I saw about 78 or so in the state CLOSE_WAIT, which means that no close() call was issued for them. Using truss to figure out what the program was doing showed a loop of kevent() calls.

tuexen avatar Jan 10 '18 14:01 tuexen
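
As background for the CLOSE_WAIT observation: a TCP socket enters that state when the peer has closed its side but the local application has not yet called close(). A minimal standalone illustration, independent of NEAT (the address is bsd10's, taken from the comment below):

#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Connect to a server that eventually closes the connection. Until
 * we call close(fd) ourselves, sockstat/netstat shows this socket in
 * CLOSE_WAIT -- the state observed for ~78 of the example's flows. */
int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(80) };
    inet_pton(AF_INET, "212.201.121.100", &sa.sin_addr);  /* bsd10 */
    connect(fd, (struct sockaddr *)&sa, sizeof(sa));
    /* ... the peer eventually closes; we never call close(fd) ... */
    pause();  /* socket now sits in CLOSE_WAIT until the process exits */
    return 0;
}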

The problem is related to the resolver.

When the resolver is avoided (i.e., by using an IP address directly), the problem goes away. For example (towards bsd10.fh-muenster.de):

./client_http_run_once -n 85 212.201.121.100

One of the problems when opening a large number of connections is that the maximum number of open files is reached in neat_resolver.c:772. After increasing this limit (in /etc/security/limits.conf on Linux) from 1024 to 100000, the problem went away.

I'm not that familiar with the "resolver business", so I hope that someone else can take a look at this, @naeemk!

perhurt avatar Jan 26 '18 09:01 perhurt
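
For completeness, the same limit can also be inspected and raised (up to the hard limit) from within the process itself; a small standalone sketch using the standard getrlimit()/setrlimit() interface:

#include <stdio.h>
#include <sys/resource.h>

/* Raise the per-process open-file limit up to the hard limit -- the
 * limit the resolver runs into around neat_resolver.c:772 when one
 * resolver (and its sockets) is spawned per flow. */
int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);
    rl.rlim_cur = rl.rlim_max;   /* bump the soft limit up to the hard one */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }
    return 0;
}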

This is not a resolver issue per se; it is an issue with the resolver in combination with operating system limits. The resolver currently in NEAT is the first version and quite naive, meaning there are roughly zero optimizations. I haven't looked at the code in detail for a long time, but I suspect that one resolver is started for every connection. With 85 connections and multiple interfaces, this exhausts the system's per-process file descriptor limit pretty fast.

There were plans to make the resolver smarter, for example by implementing some sort of caching or doing something smarter than attaching the resolver to a flow, but this never materialized. I unfortunately doubt that I will have time to do anything about the resolver before NEAT is over, but if anyone else wants to have a go, that would be great.

The best solution would probably be to have a small cache, say 10 entries with a short TTL, attach the resolver to the context, and change the resolver from being on-demand to processing a request queue (with a limit N on how many requests can be processed in parallel). We could also attach flows to DNS requests already in flight when there are multiple requests for the same domain. The first request for a given domain would then trigger a resolve, and the others would hit the cache or be attached to that request.

kristrev avatar Jan 26 '18 10:01 kristrev
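
To make the proposal concrete, here is a rough sketch of the shapes such a cache and request queue might take; every name and constant below is hypothetical, and nothing like this exists in NEAT:

#include <stddef.h>
#include <time.h>

struct neat_flow;            /* opaque; only used as a pointer here */

#define CACHE_SIZE     10    /* "small cache, say 10 entries" */
#define CACHE_TTL_SEC  30    /* "short TTL" -- this value is a guess */
#define MAX_PARALLEL   4     /* limit N on concurrent resolutions */

struct dns_cache_entry {
    char   domain[256];
    char   addr[46];         /* large enough for an IPv6 string */
    time_t expires;          /* entry is stale once now > expires */
};

struct dns_request {
    char                domain[256];
    struct neat_flow  **waiting_flows;  /* flows attached to this lookup */
    size_t              n_waiting;
    struct dns_request *next;           /* FIFO queue link */
};

struct dns_resolver {
    struct dns_cache_entry cache[CACHE_SIZE];
    struct dns_request    *queue_head;  /* requests not yet started */
    int                    in_flight;   /* never exceeds MAX_PARALLEL */
};

/* Lookup order: 1) cache hit -> answer immediately; 2) matching
 * in-flight request -> attach the flow to it; 3) otherwise enqueue,
 * and start the request once in_flight < MAX_PARALLEL. */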