"Hello world" HTTP server dies under load

Open magv opened this issue 7 years ago • 30 comments

Hi. I've tried benchmarking cohttp as a potential web server for my needs, but it seems to die under load. I've got two different setups where this happens.

Here's the server I'm testing:

open Lwt
open Cohttp
open Cohttp_lwt_unix

let server =
  let callback _conn req body =
    body |> Cohttp_lwt_body.to_string >|= (fun body -> "Hello world")
    >>= (fun body -> Server.respond_string ~status:`OK ~body ())
  in
  Server.create ~mode:(`TCP (`Port 8000)) (Server.make ~callback ())

let () = ignore (Lwt_main.run server)

(I also tried using cohttp-server-lwt, and it behaves similarly).

Setup 1: FreeBSD, OCaml 4.03, Unix_error(EINVAL, "select", "")

First I build and start the server like so:

$ uname -mrs
FreeBSD 10.1-RELEASE-p26 amd64

$ ocaml -version
4.03.0

$ opam list --installed cohttp conduit lwt
# Installed packages for 4.03.0:
cohttp   0.21.0  HTTP(S) library for Lwt, Async and Mirage
conduit  0.12.0  Network connection library for TCP and SSL
lwt       2.5.2  A cooperative threads library for OCaml

$ ocamlbuild -r main.native -pkgs cohttp.lwt

$ env OCAMLRUNPARAM=b ./main.native -p 8000

Then, on a separate (or the same) machine I run wrk2 a few times:

$ wrk -d20s -t4 -c5k -R10k 'http://<ip>:8000'

... and the server dies with this message:

Fatal error: exception Unix.Unix_error(Unix.EINVAL, "select", "")
Raised by primitive operation at file "src/unix/lwt_engine.ml", line 371, characters 26-60
Called from file "src/unix/lwt_main.ml", line 41, characters 8-82
Called from file "main.ml", line 12, characters 16-37

Setup 2: Linux, OCaml 4.02, Unix_error(EMFILE, "accept", "")

First I build and start the server like this:

$ uname -mrs
Linux 4.6.5-200.fc23.x86_64 x86_64

$ ocaml -version
The OCaml toplevel, version 4.02.2

$ opam list --installed cohttp conduit lwt  
# Installed packages for system:
cohttp   0.20.0  HTTP(S) library for Lwt, Async and Mirage
conduit  0.11.0  Network connection library for TCP and SSL
lwt       2.5.2  A cooperative threads library for OCaml

$ ocamlbuild -r main.native -pkgs cohttp.lwt

$ env OCAMLRUNPARAM=b ./main.native -p 8000

Then from another machine I run wrk2:

$ wrk -t4 -L -c5k -d20s -R10k 'http://<ip>:8000'

... and the server dies:

Fatal error: exception Unix.Unix_error(Unix.EMFILE, "accept", "")
Raised at file "src/core/lwt.ml", line 789, characters 22-23
Called from file "src/unix/lwt_main.ml", line 34, characters 8-18
Called from file "main.ml", line 12, characters 16-37

I'm not sure if these two are the same problem, but whatever the case, a web server must not die from too many incoming connections.

magv avatar Aug 16 '16 19:08 magv

If you install libev and the conf-libev opam package then you should be able to handle far more connections. That will cause Lwt to use libev rather than select for its internal event handling.

hcarty avatar Aug 16 '16 19:08 hcarty

EMFILE -- running out of file descriptors perhaps? Does raising the limit help, eg ulimit -n unlimited?

j0sh avatar Aug 16 '16 19:08 j0sh

@hcarty, sorry, I forgot to mention it: I did install both libev and conf-libev in both setups prior to installing cohttp.

magv avatar Aug 16 '16 19:08 magv

#328 is relevant to this.

seliopou avatar Aug 16 '16 19:08 seliopou

@j0sh, in the first setup, I don't think so; ulimit -n is already at 232659.

In the second setup, probably so; ulimit -n was 1024, raising it does help. This is still a problem though, since I can raise the number of incoming connections too, and the server will still fail.

magv avatar Aug 16 '16 19:08 magv

Does Linux still have the same error after raising the ulimit? If not, what's the new error?

j0sh avatar Aug 16 '16 19:08 j0sh

@j0sh, it's the same error, it just takes a higher number of simultaneous connections to get it.

(ulimit -n unlimited doesn't work on my system; I can only raise it to a finite number).

magv avatar Aug 16 '16 19:08 magv

Looks like on FreeBSD libev is not using kqueue() but is instead falling back to select(), which can only handle 128 fds at a time. (EDIT: probably more than this but the point is it's low.) ulimits won't change that unfortunately.

seliopou avatar Aug 16 '16 19:08 seliopou

See also ocsigen/lwt#87.

seliopou avatar Aug 16 '16 19:08 seliopou

@seliopou, I think the FreeBSD port actually forces the kqueue backend, or at least it tries to (i.e. this patch); whether it succeeds or not, I'm not sure.

In any case, this should be a reason for performance degradation, not an outright failure, right?

magv avatar Aug 16 '16 19:08 magv

If the application passes too many fds to select(), it will fail.

seliopou avatar Aug 16 '16 19:08 seliopou

@seliopou, so, which code is responsible for checking the number of fds passed into select? I mean, the problem remains, a web server should not die.

BTW, I've just tried this simple program:

#include <ev.h>
#include <stdio.h>

int main() {
    struct ev_loop *loop = EV_DEFAULT;
    printf("%d\n", ev_backend(loop) == EVBACKEND_KQUEUE);
}

... and it prints 1, so libev does indeed use kqueue backend by default on my FreeBSD. Maybe Lwt overrides this choice somewhere?

magv avatar Aug 16 '16 20:08 magv

I've added a backtrace to the FreeBSD failure, and it turns out that the exception originates over here, inside Lwt's "select" engine. In other words, the libev-based engine is not used at all on FreeBSD. The cause is the libev_default setting in Lwt's _oasis file, which is enabled by default only under Linux. I'm guessing ocsigen/lwt#87 was caused by the same thing.

So, to work around libev_default being false one could do this:

let () =
  Lwt_engine.set (new Lwt_engine.libev)

With this addition the FreeBSD setup stops dying with Unix_error(EINVAL, "select", ""), and starts dying with the same error as the Linux one, Unix_error(EMFILE, "accept", "") (given enough incoming connections).

The backtrace for that exception points here and also here, but neither of those seems relevant.

What needs to happen with the EMFILE error is that somewhere in the stack there needs to be code that:

  1. Detects accept failing due to too many open files.
  2. Optionally logs a warning about this.
  3. Adds a delay/retry loop with exponential backoff that will limit the amount of accept calls and will allow accepting new connections once some of the existing ones are closed.

The overall result should be for the server to accept as many incoming connections as it's allowed to, and to continue serving as many connections as possible even after the fd limit is reached.
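
The three steps above can be sketched roughly as follows. This is only an illustration of the idea, not code from cohttp, conduit or Lwt: it uses the blocking Unix module rather than Lwt, and accept_with_backoff, min_delay and max_delay are made-up names with arbitrary values.

```ocaml
(* Sketch only: a blocking illustration of "detect EMFILE, log, back off,
   retry". All names and delay values here are hypothetical. *)

let min_delay = 0.01  (* seconds; assumed initial retry delay *)
let max_delay = 1.0   (* seconds; assumed upper bound on the delay *)

(* [accept] and [sleep] are passed in so that the loop can be exercised
   without real sockets. *)
let accept_with_backoff ?(log = prerr_endline) ~accept ~sleep () =
  let rec loop delay =
    match accept () with
    | conn -> conn
    | exception Unix.Unix_error ((Unix.EMFILE | Unix.ENFILE), _, _) ->
      (* Steps 1 and 2: detect fd exhaustion and log a warning. *)
      log (Printf.sprintf "accept: too many open files, retrying in %.2fs" delay);
      (* Step 3: wait, then retry with a doubled (but capped) delay. *)
      sleep delay;
      loop (min max_delay (delay *. 2.0))
  in
  loop min_delay
```

With a real listening socket this could be called as accept_with_backoff ~accept:(fun () -> Unix.accept listening_fd) ~sleep:Unix.sleepf (), where listening_fd is assumed to be an already bound and listening socket. Any other exception still propagates to the caller; an Lwt version would need the same logic around Lwt_unix.accept.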

magv avatar Aug 17 '16 09:08 magv

Also, I think it will be useful to document the fact that Lwt's "select" engine is:

  • enabled by default on every non-Linux system;
  • not equipped to serve public-facing traffic.

(Maybe it is already documented somewhere, and I've just missed it?)

magv avatar Aug 17 '16 09:08 magv

@magv I wouldn't mind mentioning this but this is really an Lwt specific thing so I'd like to see what's their documentation on this subject.

rgrinberg avatar Aug 17 '16 16:08 rgrinberg

Lwt's docs do mention that "Unix.select supports only 1024 [file descriptors] at most", so it is sort of covered, but then they also say that "libev is used by default on Unix", which as we've seen above is just false...

Documentation-wise I think it would be useful not so much to focus on describing every quirk of Lwt (or Async, which I suspect is similarly limited, because as far as I can see they only have an epoll backend), but rather to describe how to build a robust server out of cohttp. Under FreeBSD, for example, this currently means using Lwt with conf-libev, and manually forcing the Lwt_engine.libev engine... provided EMFILE behavior is fixed, and no other problem crops up, of course.

Speaking of which, is the EMFILE problem something that can be fixed in cohttp, or is it further up the stack (Lwt? conduit? something else?)?

magv avatar Aug 17 '16 19:08 magv

I will investigate the EMFILE issue.

In the meantime, if you'd like to help out it would be nice to see if the issue is reproducible with Async.

I can whip up an Async example as well, if you don't have any experience with it. But we have plenty of samples in the repo so it shouldn't be too bad.

rgrinberg avatar Aug 17 '16 20:08 rgrinberg

As for FreeBSD, I have no experience with that fine OS so contributions describing what's necessary would be very welcome.

rgrinberg avatar Aug 17 '16 20:08 rgrinberg

Here's the async server:

open Core.Std
open Async.Std
open Cohttp_async

let handler ~body:_ _ _ =
  Server.respond_with_string "Hello world"

let () =
  ignore (Cohttp_async.Server.create (Tcp.on_port 8000) handler);
  never_returns (Scheduler.go ())

Here's the load generator:

$ wrk -L -t2 -c1k -R1k -d30s "http://localhost:8000/"

On FreeBSD 10 amd64, with OCaml 4.03.0, async 113.33.00, conduit 0.12.0, and cohttp 0.21.0, the server dies pretty much instantly with this message:

(((pid 22030) (thread_id 0))
 ((human_readable 2016-08-18T13:09:12+0300)
  (int63_ns_since_epoch 1471514952681219000))
 "unhandled exception in Async scheduler"
 ("unhandled exception"
  ((src/monitor.ml.Error_
    ((exn
      (Unix.Unix_error "Connection reset by peer" shutdown
       "((fd 267) (mode SHUTDOWN_ALL))"))
     (backtrace
      ("Raised at file \"src/core_result.ml\", line 111, characters 23-26"
       "Called from file \"src/deferred1.ml\", line 14, characters 63-68"
       "Called from file \"src/job_queue.ml\", line 160, characters 6-47" ""))
     (monitor
      (((name Monitor.protect) (here ()) (id 2085) (has_seen_error true)
        (is_detached true))))))
   ((pid 22030) (thread_id 44)))))

Maybe I'm doing something wrong here? "Connection reset by peer" is a pretty trivial error after all.

magv avatar Aug 18 '16 10:08 magv

Under Linux (Fedora 23, amd64), with OCaml 4.02.2, async 113.00.00, conduit 0.9.0, and cohttp 0.19.3, the server dies with a similar error message, but only if I kill wrk before it's finished. Here's the message:

(((pid 23262) (thread_id 0))
 ((human_readable 2016-08-18T20:22:59+0300)
  (int63_ns_since_epoch 1471540979931848000))
 "unhandled exception in Async scheduler"
 ("unhandled exception"
  ((src/monitor.ml.Error_
    ((exn
      ("writer error"
       ((src/monitor.ml.Error_
         ((exn
           (Unix.Unix_error "Connection reset by peer"
            writev_assume_fd_is_nonblocking ""))
          (backtrace
           ("Raised at file \"src/writer0.ml\", line 596, characters 49-52"
            "Called from file \"src/job_queue.ml\", line 164, characters 6-47"
            ""))
          (monitor
           (((name (id 6921)) (here ()) (id 6921) (has_seen_error true)
             (is_detached true) (kill_index 0))
            ((name main) (here ()) (id 1) (has_seen_error true)
             (is_detached false) (kill_index 0))))))
        ((id 986)
         (fd
          ((file_descr 993)
           (info
            (socket
             ((listening_on
               ((type_
                 ((family
                   ((family PF_INET) (address_of_sockaddr_exn <fun>)
                    (sexp_of_address <fun>)))
                  (socket_type SOCK_STREAM)))
                (fd
                 ((file_descr 3)
                  (info
                   (replaced
                    (listening
                     (previously_was
                      (replaced
                       ((socket (bound_on (Inet (0.0.0.0 8000))))
                        (previously_was
                         (socket
                          ((family
                            ((family PF_INET) (address_of_sockaddr_exn <fun>)
                             (sexp_of_address <fun>)))
                           (socket_type SOCK_STREAM))))))))))
                  (kind (Socket Passive)) (supports_nonblock true)
                  (have_set_nonblock true)
                  (state
                   (Close_requested
                    ((monitor
                      (((name main) (here ()) (id 1) (has_seen_error true)
                        (is_detached false) (kill_index 0))))
                     (priority Normal) (local_storage ())
                     (backtrace_history ()) (kill_index 0))
                    <fun>))
                  (watching ((read Stop_requested) (write Not_watching)))
                  (watching_has_changed true) (num_active_syscalls 1)
                  (close_finished Empty)))))
              (client (Inet (127.0.0.1 48614))))))
           (kind (Socket Active)) (supports_nonblock true)
           (have_set_nonblock true) (state Closed)
           (watching ((read Not_watching) (write Not_watching)))
           (watching_has_changed false) (num_active_syscalls 0)
           (close_finished (Full ()))))
         (monitor
          (((name (id 6920)) (here ()) (id 6920) (has_seen_error true)
            (is_detached true) (kill_index 0))
           ((name main) (here ()) (id 1) (has_seen_error true)
            (is_detached false) (kill_index 0))))
         (inner_monitor
          (((name (id 6921)) (here ()) (id 6921) (has_seen_error true)
            (is_detached true) (kill_index 0))
           ((name main) (here ()) (id 1) (has_seen_error true)
            (is_detached false) (kill_index 0))))
         (background_writer_state Stopped_permanently) (syscall Per_cycle)
         (bytes_received 1036) (bytes_written 1025) (scheduled <opaque>)
         (scheduled_bytes 0) (buf <opaque>) (scheduled_back 0) (back 0)
         (flushes <opaque>) (close_state Closed) (close_finished (Full ()))
         (close_started (Full ())) (producers_to_flush_at_close ())
         (flush_at_shutdown_elt (<opaque>))
         (check_buffer_age
          (((writer <opaque>) (queue ()) (maximum_age 2m) (too_old Empty))))
         (consumer_left (Full ())) (raise_when_consumer_leaves true)
         (open_flags (Full (Ok (rdwr))))))))
     (monitor
      (((name (id 6920)) (here ()) (id 6920) (has_seen_error true)
        (is_detached true) (kill_index 0))
       ((name main) (here ()) (id 1) (has_seen_error true)
        (is_detached false) (kill_index 0))))))
   ((pid 23262) (thread_id 0)))))

If I don't kill it, there's also an error related to fd limit:

(((pid 23326) (thread_id 0))
 ((human_readable 2016-08-18T20:25:39+0300)
  (int63_ns_since_epoch 1471541139008687000))
 "unhandled exception in Async scheduler"
 ("unhandled exception"
  ((src/monitor.ml.Error_
    ((exn (Unix.Unix_error "Too many open files" accept "((fd 3))"))
     (backtrace
      ("Raised at file \"src/unix_syscalls.ml\", line 818, characters 28-31"
       "Called from file \"src/deferred1.ml\", line 117, characters 6-13"
       "Called from file \"src/deferred0.ml\", line 52, characters 2-10"
       "Called from file \"src/unix_syscalls.ml\", line 822, characters 4-61"
       "Called from file \"src/tcp.ml\", line 253, characters 6-28"
       "Called from file \"src/tcp.ml\", line 266, characters 10-24"
       "Called from file \"src/job_queue.ml\", line 164, characters 6-47" ""))
     (monitor
      (((name main) (here ()) (id 1) (has_seen_error true)
        (is_detached false) (kill_index 0))))))
   ((pid 23326) (thread_id 0)))))
libgcc_s.so.1 must be installed for pthread_cancel to work
Abort (core dumped)

Note the core dump. Also don't believe the complaint about missing libgcc_s.so.1, it's definitely there in /usr/lib.

magv avatar Aug 18 '16 17:08 magv

I will investigate the EMFILE issue.

I just hit this myself. The error I get is: (Unix.Unix_error "Too many open files" accept "") on Arch Linux using libev. I'm using cohttp with Lwt.

So this is a limitation with cohttp, and nothing I need to configure, is that right?

EDIT: After investigating a bit more, yes, cohttp just needs to handle this and retry if accept fails as mentioned above. If I find time in the next month or so, I might try to implement it (no promises though :))

jlongster avatar Nov 16 '16 15:11 jlongster

Just encountered this issue when adding mirage as a test in these benchmarking trials https://github.com/TechEmpower/FrameworkBenchmarks/compare/master...ciarancourtney:mirageos Is there any workaround?

ciarancourtney avatar Aug 16 '17 13:08 ciarancourtney

Have you tried using the ~on_exn handler on Server.create? Did it catch anything?

SGrondin avatar Aug 16 '17 18:08 SGrondin

Hi @SGrondin, I resolved my crash issues by combining 2 workarounds actually: https://github.com/TechEmpower/FrameworkBenchmarks/pull/2938/files

ciarancourtney avatar Aug 16 '17 18:08 ciarancourtney

@ciarancourtney I was about to suggest set_max_active - I've had to do the same for projects locally.

Glad those two fixed the crash.

hcarty avatar Aug 16 '17 18:08 hcarty

Should we maybe fix this in Lwt, by always recommending conf-libev, maybe issuing warnings if it's not installed? Are there any platforms where libev is not available, that we are trying to support? If not, can we drop using select() by default at all? Can we add some configuration steps to libev and/or Lwt, to make sure we always use the right backend for each system?

Lwt already recommends always installing libev and conf-libev. For example, the installation instructions.

aantron avatar Aug 16 '17 19:08 aantron

@aantron There are a few parts to this. Even with libev the issue will come up if set_max_active isn't used and there are a large number of open connections. Same for the exceptions thrown into Lwt's async pool. And for simple projects/testing, using the select backend works well enough. For some definition of "well enough".
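
The two mitigations named above (set_max_active, and dealing with exceptions thrown into Lwt's async pool) might look roughly like this in an Lwt-based server. This is a sketch under assumptions: it needs a conduit release that provides Conduit_lwt_unix.set_max_active, and the limit of 900 is an arbitrary illustrative value picked to stay under a 1024 fd limit.

```ocaml
(* Sketch only; requires the lwt and conduit-lwt-unix libraries.
   The numeric limit is an assumption, not a recommendation. *)
let () =
  (* Log exceptions raised in detached Lwt threads instead of letting the
     default hook terminate the process. *)
  Lwt.async_exception_hook :=
    (fun exn -> prerr_endline ("async exception: " ^ Printexc.to_string exn));
  (* Cap the number of connections served at once so that it stays below
     the fd limit (ulimit -n), leaving headroom for the listening socket,
     stdio and any files the handlers open. *)
  Conduit_lwt_unix.set_max_active 900
```

Both calls would have to run before the server is started with Lwt_main.run.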

@fdopen's opam-for-windows repo has libev disabled under Windows (currently the last line of https://github.com/fdopen/opam-repository-mingw/blob/master/packages/conf-libev/conf-libev.4-11/opam). It would be nice to fix/address this before pushing too hard on a libev requirement.

hcarty avatar Aug 16 '17 19:08 hcarty

Fair enough. If it's ever the right time to require libev, please let me know.

There is also of course the impending libuv thing.

aantron avatar Aug 16 '17 20:08 aantron

I don't think there is any point in using libev instead of select on Windows. libev can only use select on Windows anyway, and Windows's rudimentary select supports only sockets, whereas OCaml's select emulation will also work with other kinds of HANDLE.

Besides this, the current Lwt bindings would be broken on Windows anyway. Therefore I've just disabled the package completely.

fdopen avatar Aug 19 '17 17:08 fdopen

@fdopen Thank you for the explanation!

hcarty avatar Aug 19 '17 22:08 hcarty