
requests "hang" when nginx uses keepalive connections

Open nicolasfranck opened this issue 4 years ago • 16 comments

When I use an upstream with keepalive connections in nginx like this ..

upstream iipsrv {
  least_conn;
  keepalive 8;
  keepalive_requests 50;
  keepalive_timeout 60s;
  # ip's not important
  server 127.0.0.1:9000 fail_timeout=10s;
  server 128.0.0.1:9000 fail_timeout=10s;
}

(and set proxy_http_version 1.1; proxy_set_header Connection ""; and also fastcgi_keep_conn on) (cf. http://nginx.org/en/docs/http/ngx_http_upstream_module.html)

.. then a lot of requests seem to "hang": some succeed, but many receive a timeout. Once I turn keepalive off, it starts working again.

Any idea why this happens?

Would it be a performance gain if connections could be reused? TCP connections are often left in the TIME_WAIT state after closing (for a few seconds!), and that may lead to exhaustion of the remaining free ports..

nicolasfranck avatar Dec 11 '20 15:12 nicolasfranck

When using load balancing methods other than the default round-robin method, it is necessary to activate them before the keepalive directive.

http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive

I don't know if this is actually the problem, but I just noticed this note and thought of this issue. It looks like you're activating the keepalive before the server statements, and not using the default method. Might be worth checking either a) moving the server declarations above the keepalive config; or b) removing the least_conn config to see if that's the problem?
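
So, roughly something like this (a sketch combining both suggestions, with your server lines kept as-is):

upstream iipsrv {
  # server declarations first, using the default round-robin method
  server 127.0.0.1:9000 fail_timeout=10s;
  server 128.0.0.1:9000 fail_timeout=10s;
  # keepalive directives only after the servers
  keepalive 8;
  keepalive_requests 50;
  keepalive_timeout 60s;
}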

ahankinson avatar Feb 04 '21 09:02 ahankinson

@ahankinson unfortunately, that did not make a difference

nicolasfranck avatar Feb 15 '21 20:02 nicolasfranck

I tried to mimic this in Go, and I see the same problem there.

See the following fcgi client program:

package main

import (
  "os"
  "github.com/iwind/gofcgi/pkg/fcgi"
  "io/ioutil"
  "bufio"
  "fmt"
  "time"
)

func main(){

  params := map[string]string{
    "SERVER_SOFTWARE": "gofcgi/1.0.0",
    "REMOTE_ADDR":     "127.0.0.1",
    "SCRIPT_NAME":     "",
    "REQUEST_URI":     "/iipsrv",
    "QUERY_STRING":    "",
    "SERVER_NAME":     "localhost",
    "SERVER_ADDR":     "127.0.0.1:80",
    "SERVER_PORT":     "80",
    "HTTP_HOST":       "localhost",
    "REQUEST_METHOD":  "GET",
  }

  // retrieve shared pool
  // IMPORTANT: iipsrv does not work properly with pooled connections (halts after 13 or so requests)
  //            also proved with nginx upstream "fastcgi_keep_conn" that leads to non working connection after a while
  pool := fcgi.SharedPool("tcp", "127.0.0.1:9000", 4)

  buf := bufio.NewWriter(os.Stdout)
  defer buf.Flush()

  for i := 0; i < 1000; i += 1 {

    // write loop iterator to stderr
    fmt.Fprintf(os.Stderr, "i: %v\n", i)

    // create new client
    client, err := pool.Client()
    if err != nil {
        return
    }

    // create a request
    req := fcgi.NewRequest()
    req.SetTimeout(5 * time.Second)
    req.SetParams(params)

    // call request
    resp, _, err := client.Call(req)
    if err != nil {
        fmt.Fprintf(os.Stderr, "noooo!\n")
        return
    }

    // read data from response
    data, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        return
    }

    // write response data to stdout
    buf.Write(data)

  }

}

This program connects to localhost:9000 (iipsrv), writes the response to stdout, and repeats this a thousand times. After a few requests (10 or 30) the connections start to fail, and the program gives up. If I set the number of pooled connections to 1, everything works fine, just like I experienced with nginx.

I tried my own FastCGI server program in order to see where the problem lies:

#include <iostream>
#include <fcgio.h>
#include <fcgiapp.h>

void handle_request(FCGX_Request& request){
    fcgi_streambuf cout_fcgi_streambuf(request.out);
    std::ostream os{&cout_fcgi_streambuf};

    os << "HTTP/1.1 200 OK\r\n"
       << "Content-type: text/plain\r\n\r\n"
       << "Hello!\r\n";
}

int main(){
    FCGX_Request request;
    FCGX_Init();
    FCGX_InitRequest(&request, 0, 0);
    while (FCGX_Accept_r(&request) == 0) {
        handle_request(request);
    }
}

(built on CentOS with g++ -std=c++11 -o fcgi_example fcgi_example.cpp -lfcgi -lfcgi++)

and that program had the same problem when the client program tried to connect to it a few times in a row. Are pooled connections actually supported in FastCGI? I could not find any documentation about them (documentation about FastCGI is hard to find anyway).

It is not that the connections are slow, but on Linux you can run out of usable ports: every time the connection is closed, the socket enters the TIME_WAIT state, during which that port cannot be reused for a few seconds.

nicolasfranck avatar Sep 05 '21 19:09 nicolasfranck

What is actually the recommended way to set up FastCGI applications like this? The only way I managed to set up iipsrv in a more robust way is like this:

  • put iipsrv on a server instance, running on a unix socket
  • put a proxy server (like nginx) in front of it on the same server. As there is no TCP stack overhead, you cannot deplete your client ports (see the sketch below)
  • repeat that setup for every (replicated) server instance. So server1:80, server2:80 ...
  • add an extra server instance that hosts a proxy server (again) that splits requests between the replicated instances, and make sure HTTP 1.1 is used

Before this, that last server instance was splitting requests to server1:9000 (iipsrv), server2:9000 .., but those were fcgi connections, which are closed by iipsrv every time. The round robin method with fcgi only seems to work for php-fcgi, weird..
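
A rough sketch of what I mean (the socket path and hostnames are just examples):

# on each replicated serverN: iipsrv listens on a local unix socket, e.g.
#   spawn-fcgi -s /var/run/iipsrv.sock -- /usr/lib64/cgi-bin/iipsrv.fcgi
server {
    listen 80;
    location /iipsrv {
        fastcgi_pass unix:/var/run/iipsrv.sock;
        include fastcgi_params;
    }
}

# on the extra front-end server: plain HTTP 1.1 towards the replicas
upstream iipsrv_backends {
    server server1:80;
    server server2:80;
    keepalive 8;
}
server {
    listen 80;
    location /iipsrv {
        proxy_pass http://iipsrv_backends;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}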

nicolasfranck avatar Dec 16 '21 08:12 nicolasfranck

I tried with supervisord, like @ahankinson suggests here, and that seems to work as expected. Now the keepalive connections from nginx to iipsrv seem to work.

Strange, the same doesn't work with spawn-fcgi...

Anyway, when I start iipsrv with spawn-fcgi, that last program (spawn-fcgi itself) is nowhere to be seen afterwards. Isn't a FastCGI application supposed to have an FCGI process manager?

P.S. I'm running this on CentOS 7

nicolasfranck avatar Dec 17 '21 21:12 nicolasfranck

I tried tweaking the number of worker processes nginx uses and set it to 1. In that case keepalive connections from nginx to iipsrv also work. Weird that that makes the difference. I tried looking into the FastCGI documentation, but could not find anything to configure; everything is set correctly. It is like something goes wrong when using multiple workers..

nicolasfranck avatar Dec 19 '21 14:12 nicolasfranck

I had a talk with a user on the nginx forum. Apparently, when keepalive is used, every nginx worker holds on to its own connection to an FCGI worker. If there is only one FCGI worker, all the other nginx workers have to wait until fastcgi_connect_timeout is reached, and this is why it "hangs". This means that I have to increase the number of FCGI workers using spawn-fcgi.
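
For example (the number of children here is just an illustration; it needs to be at least the number of nginx workers that may hold a keepalive connection):

# fork several iipsrv children that all share the listening socket on port 9000
spawn-fcgi -p 9000 -F 4 -- /usr/lib64/cgi-bin/iipsrv.fcgi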

This also means that, every time I add a new nginx proxy to my fcgi ensemble, I have to increase the number of fcgi workers in order to avoid this scenario.

I know this is not really an iipsrv issue, but it seems an interesting topic to talk about when it comes to deployment. I have been using iipsrv for years, and it has always been running with one worker (as I thought OpenMP was used for handling requests in parallel), so the need for multiple workers was a bit unclear to me. Something for the docs?

nicolasfranck avatar Dec 20 '21 22:12 nicolasfranck

Interesting. Using supervisord I can spawn a number of IIP processes all listening on port 9000; I'm assuming supervisor has some internal routing, and that that's why it works well -- nginx connects to the 'single' instance, and the request is then routed internally to a waiting backend?
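
(For reference, the supervisord side is just an [fcgi-program:x] section, roughly like the sketch below; supervisord creates the listening socket itself and hands it to every iipsrv child, and the children all accept on it. The numprocs value and path are only examples.)

[fcgi-program:iipsrv]
socket=tcp://127.0.0.1:9000
command=/usr/lib64/cgi-bin/iipsrv.fcgi
numprocs=4
process_name=%(program_name)s_%(process_num)02d
autorestart=true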

ahankinson avatar Dec 21 '21 09:12 ahankinson

Besides, I discovered this tool: https://github.com/lighttpd/multiwatch. It is already available on CentOS 7, and it requires you to start iipsrv with spawn-fcgi:

# set environment first of course
export MAX_IMAGE_CACHE_SIZE=100

# use spawn-fcgi to start multiwatch in fcgi mode on port 9000, which starts and manages two instances of the iipsrv
spawn-fcgi -p 9000 -- /usr/bin/multiwatch -f 2 -- /usr/lib64/cgi-bin/iipsrv.fcgi
 

nicolasfranck avatar Nov 02 '22 15:11 nicolasfranck

Unfortunately multiwatch does not seem to be actively maintained anymore, judging by the latest commits.

Anyway, is there a reason why iipsrv only supports the FastCGI protocol? Why not HTTP 1.1? Because of the lack of a widely supported HTTP library? Of course, it is not hard to proxy the requests, but as I've experienced, every FastCGI instance acts as a single "worker" (just like an Apache worker) and can only accept one request at a time, which makes it "impossible" to use keepalive connections.

nicolasfranck avatar Nov 29 '22 13:11 nicolasfranck

When we started IIPImage, there were indeed no good open source light-weight http libraries we could use. Also FastCGI is a much faster and leaner protocol than HTTP and is better adapted for connecting your front-end web server to a back-end process - using iipsrv behind a front-end web server is the way most people will want to use it.

It's true, however, that there are now good HTTP libraries available, so I guess it would be useful to have iipsrv support HTTP directly for those who want to use it without a front-end web server.

And regarding keepalive, is this not redundant if you enable HTTP/2 on your web server?

ruven avatar Dec 07 '22 14:12 ruven

I mean keepalive connections between the front-end server and iipsrv. Often the front-end server and iipsrv are kept on separate machines (for example to spread the load using an upstream in nginx).

nicolasfranck avatar Dec 07 '22 15:12 nicolasfranck

I mean keepalive connections between the front-end server and iipsrv.

If I've understood correctly, you're using HTTP between your front-end server and your various Nginx/iipsrv backends? In theory, it would be more efficient to use FCGI rather than HTTP between the front-end and your iipsrv processes (with no Nginx on each iipsrv machine). Going back to one of your earlier comments:

Before this, that last server instance was splitting requests to server1:9000 (iipsrv), server2:9000 .., but those were fcgi connections, which are closed by iipsrv every time. The round robin method with fcgi only seems to work for php-fcgi, weird..

iipsrv doesn't close the connection (at least not explicitly): FCGX_Finish_r() is only called on shutdown, not between requests. How did you monitor the connection status in this configuration? If this works for php-fcgi, maybe there's something we can do to make it work?

And, have you done any quantitative benchmarks for the different setups you've tried? It'd be interesting to measure what kind of performance you get in reality.

There's also the question of the FCGI library iipsrv uses. We use libfcgi, which is the original implementation of FastCGI, but it's now old and no longer updated. Perhaps we could consider migrating to newer libraries such as https://github.com/eddic/fastcgipp or https://cgi.sourceforge.net/?

ruven avatar Dec 07 '22 16:12 ruven

The connection between the frontend-server (nginx) and the various iipsrv instances is of course in fastcgi, how else could it be?

The connection is closed by nginx, as it uses conservative settings by default. If you enable keepalive for FastCGI in nginx (which requires several settings), the first connection is established successfully (as seen with netstat -an | grep 9000), but subsequent requests "hang". When you use only one nginx worker, the requests never hang.

It turns out that the first nginx worker/thread binds to the connection, and the other nginx workers cannot use it until a timeout is reached. The iipsrv instance only accepts one connection at a time, so it does not accept any additional connections; it more or less acts as a "single worker". If you increase the number of iipsrv instances on that same port, and make sure that number is greater than or equal to the number of nginx workers, it works too. But what if you reuse the iipsrv instances for other front-end nginx servers?

So it is also a question of threading/instances in iipsrv itself.

I am not sure about the performance gain (it seems negligible), but the way closed TCP connections are handled (see the remark about TIME_WAIT above) is not ideal, especially when requests come in faster than Linux can recycle the closed client ports (a client port is kept in TIME_WAIT for 5 seconds or more!). In that case you eventually run out of ports in a short amount of time, and start seeing errors like "dns error" because the machine cannot even connect to DNS anymore.
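
You can watch this happen with something like:

# count sockets stuck in TIME_WAIT while the requests are coming in
watch -n 1 'netstat -an | grep -c TIME_WAIT'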

nicolasfranck avatar Dec 07 '22 21:12 nicolasfranck

The connection between the frontend-server (nginx) and the various iipsrv instances is of course in fastcgi, how else could it be?

Sorry, I thought you were using HTTP 1.1 as you said in your previous comment:

* put iipsrv on a server instance, running on a unix socket
* put a proxy server (like nginx) in front of it on the same server. As there is no TCP stack overhead, you cannot deplete your client ports
* repeat that setup for every (replicated) server instance. So server1:80, server2:80 ...
* add an extra server instance that hosts a proxy server (again) that splits requests between the replicated instances, and make sure HTTP 1.1 is used

So, what exactly are the Nginx settings you are currently using? What you have in your first comment plus something like this?

location /iiif/ {
    fastcgi_pass iipsrv;
    fastcgi_keep_conn on;
    ...
}

So it is also a question of threading/instances in iipsrv itself.

Yes, iipsrv uses a single main thread and so acts as a single worker (though note that it does use threading for its image processing routines). But it should be possible to thread the main listener loop - I can't really tell if it'll be complicated or not. Maybe for version 2.0 of iipsrv!

ruven avatar Dec 08 '22 11:12 ruven

A snippet out of my config:

http {
    # cf. http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive
    # no keepalive without upstream
    upstream iipsrv {
      server 127.0.0.1:9000 fail_timeout=10s;
      keepalive 10; # needed for keepalive
      keepalive_requests 50;
      keepalive_timeout 60s;
    }
    server {
        listen 80 default_server;
        listen [::]:80 default_server;
        server_name  _;
        location = /iipsrv {
          fastcgi_pass_request_headers on;
          fastcgi_pass iipsrv;
          fastcgi_connect_timeout 2s;
          fastcgi_keep_conn on; # needed for keepalive
          fastcgi_param QUERY_STRING $query_string;
          fastcgi_param REQUEST_URI $request_uri;
          fastcgi_param PATH_INFO $fastcgi_script_name;
          fastcgi_param REQUEST_METHOD $request_method;
          fastcgi_param CONTENT_TYPE $content_type;
          fastcgi_param CONTENT_LENGTH $content_length;
          fastcgi_param SERVER_PROTOCOL $server_protocol;
          fastcgi_param HTTPS $https;
          fastcgi_param IF_MODIFIED_SINCE $http_if_modified_since;
        }
    }
}

See also http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive, which explains how to enable keepalive (you need several settings).

nicolasfranck avatar Dec 08 '22 12:12 nicolasfranck