lua-nginx-module icon indicating copy to clipboard operation
lua-nginx-module copied to clipboard

Semaphore not cleaned up on request timeout

Open dark264sh opened this issue 7 months ago • 8 comments

After a ngx.semaphore.wait() call, if the current request is terminated by Nginx due to client_header_timeout or client_body_timeout, the corresponding post() is never executed. As a result, all subsequent attempts to wait() on the same semaphore hang until they time out, because the semaphore count was never restored.

1. Reproduce by client_header_timeout on ssl_certificate_by_lua_block


#user  nobody;
worker_processes  1;
daemon off;
master_process off;

error_log  logs/debug.log  debug;

#pid        logs/nginx.pid;


events {
    worker_connections  1024;
}

http {
    lua_code_cache on;

    include       mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  logs/access.log  main;

    sendfile        on;
    #tcp_nopush     on;

    #keepalive_timeout  0;
    keepalive_timeout  65;

    #gzip  on;

    # Shared dictionary for the balancer
    lua_shared_dict my_balancer_dict 1m;

    init_worker_by_lua_block {
        local semaphore = require "ngx.semaphore"
        -- create a semaphore with capacity = 1
        local sema, err = semaphore.new(1)
        if not sema then
            ngx.log(ngx.ERR, "failed to create semaphore: ", err)
            return
        end
        ssl_semaphore = sema
    }

	server {
		listen       8080;
		server_name localhost;

		location / {
			content_by_lua_block {
				ngx.sleep(10)
				ngx.say("Hello, world!")
			}
		}
	}

    server {
        listen       8443 ssl;
		server_name edge.foo.com;

		client_header_timeout 8;

        # your existing certs:
		ssl_certificate     /home/barca/src/d26-sandbox/go/data/server.crt;
		ssl_certificate_key /home/barca/src/d26-sandbox/go/data/server.key;

        ssl_certificate_by_lua_block {
            local sema = ssl_semaphore
            local ok, err = sema:wait(5)
            if not ok then
                ngx.log(ngx.ERR, "SEMA] semaphore wait failed: ", err)
                return ngx.exit(ngx.ERROR)
            end

            ngx.log(ngx.INFO, "SEMA] acquired semaphore, sleeping 30s")

	        local httpc = require "resty.http"
	        local http_client, err = httpc.new()
	        if not http_client then
	            ngx.log(ngx.ERR, "failed to create http client: ", err)
	            sema:post(1)
	            return ngx.exit(ngx.ERROR)
	        end
			--http_client:set_timeouts(10000, 10000, 10000) -- connect, send, read timeouts
	
	        local request_uri_to_call = "http://127.0.0.1:8080/some/path"
	        local res, err = http_client:request_uri(request_uri_to_call, {
	            method = "GET",
	        })
	
	        if not res then
	            ngx.log(ngx.ERR, "failed to make HTTP request to ", request_uri_to_call, ": ", err)
	            return ngx.exit(ngx.ERROR)
	        end
	
	        ngx.log(ngx.INFO, "HTTP request to ", request_uri_to_call, " completed with status: ", res.status)
	        local ok, err = http_client:set_keepalive()
	        if not ok then
	            ngx.log(ngx.ERR, "failed to set keepalive: ", err)
	        end


            local posted, perr = sema:post(1)
            if not posted then
                ngx.log(ngx.ERR, "SEMA] semaphore post failed: ", perr)
            else
                ngx.log(ngx.INFO, "SEMA] released semaphore")
            end
        }

        location / {
			return 200 "Hello, world!";
        }
    }
}

The first request times out in the SSL phase and never calls post(). All subsequent requests hang until their wait() deadlines.

curl -v -o /dev/null --connect-to edge.foo.com:443:127.0.0.1:8443 'https://edge.foo.com/' --insecure
curl -v -o /dev/null --connect-to edge.foo.com:443:127.0.0.1:8443 'https://edge.foo.com/' --insecure
curl -v -o /dev/null --connect-to edge.foo.com:443:127.0.0.1:8443 'https://edge.foo.com/' --insecure

2. Reproduce by client_body_timeout on access_by_lua_block


#user  nobody;
worker_processes  1;
daemon off;
master_process off;

#error_log  logs/error.log;
#error_log  logs/error.log  notice;
#error_log  logs/error.log  info;
error_log  logs/debug.log  debug;

#pid        logs/nginx.pid;


events {
    worker_connections  1024;
}

http {
    lua_code_cache on;

    include       mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  logs/access.log  main;

    sendfile        on;
    #tcp_nopush     on;

    #keepalive_timeout  0;
    keepalive_timeout  65;

    #gzip  on;

    # Shared dictionary for the balancer
    lua_shared_dict my_balancer_dict 1m;

	init_worker_by_lua_block {
	    local semaphore = require "ngx.semaphore"
	    local sema, err = semaphore.new(1)
	    if not sema then
	        ngx.log(ngx.ERR, "failed to create semaphore: ", err)
	        return
	    end
	    upload_semaphore = sema
	}

    client_body_timeout   5s;

    server {
        listen       8000;
		server_name edge.foo.com;

        # on 408 jump here
        error_page 408 = @body_timeout;

        location /upload {
            access_by_lua_block {
				local sema = upload_semaphore
                local ok, err = sema:wait(10)
                if not ok then
                    ngx.log(ngx.ERR, "SEMA] semaphore wait failed: ", err)
                    return ngx.exit(500)
                end
                ngx.log(ngx.INFO, "SEMA] acquired semaphore, now reading body")

                ngx.req.read_body()
                local data = ngx.req.get_body_data() or ""

                sema:post(1)
                ngx.log(ngx.INFO, "SEMA] released semaphore")

                ngx.say("Got ", #data, " bytes of body")
            }
        }

        location @body_timeout {
            return 408 "Request Timeout: client body not received within 5s\n";
        }
    }
}

The first slow upload triggers client_body_timeout and never calls post() Subsequent uploads hang at wait()

(   printf 'POST /upload HTTP/1.1\r\n';   printf 'Host: edge.foo.com\r\n';   printf 'Content-Length: 100\r\n';   printf 'Connection: close\r\n';   printf '\r\n';    sleep 10;    yes | head -c 100; ) | nc 127.0.0.1 8000
(   printf 'POST /upload HTTP/1.1\r\n';   printf 'Host: edge.foo.com\r\n';   printf 'Content-Length: 100\r\n';   printf 'Connection: close\r\n';   printf '\r\n';    sleep 3;    yes | head -c 100; ) | nc 127.0.0.1 8000
(   printf 'POST /upload HTTP/1.1\r\n';   printf 'Host: edge.foo.com\r\n';   printf 'Content-Length: 100\r\n';   printf 'Connection: close\r\n';   printf '\r\n';    sleep 3;    yes | head -c 100; ) | nc 127.0.0.1 8000

Proposed Solution Register a cleanup handler (as done in ngx_http_file_cache or ngx_http_lua_socket_tcp) when acquiring the semaphore. If the request is aborted, this handler should call sema:post() to restore the count.

dark264sh avatar May 21 '25 09:05 dark264sh

Would you please create a PR for this issue?

zhuizhuhaomeng avatar Jun 02 '25 09:06 zhuizhuhaomeng

Sounds good, I'll create a pull request.

dark264sh avatar Jun 03 '25 23:06 dark264sh

@zhuizhuhaomeng I sent the pull requests for this issue.

lua-nginx-module: https://github.com/openresty/lua-nginx-module/pull/2426 lua-resty-core: https://github.com/openresty/lua-resty-core/pull/502

I didn't send for stream-lua-nginx-module because there is a caution comment like this. I cannot find this file, I don't know how to handle it.

/*
 * !!! DO NOT EDIT DIRECTLY !!!
 * This file was automatically generated from the following template:
 *
 * src/subsys/ngx_subsys_lua_semaphore.c.tt2
 */

dark264sh avatar Jun 05 '25 02:06 dark264sh

From https://github.com/openresty/lua-resty-core/blob/master/lib/ngx/semaphore.md#synchronizing-threads-in-different-contexts, there is : "The same applies to different request contexts as long as these requests are served by the same nginx worker process."

Does you change break this?

zhuizhuhaomeng avatar Jun 11 '25 01:06 zhuizhuhaomeng

@zhuizhuhaomeng Good catch!

In this scenario we’re using ngx.timer.at with a closure before acquiring the semaphore. It seems to work initially, but problems occur when ngx.timer.at runs with a closure after semaphore acquisition.

I’m not sure how to resolve this yet. But this seems similar to this issue.

From https://github.com/openresty/lua-nginx-module?tab=readme-ov-file#ngxtimerat There is: "You can pass most of the standard Lua values (nils, booleans, numbers, strings, tables, closures, file handles, etc.) into the timer callback, either explicitly as user arguments or implicitly as upvalues for the callback closure. There are several exceptions, however: you cannot pass any thread objects returned by coroutine.create and ngx.thread.spawn or any cosocket objects returned by ngx.socket.tcp, ngx.socket.udp, and ngx.req.socket because these objects' lifetime is bound to the request context creating them while the timer callback is detached from the creating request's context (by design) and runs in its own (fake) request context. If you try to share the thread or cosocket objects across the boundary of the creating request, then you will get the "no co ctx found" error (for threads) or "bad request" (for cosockets). It is fine, however, to create all these objects inside your timer callback."

dark264sh avatar Jun 11 '25 08:06 dark264sh

@dark264sh You can save the sem state in ngx.ctx, and run sem:post() at log_by_lua in case the timeout occur. If exception occur, the request may run into subrequest. Eg: lua error, 502 error .... In this case, the saved state in ngx.ctx will be clear. You need to restore the ngx.ctx. See https://github.com/Kong/kong/blob/master/kong/resty/ctx.lua for more details about the restoring of ngx.ctx.

zhuizhuhaomeng avatar Jun 24 '25 07:06 zhuizhuhaomeng

@zhuizhuhaomeng

Hi there,

Thanks for taking the time to review this. I’m not entirely clear on your intention, so could you clarify whether you’re suggesting:

  1. The issue is better solved through a runtime configuration with Kong's ctx.lua (and therefore don’t need a PR), or
  2. This PR should be updated to incorporate your suggestions?

I apologize if I've misunderstood your point. Any additional details you could provide would be greatly appreciated.

If it’s the second option, could you provide more detail on the suggested changes?

dark264sh avatar Jun 25 '25 05:06 dark264sh

Your PR has caused the functionality to fail, which is unacceptable.

My recommendation is to make modifications at the business code level.

zhuizhuhaomeng avatar Jun 26 '25 13:06 zhuizhuhaomeng