freeradius-server icon indicating copy to clipboard operation
freeradius-server copied to clipboard

[defect]: connection pool uses connections that should have been closed

Open g0tar opened this issue 3 months ago • 10 comments

What type of defect/bug is this?

Unexpected behaviour (obvious or verified by project member)

How can the issue be reproduced?

I'm observing this behavior using rlm_cache_redis, not sure if this is module-specific or general. I think there are actually 2 distinct problems, but don't want to multiply issues that might have common source.

The easiest way is to deliberately misconfigure stack:

  • redis.conf: timeout 10
  • freeradius/mod-enabled/cache:
cache {
 driver = "rlm_cache_redis"
 redis {
  server = 127.0.0.1
    pool {
           start = 0
###        default settings:
###        lifetime = 0
###        retry_delay = 1
###        max_retries = 5
###        idle_timeout = 60
    }
 [...]
}

Start both servers, make them talk, wait a bit (let Redis close it's idles after 10 seconds) and make FR talk to Redis again. The connections would obviously fail, however I would expect FR to close it's end after the first failed attempt and restart fresh connection. Meanwhile such connections are being retried and fail over and over again.

Such misconfiguration isn't required, but it makes the first problem (not abandoning/fully resetting failing connections) easily visible. In fact my logs below are from Redis having timeout 100 (and this seems to be the second problem - reusing connections after idle_timeout has already passed long time ago). With this config I had to wait for 60 seconds to see ERROR logs, yet tcpdump shows there was the data actually returned from Redis (!).

Log output from the FreeRADIUS daemon

Please take a look at "connection (2)" [assuming this is some handle number] - it should be closed, as:
1. it's been rejected by Redis,
2. it's obsoleted (just like connections 4, 3, 1 and 0).

rlm_cache (cache_dhcp_accept): Reserved connection (1)
(2) cache_dhcp_accept: No cache entry found for "anonymized3"
rlm_cache (cache_dhcp_accept): Released connection (1)
[...]
rlm_cache (cache_dhcp_accept): Opening additional connection (3), 1 of 7 pending slots used
(2)     [cache_dhcp_accept] = notfound
[...]
rlm_cache (cache_dhcp_accept): Reserved connection (0)
(3) cache_dhcp_accept: Found entry for "anonymized"
(3) cache_dhcp_accept: Setting TTL to 1000
(3) cache_dhcp_accept: Merging cache entry into request
(3) cache_dhcp_accept:   &reply:Framed-IP-Address = 10.11.12.16
rlm_cache (cache_dhcp_accept): Released connection (0)
[...]
rlm_cache (cache_dhcp_accept): Opening additional connection (4), 1 of 6 pending slots used
[...]
rlm_cache (cache_dhcp_accept): Reserved connection (2)
(4) cache_dhcp_accept: ERROR: Failed retrieving entry for key "anonymized"
rlm_cache (cache_dhcp_accept): Released connection (2)
rlm_cache (cache_dhcp_accept): You probably need to lower "min"
rlm_cache (cache_dhcp_accept): Closing expired connection (4) - Hit idle_timeout limit
rlm_cache (cache_dhcp_accept): You probably need to lower "min"
rlm_cache (cache_dhcp_accept): Closing expired connection (3) - Hit idle_timeout limit
rlm_cache (cache_dhcp_accept): You probably need to lower "min"
rlm_cache (cache_dhcp_accept): Closing expired connection (1) - Hit idle_timeout limit
rlm_cache (cache_dhcp_accept): You probably need to lower "min"
rlm_cache (cache_dhcp_accept): Closing expired connection (0) - Hit idle_timeout limit
[...]
rlm_cache (cache_dhcp_accept): Reserved connection (2)
(4) cache_dhcp_accept: ERROR: Failed retrieving entry for key "anonymized"
rlm_cache (cache_dhcp_accept): Released connection (2)
[...]
rlm_cache (cache_dhcp_accept): Reserved connection (2)
(5) cache_dhcp_accept: ERROR: Failed retrieving entry for key "anonymized"
rlm_cache (cache_dhcp_accept): Released connection (2)
Need 4 more connections to reach min connections (5)
Need more connections to reach 3 spares
rlm_cache (cache_dhcp_accept): Opening additional connection (5), 1 of 9 pending slots used
[...]
rlm_cache (cache_dhcp_accept): Reserved connection (2)
(5) cache_dhcp_accept: ERROR: Failed retrieving entry for key "anonymized"
rlm_cache (cache_dhcp_accept): Released connection (2)
[...]
rlm_cache (cache_dhcp_accept): Reserved connection (5)
(6) cache_dhcp_accept: No cache entry found for "anonymized"
rlm_cache (cache_dhcp_accept): Released connection (5)
Need 3 more connections to reach min connections (5)
Need more connections to reach 3 spares
rlm_cache (cache_dhcp_accept): Opening additional connection (6), 1 of 8 pending slots used
[...]
rlm_cache (cache_dhcp_accept): Reserved connection (2)
(6) cache_dhcp_accept: ERROR: Failed retrieving entry for key "anonymized"
rlm_cache (cache_dhcp_accept): Released connection (2)
[...]
rlm_cache (cache_dhcp_accept): Reserved connection (5)
(7) cache_dhcp_accept: No cache entry found for "anonymized"
rlm_cache (cache_dhcp_accept): Released connection (5)
Need 2 more connections to reach min connections (5)
Need more connections to reach 3 spares
rlm_cache (cache_dhcp_accept): Opening additional connection (7), 1 of 7 pending slots used
[...]
rlm_cache (cache_dhcp_accept): Reserved connection (6)
(7) cache_dhcp_accept: No cache entry found for "anonymized"
(7) cache_dhcp_accept: Creating new cache entry
(7) cache_dhcp_accept:   &reply::Framed-IP-Address += &reply:Framed-IP-Address[*] -> 10.11.12.16
(7) cache_dhcp_accept: Committed entry, TTL 1000 seconds
rlm_cache (cache_dhcp_accept): Released connection (6)
[...]
rlm_cache (cache_dhcp_accept): Reserved connection (2)
(8) cache_dhcp_accept: ERROR: Failed retrieving entry for key "anonymized"
rlm_cache (cache_dhcp_accept): Released connection (2)
Need 1 more connections to reach min connections (5)
Need more connections to reach 3 spares
rlm_cache (cache_dhcp_accept): Opening additional connection (8), 1 of 6 pending slots used
[...]
rlm_cache (cache_dhcp_accept): Reserved connection (5)
(8) cache_dhcp_accept: Found entry for "anonymized"
(8) cache_dhcp_accept: Setting TTL to 1000
(8) cache_dhcp_accept: Told not to merge entry into request
rlm_cache (cache_dhcp_accept): Released connection (5)

Relevant log output from client utilities

$ tcpdump -npi lo tcp port 6379 -A

17:34:59.169681 IP 127.0.0.1.43366 > 127.0.0.1.6379: Flags [P.], seq 1:38, ack 1, win 260, options [nop,nop,TS val 3404957693 ecr 2568366423], length 37: RESP "GET" "anonymized"
$3
GET
$17
anonymized

17:34:59.169704 IP 127.0.0.1.6379 > 127.0.0.1.43366: Flags [R], seq 3034302121, win 0, length 0


17:38:30.390683 IP 127.0.0.1.48188 > 127.0.0.1.6379: Flags [P.], seq 83:120, ack 112, win 260, options [nop,nop,TS val 3405168914 ecr 2568585477], length 37: RESP "GET" "anonymized"
$3
GET
$17
anonymized

17:38:30.390712 IP 127.0.0.1.6379 > 127.0.0.1.48188: Flags [R], seq 1160069439, win 0, length 0

So the connections are dropped by Redis (which has higher idle timeout than FR). Still, the connection with the same number (in my case: "connection (2)") still fails to provide responses, even when I see them on the wire.

Backtrace from LLDB or GDB


g0tar avatar Sep 27 '25 15:09 g0tar

After some code reading I guess that:

  1. sticking to invalid connection, when happens, is a result of spread = 0 (picking connection released last, which apparently isn't restarted properly); enabling spread OTOH would mean bigger chance of hitting already closed connections, but at least this should rotate, so consecutive invocations might succeed. Unless the frequency of queries makes all connections invalid...
  2. There is no retry, because this applies to establishing connection, not query itself. If a query fails, it fails.
  3. Idle timeout is verified by manage_connections() -> connection_manage() only, not before using a connection. So if some overdue one is not evicted before, it might be selected for query.

g0tar avatar Sep 27 '25 17:09 g0tar

Which version is this? Looks like v3.2.x?

arr2036 avatar Sep 27 '25 17:09 arr2036

Yes, 3.2.7

g0tar avatar Sep 27 '25 17:09 g0tar

$ strace -f -enetwork radiusd -XX

Sat Sep 27 20:36:31 2025 : Debug: (2)     modsingle[authorize]: calling cache_dhcp_accept (rlm_cache)
Sat Sep 27 20:36:31 2025 : Debug: %{User-Name}
Sat Sep 27 20:36:31 2025 : Debug: Parsed xlat tree:
Sat Sep 27 20:36:31 2025 : Debug: attribute --> User-Name
Sat Sep 27 20:36:31 2025 : Debug: (2) cache_dhcp_accept: EXPAND %{User-Name}
Sat Sep 27 20:36:31 2025 : Debug: (2) cache_dhcp_accept:    --> anonymized
Sat Sep 27 20:36:31 2025 : Debug: rlm_cache (cache_dhcp_accept): Reserved connection (1)
sendto(11, "*2\r\n$3\r\nGET\r\n$17\r\nanonymized"..., 37, 0, NULL, 0) = 37
recvfrom(11, "", 16384, 0, NULL, NULL)  = 0
Sat Sep 27 20:36:31 2025 : ERROR: (2) cache_dhcp_accept: Failed retrieving entry for key "anonymized"
Sat Sep 27 20:36:31 2025 : Debug: rlm_cache (cache_dhcp_accept): Released connection (1)
Sat Sep 27 20:36:31 2025 : Info: Need 2 more connections to reach min connections (5)
Sat Sep 27 20:36:31 2025 : Info: Need more connections to reach 3 spares
Sat Sep 27 20:36:31 2025 : Info: rlm_cache (cache_dhcp_accept): Opening additional connection (3), 1 of 7 pending slots used
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 13
connect(13, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
connect(13, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
setsockopt(13, SOL_TCP, TCP_NODELAY, [1], 4) = 0
sendto(13, "*2\r\n$4\r\nAUTH\r\n$24\r\nMY_SECRET_PASSWORD_GOES_HERE"..., 45, 0, NULL, 0) = 45
recvfrom(13, "+OK\r\n", 16384, 0, NULL, NULL) = 5
Sat Sep 27 20:36:31 2025 : Debug: rlm_cache (cache_dhcp_accept): You probably need to lower "min"
Sat Sep 27 20:36:31 2025 : Info: rlm_cache (cache_dhcp_accept): Closing expired connection (2) - Hit idle_timeout limit
Sat Sep 27 20:36:31 2025 : Debug: rlm_cache (cache_dhcp_accept): You probably need to lower "min"
Sat Sep 27 20:36:31 2025 : Info: rlm_cache (cache_dhcp_accept): Closing expired connection (0) - Hit idle_timeout limit
Sat Sep 27 20:36:31 2025 : Debug: (2)     modsingle[authorize]: returned from cache_dhcp_accept (rlm_cache)
Sat Sep 27 20:36:31 2025 : Debug: (2)     [cache_dhcp_accept] = fail
Sat Sep 27 20:36:31 2025 : Debug: (2)     if (ok) {
Sat Sep 27 20:36:31 2025 : Debug: (2)     if (ok)  -> FALSE
Sat Sep 27 20:36:31 2025 : Debug: (2)     update control {
Sat Sep 27 20:36:31 2025 : Debug: (2)       &Cache-Merge := no
Sat Sep 27 20:36:31 2025 : Debug: (2)     } # update control = noop
[...]
Sat Sep 27 20:36:31 2025 : Debug: (2)       modsingle[post-auth]: calling cache_dhcp_accept (rlm_cache)
Sat Sep 27 20:36:31 2025 : Debug: %{User-Name}
Sat Sep 27 20:36:31 2025 : Debug: Parsed xlat tree:
Sat Sep 27 20:36:31 2025 : Debug: attribute --> User-Name
Sat Sep 27 20:36:31 2025 : Debug: (2) cache_dhcp_accept: EXPAND %{User-Name}
Sat Sep 27 20:36:31 2025 : Debug: (2) cache_dhcp_accept:    --> anonymized
Sat Sep 27 20:36:31 2025 : Debug: rlm_cache (cache_dhcp_accept): Reserved connection (1)
Sat Sep 27 20:36:31 2025 : ERROR: (2) cache_dhcp_accept: Failed retrieving entry for key "anonymized"
Sat Sep 27 20:36:31 2025 : Debug: rlm_cache (cache_dhcp_accept): Released connection (1)
Sat Sep 27 20:36:31 2025 : Debug: (2)       modsingle[post-auth]: returned from cache_dhcp_accept (rlm_cache)
Sat Sep 27 20:36:31 2025 : Debug: (2)       [cache_dhcp_accept] = fail

There is no sendto() near second ERROR (post-auth with Cache-Merge := no, should save the data from SQL into Redis), when first cache query failed; only authorize section seems to do the talking (and fails, reading data with Cache-Read-Only := yes to skip SQL). Why is that?

Sat Sep 27 20:55:21 2025 : Debug: (12)     modsingle[authorize]: calling cache_dhcp_accept (rlm_cache)
Sat Sep 27 20:55:21 2025 : Debug: %{User-Name}
Sat Sep 27 20:55:21 2025 : Debug: Parsed xlat tree:
Sat Sep 27 20:55:21 2025 : Debug: attribute --> User-Name
Sat Sep 27 20:55:21 2025 : Debug: (12) cache_dhcp_accept: EXPAND %{User-Name}
Sat Sep 27 20:55:21 2025 : Debug: (12) cache_dhcp_accept:    --> anonymized
Sat Sep 27 20:55:21 2025 : Debug: rlm_cache (cache_dhcp_accept): Reserved connection (7)
sendto(18, "*2\r\n$3\r\nGET\r\n$17\r\nanonymized"..., 37, 0, NULL, 0) = 37
recvfrom(18, "", 16384, 0, NULL, NULL)  = 0
Sat Sep 27 20:55:21 2025 : ERROR: (12) cache_dhcp_accept: Failed retrieving entry for key "anonymized"
Sat Sep 27 20:55:21 2025 : Debug: rlm_cache (cache_dhcp_accept): Released connection (7)
Sat Sep 27 20:55:21 2025 : Debug: rlm_cache (cache_dhcp_accept): You probably need to lower "min"
Sat Sep 27 20:55:21 2025 : Info: rlm_cache (cache_dhcp_accept): Closing expired connection (6) - Hit idle_timeout limit
Sat Sep 27 20:55:21 2025 : Debug: rlm_cache (cache_dhcp_accept): You probably need to lower "min"
Sat Sep 27 20:55:21 2025 : Info: rlm_cache (cache_dhcp_accept): Closing expired connection (5) - Hit idle_timeout limit
Sat Sep 27 20:55:21 2025 : Debug: rlm_cache (cache_dhcp_accept): You probably need to lower "min"
Sat Sep 27 20:55:21 2025 : Info: rlm_cache (cache_dhcp_accept): Closing expired connection (4) - Hit idle_timeout limit
Sat Sep 27 20:55:21 2025 : Debug: rlm_cache (cache_dhcp_accept): You probably need to lower "min"
Sat Sep 27 20:55:21 2025 : Info: rlm_cache (cache_dhcp_accept): Closing expired connection (3) - Hit idle_timeout limit
Sat Sep 27 20:55:21 2025 : Debug: (12)     modsingle[authorize]: returned from cache_dhcp_accept (rlm_cache)
Sat Sep 27 20:55:21 2025 : Debug: (12)     [cache_dhcp_accept] = fail
[...]
Sat Sep 27 20:55:21 2025 : Debug: (12)       modsingle[post-auth]: calling cache_dhcp_accept (rlm_cache)
Sat Sep 27 20:55:21 2025 : Debug: %{User-Name}
Sat Sep 27 20:55:21 2025 : Debug: Parsed xlat tree:
Sat Sep 27 20:55:21 2025 : Debug: attribute --> User-Name
Sat Sep 27 20:55:21 2025 : Debug: (12) cache_dhcp_accept: EXPAND %{User-Name}
Sat Sep 27 20:55:21 2025 : Debug: (12) cache_dhcp_accept:    --> anonymized
Sat Sep 27 20:55:21 2025 : Debug: rlm_cache (cache_dhcp_accept): Reserved connection (7)
Sat Sep 27 20:55:21 2025 : ERROR: (12) cache_dhcp_accept: Failed retrieving entry for key "anonymized"
Sat Sep 27 20:55:21 2025 : Debug: rlm_cache (cache_dhcp_accept): Released connection (7)
Sat Sep 27 20:55:21 2025 : Debug: (12)       modsingle[post-auth]: returned from cache_dhcp_accept (rlm_cache)

So with low query rate (below 1 per Redis' timeout) all queries fall back to SQL. Kind a harmless, as long as cache is only caching, however I got third step with persistent Redis as a fallback for SQL unavailability...

g0tar avatar Sep 27 '25 19:09 g0tar

Idle timeout is verified by manage_connections() -> connection_manage() only, not before using a connection. So if some overdue one is not evicted before, it might be selected for query.

I don't see that.

fr_connection_get_internal() gets a free connection from the pool. It explicitly calls fr_connection_manage() on each connection that it finds in the pool. If the connection is idle for too long, it's closed. And fr_connection_manage() returns 0 for "this is not a valid connection"

It looks like the real problem is in the rlm_cache_redis driver. The cache_entry_find() function calls redisCommand(), which can return NULL if the connection is dead. But the module doesn't close the connection when that happens. Instead, it returns an error, and leaves the connection as active.

Try the following patch, and see if that helps. There are a few cases where redisCommand() returns NULL, and the connection is left open.

If the patch works, we'll push it to v3.2.x.

patch.gz

alandekok avatar Sep 30 '25 16:09 alandekok

Thanks for looking into this. I've applied your patch on top of 3.2.7 and indeed, it seems the behavior is much more consistent with my expectations (no reuse of failed/not-restarted connections) now, however the solution is apparently not complete:

$ while sleep 15; do  radtest "anonymized" '' 127.0.0.1 0 testing123 ; done

$ radiusd -X | grep -iE 'error|cache'

(0)       &Cache-Read-Only := yes
(0) cache_dhcp_accept: EXPAND %{User-Name}
(0) cache_dhcp_accept:    --> anonymized
rlm_cache (cache_dhcp_accept): 0 of 0 connections in use.  You  may need to increase "spare"
rlm_cache (cache_dhcp_accept): Opening additional connection (0), 1 of 10 pending slots used
rlm_cache (cache_dhcp_accept): Reserved connection (0)
(0) cache_dhcp_accept: Found entry for "anonymized"
(0) cache_dhcp_accept: Merging cache entry into request
(0) cache_dhcp_accept:   &reply:Framed-IP-Address = 10.11.12.16
rlm_cache (cache_dhcp_accept): Released connection (0)
rlm_cache (cache_dhcp_accept): Opening additional connection (1), 1 of 9 pending slots used
(0)     [cache_dhcp_accept] = ok
(0)         &SG-Service-Cache := Off
(0)     if (!&control:SG-Service-Cache) {
(0)     if (!&control:SG-Service-Cache)  -> FALSE

(1)       &Cache-Read-Only := yes
(1) cache_dhcp_accept: EXPAND %{User-Name}
(1) cache_dhcp_accept:    --> anonymized
rlm_cache (cache_dhcp_accept): Reserved connection (0)
(1) cache_dhcp_accept: ERROR: Failed connecting to Redis for key "anonymized" - ]?
(1)     [cache_dhcp_accept] = fail
(1)       &Cache-Merge := no
(1)     if (!&control:SG-Service-Cache) {
(1)     if (!&control:SG-Service-Cache)  -> TRUE
(1)     if (!&control:SG-Service-Cache)  {
(1) cache_dhcp_accept: EXPAND %{User-Name}
(1) cache_dhcp_accept:    --> anonymized
rlm_cache (cache_dhcp_accept): Reserved connection (1)
(1) cache_dhcp_accept: ERROR: Failed connecting to Redis for key "anonymized" - ]?
(1)       [cache_dhcp_accept] = fail
(1)     } # if (!&control:SG-Service-Cache)  = ok

(2)       &Cache-Read-Only := yes
(2) cache_dhcp_accept: EXPAND %{User-Name}
(2) cache_dhcp_accept:    --> anonymized
rlm_cache (cache_dhcp_accept): 2 of 2 connections in use.  You  may need to increase "spare"
rlm_cache (cache_dhcp_accept): Opening additional connection (2), 1 of 8 pending slots used
rlm_cache (cache_dhcp_accept): Reserved connection (2)
(2) cache_dhcp_accept: Found entry for "anonymized"
(2) cache_dhcp_accept: Merging cache entry into request
(2) cache_dhcp_accept:   &reply:Framed-IP-Address = 10.11.12.16
rlm_cache (cache_dhcp_accept): Released connection (2)
rlm_cache (cache_dhcp_accept): Opening additional connection (3), 1 of 7 pending slots used
(2)     [cache_dhcp_accept] = ok
(2)         &SG-Service-Cache := Off
(2)     if (!&control:SG-Service-Cache) {
(2)     if (!&control:SG-Service-Cache)  -> FALSE

(3)       &Cache-Read-Only := yes
(3) cache_dhcp_accept: EXPAND %{User-Name}
(3) cache_dhcp_accept:    --> anonymized
rlm_cache (cache_dhcp_accept): Reserved connection (2)
(3) cache_dhcp_accept: ERROR: Failed connecting to Redis for key "anonymized" - ]?
(3)     [cache_dhcp_accept] = fail
(3)       &Cache-Merge := no
(3)     if (!&control:SG-Service-Cache) {
(3)     if (!&control:SG-Service-Cache)  -> TRUE
(3)     if (!&control:SG-Service-Cache)  {
(3) cache_dhcp_accept: EXPAND %{User-Name}
(3) cache_dhcp_accept:    --> anonymized
rlm_cache (cache_dhcp_accept): Reserved connection (3)
(3) cache_dhcp_accept: ERROR: Failed connecting to Redis for key "anonymized" - ]?
(3)       [cache_dhcp_accept] = fail
(3)     } # if (!&control:SG-Service-Cache)  = ok

(4)       &Cache-Read-Only := yes
(4) cache_dhcp_accept: EXPAND %{User-Name}
(4) cache_dhcp_accept:    --> anonymized
rlm_cache (cache_dhcp_accept): 4 of 4 connections in use.  You  may need to increase "spare"
rlm_cache (cache_dhcp_accept): Opening additional connection (4), 1 of 6 pending slots used
rlm_cache (cache_dhcp_accept): Reserved connection (4)
(4) cache_dhcp_accept: Found entry for "anonymized"
(4) cache_dhcp_accept: Merging cache entry into request
(4) cache_dhcp_accept:   &reply:Framed-IP-Address = 10.11.12.16
rlm_cache (cache_dhcp_accept): Released connection (4)
rlm_cache (cache_dhcp_accept): Opening additional connection (5), 1 of 5 pending slots used
(4)     [cache_dhcp_accept] = ok
(4)         &SG-Service-Cache := Off
(4)     if (!&control:SG-Service-Cache) {
(4)     if (!&control:SG-Service-Cache)  -> FALSE

(5)       &Cache-Read-Only := yes
(5) cache_dhcp_accept: EXPAND %{User-Name}
(5) cache_dhcp_accept:    --> anonymized
rlm_cache (cache_dhcp_accept): Reserved connection (4)
(5) cache_dhcp_accept: ERROR: Failed connecting to Redis for key "anonymized" - ]?
(5)     [cache_dhcp_accept] = fail
(5)       &Cache-Merge := no
(5)     if (!&control:SG-Service-Cache) {
(5)     if (!&control:SG-Service-Cache)  -> TRUE
(5)     if (!&control:SG-Service-Cache)  {
(5) cache_dhcp_accept: EXPAND %{User-Name}
(5) cache_dhcp_accept:    --> anonymized
rlm_cache (cache_dhcp_accept): Reserved connection (5)
(5) cache_dhcp_accept: ERROR: Failed connecting to Redis for key "anonymized" - ]?
(5)       [cache_dhcp_accept] = fail
(5)     } # if (!&control:SG-Service-Cache)  = ok

(6)       &Cache-Read-Only := yes
(6) cache_dhcp_accept: EXPAND %{User-Name}
(6) cache_dhcp_accept:    --> anonymized
rlm_cache (cache_dhcp_accept): 6 of 6 connections in use.  You  may need to increase "spare"
rlm_cache (cache_dhcp_accept): Opening additional connection (6), 1 of 4 pending slots used
rlm_cache (cache_dhcp_accept): Reserved connection (6)
(6) cache_dhcp_accept: Found entry for "anonymized"
(6) cache_dhcp_accept: Merging cache entry into request
(6) cache_dhcp_accept:   &reply:Framed-IP-Address = 10.11.12.16
rlm_cache (cache_dhcp_accept): Released connection (6)
rlm_cache (cache_dhcp_accept): Opening additional connection (7), 1 of 3 pending slots used
(6)     [cache_dhcp_accept] = ok
(6)         &SG-Service-Cache := Off
(6)     if (!&control:SG-Service-Cache) {
(6)     if (!&control:SG-Service-Cache)  -> FALSE

(7)       &Cache-Read-Only := yes
(7) cache_dhcp_accept: EXPAND %{User-Name}
(7) cache_dhcp_accept:    --> anonymized
rlm_cache (cache_dhcp_accept): Reserved connection (6)
(7) cache_dhcp_accept: ERROR: Failed connecting to Redis for key "anonymized" - ]?
(7)     [cache_dhcp_accept] = fail
(7)       &Cache-Merge := no
(7)     if (!&control:SG-Service-Cache) {
(7)     if (!&control:SG-Service-Cache)  -> TRUE
(7)     if (!&control:SG-Service-Cache)  {
(7) cache_dhcp_accept: EXPAND %{User-Name}
(7) cache_dhcp_accept:    --> anonymized
rlm_cache (cache_dhcp_accept): Reserved connection (7)
(7) cache_dhcp_accept: ERROR: Failed connecting to Redis for key "anonymized" - ]?
(7)       [cache_dhcp_accept] = fail
(7)     } # if (!&control:SG-Service-Cache)  = ok

(8)       &Cache-Read-Only := yes
(8) cache_dhcp_accept: EXPAND %{User-Name}
(8) cache_dhcp_accept:    --> anonymized
rlm_cache (cache_dhcp_accept): 8 of 8 connections in use.  You  may need to increase "spare"
rlm_cache (cache_dhcp_accept): Opening additional connection (8), 1 of 2 pending slots used
rlm_cache (cache_dhcp_accept): Reserved connection (8)
(8) cache_dhcp_accept: Found entry for "anonymized"
(8) cache_dhcp_accept: Merging cache entry into request
(8) cache_dhcp_accept:   &reply:Framed-IP-Address = 10.11.12.16
rlm_cache (cache_dhcp_accept): Released connection (8)
rlm_cache (cache_dhcp_accept): Opening additional connection (9), 1 of 1 pending slots used
(8)     [cache_dhcp_accept] = ok
(8)         &SG-Service-Cache := Off
(8)     if (!&control:SG-Service-Cache) {
(8)     if (!&control:SG-Service-Cache)  -> FALSE

(9)       &Cache-Read-Only := yes
(9) cache_dhcp_accept: EXPAND %{User-Name}
(9) cache_dhcp_accept:    --> anonymized
rlm_cache (cache_dhcp_accept): Reserved connection (8)
(9) cache_dhcp_accept: ERROR: Failed connecting to Redis for key "anonymized" - ]?
(9)     [cache_dhcp_accept] = fail
(9)       &Cache-Merge := no
(9)     if (!&control:SG-Service-Cache) {
(9)     if (!&control:SG-Service-Cache)  -> TRUE
(9)     if (!&control:SG-Service-Cache)  {
(9) cache_dhcp_accept: EXPAND %{User-Name}
(9) cache_dhcp_accept:    --> anonymized
rlm_cache (cache_dhcp_accept): Reserved connection (9)
(9) cache_dhcp_accept: ERROR: Failed connecting to Redis for key "anonymized" - ]?
(9)       [cache_dhcp_accept] = fail
(9)     } # if (!&control:SG-Service-Cache)  = ok

(10)       &Cache-Read-Only := yes
(10) cache_dhcp_accept: EXPAND %{User-Name}
(10) cache_dhcp_accept:    --> anonymized
rlm_cache (cache_dhcp_accept): No connections available and at max connection limit
(10)     [cache_dhcp_accept] = fail
(10)       &Cache-Merge := no
(10)     if (!&control:SG-Service-Cache) {
(10)     if (!&control:SG-Service-Cache)  -> TRUE
(10)     if (!&control:SG-Service-Cache)  {
(10) cache_dhcp_accept: EXPAND %{User-Name}
(10) cache_dhcp_accept:    --> anonymized
rlm_cache (cache_dhcp_accept): No connections available and at max connection limit
(10)       [cache_dhcp_accept] = fail
(10)     } # if (!&control:SG-Service-Cache)  = ok

(11)       &Cache-Read-Only := yes
(11) cache_dhcp_accept: EXPAND %{User-Name}
(11) cache_dhcp_accept:    --> anonymized
rlm_cache (cache_dhcp_accept): No connections available and at max connection limit
(11)     [cache_dhcp_accept] = fail
(11)       &Cache-Merge := no
(11)     if (!&control:SG-Service-Cache) {
(11)     if (!&control:SG-Service-Cache)  -> TRUE
(11)     if (!&control:SG-Service-Cache)  {
(11) cache_dhcp_accept: EXPAND %{User-Name}
(11) cache_dhcp_accept:    --> anonymized
rlm_cache (cache_dhcp_accept): No connections available and at max connection limit
(11)       [cache_dhcp_accept] = fail
(11)     } # if (!&control:SG-Service-Cache)  = ok
^C

So I've hit No connections available and at max connection limit condition while there should be no active connections (all dropped after failing). This tested only the 1st chunk of the patch (the Failed connecting to Redis for key codepath) (while second chunk includes talloc_free(pool);)

g0tar avatar Oct 01 '25 15:10 g0tar

It looks like nothing is calling fr_connection_close(). I'll try a patch shortly

alandekok avatar Oct 01 '25 16:10 alandekok

Please try this patch, which should close the broken connection, and try to reconnect it.

patch.gz

alandekok avatar Oct 01 '25 18:10 alandekok

I had to use c->key instead of key in 2nd chunk, but now it looks really good:) I got test loop running for 10 minutes and everything is being handled properly.

g0tar avatar Oct 01 '25 19:10 g0tar

I'm pushing the limits and all broken connections are reconnected. One question though - assuming all connections are being broken and reconnected, is this the expected loop (under extremely short timeouts configured on purpose for this test):

rlm_cache (cache_dhcp_accept): Reserved connection (8)
(148) cache_dhcp_accept: Reconnecting...
rlm_cache (cache_dhcp_accept): Reconnecting (8)
rlm_cache (cache_dhcp_accept): Released connection (8)

rlm_cache (cache_dhcp_accept): Reserved connection (12)
(149) cache_dhcp_accept: Reconnecting...
rlm_cache (cache_dhcp_accept): Reconnecting (12)
rlm_cache (cache_dhcp_accept): Released connection (12)

rlm_cache (cache_dhcp_accept): Reserved connection (10)
(150) cache_dhcp_accept: Reconnecting...
rlm_cache (cache_dhcp_accept): Reconnecting (10)
rlm_cache (cache_dhcp_accept): Released connection (10)

rlm_cache (cache_dhcp_accept): Reserved connection (9)
(151) cache_dhcp_accept: Reconnecting...
rlm_cache (cache_dhcp_accept): Reconnecting (9)
rlm_cache (cache_dhcp_accept): Released connection (9)

rlm_cache (cache_dhcp_accept): Reserved connection (11)
(152) cache_dhcp_accept: Reconnecting...
rlm_cache (cache_dhcp_accept): Reconnecting (11)
rlm_cache (cache_dhcp_accept): Released connection (11)

rlm_cache (cache_dhcp_accept): Reserved connection (8)
(153) cache_dhcp_accept: Reconnecting...
rlm_cache (cache_dhcp_accept): Reconnecting (8)
rlm_cache (cache_dhcp_accept): Released connection (8)

rlm_cache (cache_dhcp_accept): Reserved connection (12)
(154) cache_dhcp_accept: Reconnecting...
rlm_cache (cache_dhcp_accept): Reconnecting (12)
rlm_cache (cache_dhcp_accept): Released connection (12)

rlm_cache (cache_dhcp_accept): Reserved connection (10)
(155) cache_dhcp_accept: Reconnecting...
rlm_cache (cache_dhcp_accept): Reconnecting (10)
rlm_cache (cache_dhcp_accept): Released connection (10)

rlm_cache (cache_dhcp_accept): Reserved connection (9)
(156) cache_dhcp_accept: Reconnecting...
rlm_cache (cache_dhcp_accept): Reconnecting (9)
rlm_cache (cache_dhcp_accept): Released connection (9)

rlm_cache (cache_dhcp_accept): Reserved connection (11)
(157) cache_dhcp_accept: Reconnecting...
rlm_cache (cache_dhcp_accept): Reconnecting (11)
rlm_cache (cache_dhcp_accept): Released connection (11)

rlm_cache (cache_dhcp_accept): Reserved connection (8)
(158) cache_dhcp_accept: Reconnecting...
rlm_cache (cache_dhcp_accept): Reconnecting (8)
rlm_cache (cache_dhcp_accept): Released connection (8)

rlm_cache (cache_dhcp_accept): Reserved connection (12)
(159) cache_dhcp_accept: Reconnecting...
rlm_cache (cache_dhcp_accept): Reconnecting (12)
rlm_cache (cache_dhcp_accept): Released connection (12)

rlm_cache (cache_dhcp_accept): Reserved connection (10)
(160) cache_dhcp_accept: Reconnecting...
rlm_cache (cache_dhcp_accept): Reconnecting (10)
rlm_cache (cache_dhcp_accept): Released connection (10)

rlm_cache (cache_dhcp_accept): Reserved connection (9)
(161) cache_dhcp_accept: Reconnecting...
rlm_cache (cache_dhcp_accept): Reconnecting (9)
rlm_cache (cache_dhcp_accept): Released connection (9)

[...]
rlm_cache (cache_dhcp_accept): Reserved connection (11)
(177) cache_dhcp_accept: Reconnecting...
rlm_cache (cache_dhcp_accept): Reconnecting (11)
rlm_cache (cache_dhcp_accept): Released connection (11)

i.e. (in this case) going over 8 -> 12 -> 10 -> 9 -> 11 sequence repeatedly (having min = 5)? With spread = 0 I see the last_reserved_cmp() being used (otherwise last_released_cmp()), so if I understand this correctly, this confirms that connections are being properly reinitialized.

If so, it looks to me to solve both issues I've reported here. Thank you for the fix!

Oh, and just to make this subject complete - idle connections could be proactively tested (and L5-keepalived) via redis PING command.

g0tar avatar Oct 01 '25 19:10 g0tar